Tutorial: Let's build a Compiler! - Part XVI: Unit Construction

Jon A. Lambert jlsysinc at ix.netcom.com
Wed Mar 4 23:23:53 CET 1998


                       LET'S BUILD A COMPILER! 
 
                                 By 
 
                       Jack W. Crenshaw, Ph.D. 
 
                            29 May, 1995 
 
                     Part 16: UNIT CONSTRUCTION 
 


***************************************************************** 
*                                                               * 
*                        COPYRIGHT NOTICE                       * 
*                                                               * 
*   Copyright (C) 1995 Jack W. Crenshaw. All rights reserved.   * 
*                                                               * 
***************************************************************** 
 
INTRODUCTION 
 
This series of tutorials promises to be perhaps one of the longest-
running mini-series in history, rivalled only by the delay in Volume IV 
of Knuth.  Begun in 1988, the series ran into a four-year hiatus in 1990 
when the "cares of this world," changes in priorities and interests, and 
the need to make a living seemed to stall it out after Installment 14.  
Those of you with loads of patience were finally rewarded, in the spring 
of last year, with the long-awaited Installment 15.  In it, I began to 
try to steer the series back on track, and in the process, to make it 
easier to continue on to the goal, which is to provide you with not only 
enough understanding of the difficult subject of compiler theory, but 
also enough tools, in the form of canned subroutines and concepts, so 
that you would be able to continue on your own and become proficient 
enough to build your own parsers and translators.  Because of that long 
hiatus, I thought it appropriate to go back and review the concepts we 
have covered so far, and to redo some of the software, as well.  In the 
past, we've never concerned ourselves much with the development of 
production-quality software tools ... after all, I was trying to teach 
(and learn) concepts, not production practice.  To do that, I tended to 
give you, not complete compilers or parsers, but only those snippets of 
code that illustrated the particular point we were considering at the 
moment. 
 
I still believe that's a good way to learn any subject; no one wants to 
have to make changes to 100,000 line programs just to try out a new 
idea.  But the idea of just dealing with code snippets, rather than 
complete programs, also has its drawbacks in that we often seemed to be 
writing the same code fragments over and over.  Although repetition has 
been thoroughly proven to be a good way to learn new ideas, it's also 
true that one can have too much of a good thing.  By the time I had 
completed Installment 14 I seemed to have reached the limits of my 
abilities to juggle multiple files and multiple versions of the same 
software functions.  Who knows, perhaps that's one reason I seemed to 
have run out of gas at that point. 
 
Fortunately, the later versions of Borland's Turbo Pascal allow us to 
have our cake and eat it too.  By using their concept of separately 
compilable units, we can still write small subroutines and functions, 
and keep our main programs and test programs small and simple.  But, 
once written, the code in the Pascal units will always be there for us 
to use, and linking them in is totally painless and transparent. 
 
Since, by now, most of you are programming in either C or C++, I know 
what you're thinking:  Borland, with their Turbo Pascal (TP), certainly 
didn't invent the concept of separately compilable modules.  And of 
course you're right.  But if you've not used TP lately, or ever, you may 
not realize just how painless the whole process is.  Even in C or C++, 
you still have to build a make file, either manually or by telling the 
compiler how to do so.  You must also list, using "extern" statements or 
header files, the functions you want to import.  In TP, you don't even 
have to do that.  You need only name the units you wish to use, and all 
of their procedures automatically become available.   
 
 
It's not my intention to get into a language-war debate here, so I won't 
pursue the subject any further.  Even I no longer use Pascal on my job 
... I use C at work and C++ for my articles in Embedded Systems 
Programming and other magazines.  Believe me, when I set out to 
resurrect this series, I thought long and hard about switching both 
languages and target systems to the ones that we're all using these 
days, C/C++ and PC architecture, and possibly object-oriented methods as 
well.  In the end, I felt it would cause more confusion than the hiatus 
itself has. And after all, Pascal still remains one of the best possible 
languages for teaching, not to mention production programming.  Finally, 
TP still compiles at the speed of light, much faster than competing 
C/C++ compilers. And Borland's smart linker, used in TP but not in their 
C++ products, is second to none.  Aside from being much faster than 
Microsoft-compatible linkers, the Borland smart linker will cull unused 
procedures and data items, even to the extent of trimming them out of 
defined objects if they're not needed.  For one of the few times in our 
lives, we don't have to compromise between completeness and efficiency.  
When we're writing a TP unit, we can make it as complete as we like, 
including any member functions and data items we may think we will ever 
need, confident that doing so will not create unwanted bloat in the 
compiled and linked executable. 
 
The point, really, is simply this:  By using TP's unit mechanism, we can 
have all the advantages and convenience of writing small, seemingly 
stand-alone test programs, without having to constantly rewrite the 
support functions that we need.  Once written, the TP units sit there, 
quietly waiting to do their duty and give us the support we need, when 
we need it. 
 
Using this principle, in Installment 15 I set out to minimize our 
tendency to re-invent the wheel by organizing  our code into separate 
Turbo Pascal units, each containing different parts of the compiler.  We 
ended up with the following units: 
 
* Input 
* Output 
* Errors 
* Scanner 
* Parser 
* CodeGen 
 
Each of these units serves a different function, and encapsulates 
specific areas of functionality.  The Input and Output units, as their 
name implies, provide character stream I/O and the all-important 
lookahead character upon which our predictive parser is based.  The 
Errors unit, of course, provides standard error handling.  The Scanner 
unit contains all of our boolean functions such as IsAlpha, and the 
routines GetName and GetNumber, which process multi-character tokens. 
 
The two units we'll be working with the most, and the ones that most 
represent the personality of our compiler, are Parser and CodeGen.  
Theoretically, the Parser unit should encapsulate all aspects of the 
compiler that depend on the syntax of the compiled language (though, as 
we saw last time, a small amount of this syntax spills over into 
Scanner).  Similarly, the code generator unit, CodeGen, contains all of 
the code dependent upon the target machine.  In this installment, we'll 
be continuing with the development of the functions in these two all-
important units. 
 
 


JUST LIKE CLASSICAL? 
 
Before we proceed, however, I think I should clarify the relationship  
between, and the functionality of these units.  Those of you who are 
familiar with compiler theory as taught in universities will, of course, 
recognize the names, Scanner, Parser, and CodeGen, all of which are 
components of a classical compiler implementation.  You may be thinking 
that I've abandoned my commitment to the KISS philosophy, and drifted 
towards a more conventional architecture than we once had.  A closer 
look, however, should convince you that, while the names are similar, 
the functionalities are quite different. 
 
Together, the scanner and parser of a classical implementation comprise 
the so-called "front end," and the code generator, the back end.  The 
front end routines process the language-dependent, syntax-related 
aspects of the source language, while the code generator, or back end, 
deals with the target machine-dependent parts of the problem.  In 
classical compilers, the two ends communicate via a file of instructions 
written in an intermediate language (IL). 
 
Typically, a classical scanner is a single procedure, operating as a co-
procedure with the parser.  It "tokenizes" the source file, reading it 
character by character, recognizing language elements, translating them 
into tokens, and passing them along to the parser.  You can think of the 
parser as an abstract machine, executing "op codes," which are the 
tokens.  Similarly, the parser generates op codes of a second abstract 
machine, which mechanizes the IL.  Typically, the IL file is written to 
disk by the parser, and read back again by the code generator. 
 
Our organization is quite different.  We have no lexical scanner, in the 
classical sense;  our unit Scanner, though it has a similar name, is not 
a single procedure or co-procedure, but merely a set of separate 
subroutines which are called by the parser as needed.  
 
Similarly, the classical code generator, the back end,  is a translator 
in its own right, reading an IL "source" file, and emitting an object 
file.  Our code generator doesn't work that way.  In our compiler, there 
IS no intermediate language; every construct in the source language 
syntax is converted into assembly language as it is recognized by the 
parser.  Like Scanner, the unit CodeGen consists of individual 
procedures which are called by the parser as needed. 
 
This "code 'em as you find 'em" philosophy may not produce the world's 
most efficient code -- for example, we haven't provided (yet!) a 
convenient place for an optimizer to work its magic -- but it sure does 
simplify the compiler, doesn't it? 
 
And that observation prompts me to reflect, once again, on how we have 
managed to reduce a compiler's functions to such comparatively simple 
terms.  I've waxed eloquent on this subject in past installments, so I 
won't belabor the point too much here.  However, because of the time 
that's elapsed since those last soliloquies, I hope you'll grant me just 
a little time to remind myself, as well as you, how we got here.  We got 
here by applying several principles that writers of commercial compilers 
seldom have the luxury of using.  These are: 
 
o The KISS philosophy -- Never do things the hard way without a 
reason 
 
o Lazy coding -- Never put off until tomorrow what you can put 
of forever (with credits to P.J. Plauger) 
 
o Skepticism -- Stubborn refusal to do something just because 
that's the way it's always been done. 
 
o Acceptance of inefficient code 
 
o Rejection of arbitrary constraints 
 
As I've reviewed the history of compiler construction, I've learned that 
virtually every production compiler in history has suffered from pre-
imposed conditions that strongly influenced its design. The original 
FORTRAN compiler of John Backus, et al, had to compete with assembly 
language, and therefore was constrained to produce extremely efficient 
code.  The IBM compilers for the minicomputers of the 70's had to run in 
the very small RAM memories then available -- as small as 4k.  The early 
Ada compiler had to compile itself.  Per Brinch Hansen decreed that his 
Pascal compiler developed for the IBM PC must execute in a 64k machine.  
Compilers developed in Computer Science courses had to compile the 
widest variety of languages, and therefore required LALR parsers. 
 
In each of these cases, these preconceived constraints literally 
dominated the design of the compiler.  
 
A good example is Brinch Hansen's compiler, described in his excellent 
book, "Brinch Hansen on Pascal Compilers" (highly recommended).  Though 
his compiler is one of the most clear and un-obscure compiler 
implementations I've seen, that one decision, to compile large files in 
a small RAM, totally drives the design, and he ends up with not just 
one, but many intermediate files, together with the drivers to write and 
read them. 
 
In time, the architectures resulting from such decisions have found 
their way into computer science lore as articles of faith. In this one 
man's opinion, it's time that they were re-examined critically.  The 
conditions, environments, and requirements that led to classical 
architectures are not the same as the ones we have today.  There's no 
reason to believe the solutions should be the same, either. 
 
In this tutorial, we've followed the leads of such pioneers in the world 
of small compilers for Pcs as Leor Zolman, Ron Cain, and James Hendrix, 
who didn't know enough compiler theory to know that they "couldn't do it 
that way."  We have resolutely refused to accept arbitrary constraints, 
but rather have done whatever was easy.  As a result, we have evolved an 
architecture that, while quite different from the classical one, gets 
the job done in very simple and straightforward fashion. 
 
I'll end this philosophizing with an observation re the notion of an 
intermediate language.  While I've noted before that we don't have one 
in our compiler, that's not exactly true; we _DO_ have one, or at least 
are evolving one, in the sense that we are defining code generation 
functions for the parser to call.  In essence, every call to a code 
generation procedure can be thought of as an instruction in an 
intermediate language.  Should we ever find it necessary to formalize an 
intermediate language, this is the way we would do it:  emit codes from 
the parser, each representing a call to one of the code generator 
procedures, and then process each code by calling those procedures in a 
separate pass, implemented in a back end. Frankly, I don't see that 
we'll ever find a need for this approach, but there is the connection, 
if you choose to follow it, between the classical and the current 
approaches. 
 


FLESHING OUT THE PARSER 
 
Though I promised you, somewhere along about Installment 14, that we'd 
never again write every single function from scratch, I ended up 
starting to do just that in Installment 15.  One reason: that long 
hiatus between the two installments made a review seem eminently 
justified ... even imperative, both for you and for me. More 
importantly, the decision to collect the procedures into modules 
(units), forced us to look at each one yet again, whether we wanted to 
or not.  And, finally and frankly, I've had some new ideas in the last 
four years that warranted a fresh look at some old friends.  When I 
first began this series, I was frankly amazed, and pleased, to learn 
just how simple parsing routines can be made.  But this last time 
around, I've surprised myself yet again, and been able to make them just 
that last little bit simpler, yet. 
 
Still, because of this total rewrite of the parsing modules, I was only 
able to include so much in the last installment.  Because of this, our 
hero, the parser, when last seen, was a shadow of his former self,  
consisting of only enough code to parse and process a factor consisting 
of either a variable or a constant.  The main effort of this current 
installment will be to help flesh out the parser to its former glory.  
In the process, I hope you'll bear with me if we sometimes cover ground 
we've long since been over and dealt with. 
 
First, let's take care of a problem that we've addressed before: Our 
current version of procedure Factor, as we left it in Installment 15,  
can't handle negative arguments.  To fix that, we'll introduce the 
procedure SignedFactor: 
 
 
{--------------------------------------------------------------} 
{ Parse and Translate a Factor with Optional Sign } 
 
procedure SignedFactor; 
var Sign: char; 
begin 
 Sign := Look; 
 if IsAddop(Look) then 
  GetChar; 
 Factor; 
 if Sign = '-' then Negate; 
end; 
{--------------------------------------------------------------}  
 
 
Note that this procedure calls a new code generation routine, Negate: 
 
 
{--------------------------------------------------------------} 
{ Negate Primary } 
 
procedure Negate; 
begin 
 EmitLn('NEG D0'); 
end; 
{--------------------------------------------------------------} 
 
 
(Here, and elsewhere in this series, I'm only going to show you the new 
routines. I'm counting on you to put them into the proper unit, which 
you should normally have no trouble identifying.  Don't forget to add 
the procedure's prototype to the interface section of the unit.) 
 
In the main program, simply change the procedure called from Factor to 
SignedFactor, and give the code a test.  Isn't it neat how the Turbo 
linker and make facility handle all the details? 
 
Yes, I know, the code isn't very efficient.  If we input a number, -3, 
the generated code is: 
 
 MOVE #3,D0 
 NEG D0 
 
which is really, really dumb.  We can do better, of course, by simply 
pre-appending a minus sign to the string passed to LoadConstant, but it 
adds a few lines of code to SignedFactor, and I'm applying the KISS 
philosophy very aggressively here. What's more,  to tell the truth, I 
think I'm subconsciously enjoying generating "really, really dumb" code, 
so I can have the pleasure of watching it get dramatically better when 
we get into optimization methods. 
 
Most of you have never heard of John Spray, so allow me to introduce him 
to you here.  John's from New Zealand, and used to teach computer 
science at one of its universities.  John wrote a compiler for the 
Motorola 6809, based on a delightful, Pascal-like language of his own 
design called "Whimsical."  He later ported the compiler to the 68000, 
and for awhile it was the only compiler I had for my homebrewed 68000 
system.   
 
For the record, one of my standard tests for any new compiler is to see 
how the compiler deals with a null program like: 
 
 program main; 
 begin 
 end. 
 
My test is to measure the time required to compile and link, and the 
size of the object file generated.  The undisputed _LOSER_ in the test 
is the DEC C compiler for the VAX, which took 60 seconds to compile, on 
a VAX 11/780, and generated a 50k object file.  John's compiler is the 
undisputed, once, future, and forever king in the code size department.  
Given the null program, Whimsical generates precisely two bytes of code, 
implementing the one instruction, 
 
 RET 
 
By setting a compiler option to generate an include file rather than a 
standalone program, John can even cut this size, from two bytes to zero!  
Sort of hard to beat a null object file, wouldn't you say? 
 
Needless to say, I consider John to be something of an expert on code 
optimization, and I like what he has to say: "The best way to optimize 
is not to have to optimize at all, but to produce good code in the first 
place." Words to live by.  When we get started on optimization, we'll 
follow John's advice, and our first step will not be to add a peephole 
optimizer or other after-the-fact device, but to improve the quality of 
the code emitted before optimization.  So make a note of SignedFactor as 
a good first candidate for attention, and for now we'll leave it be. 
 
TERMS AND EXPRESSIONS 
 
I'm sure you know what's coming next: We must, yet again, create the 
rest of the procedures that implement the recursive-descent parsing of 
an expression.  We all know that the hierarchy of procedures for 
arithmetic expressions is: 
 
expression 
 term 
  factor 
 
However, for now let's continue to do things one step at a time, 
and consider only expressions with additive terms in them.  The 
code to implement expressions, including a possibly signed first 
term, is shown next: 
 
 
{--------------------------------------------------------------} 
{ Parse and Translate an Expression } 
 
procedure Expression; 
begin 
 SignedFactor; 
 while IsAddop(Look) do 
  case Look of 
   '+': Add; 
   '-': Subtract; 
  end; 
end; 
{--------------------------------------------------------------} 
 
 
This procedure calls two other procedures to process the 
operations: 
 
 
{--------------------------------------------------------------} 
{ Parse and Translate an Addition Operation } 
 
procedure Add; 
begin 
 Match('+'); 
 Push; 
 Factor; 
 PopAdd; 
end; 
 
 
{--------------------------------------------------------------} 
{ Parse and Translate a Subtraction Operation } 
 
procedure Subtract; 
begin 
 Match('-'); 
 Push; 
 Factor; 
 PopSub; 
end; 
{--------------------------------------------------------------} 
 
 
The three procedures Push, PopAdd, and PopSub are new code generation 
routines.  As the name implies, procedure Push generates code to push 
the primary register (D0, in our 68000 implementation) to the stack.  
PopAdd and PopSub pop the top of the stack again, and add it to, or 
subtract it from, the primary register.  The code is shown next: 
 
 


{--------------------------------------------------------------} 
{ Push Primary to Stack } 
 
procedure Push; 
begin 
 EmitLn('MOVE D0,-(SP)'); 
end; 
 
{--------------------------------------------------------------} 
{ Add TOS to Primary } 
 
procedure PopAdd; 
begin 
 EmitLn('ADD (SP)+,D0'); 
end; 
 
{--------------------------------------------------------------} 
{ Subtract TOS from Primary } 
 
procedure PopSub; 
begin 
 EmitLn('SUB (SP)+,D0'); 
 Negate; 
end; 
{--------------------------------------------------------------} 
 
 
Add these routines to Parser and CodeGen, and change the main program to 
call Expression. Voila! 
 
The next step, of course, is to add the capability for dealing with 
multiplicative terms.  To that end, we'll add a procedure Term, and code 
generation procedures PopMul and PopDiv.  These code generation 
procedures are shown next: 
 
 
{--------------------------------------------------------------} 
{ Multiply TOS by Primary } 
 
procedure PopMul; 
begin 
 EmitLn('MULS (SP)+,D0'); 
end; 
 
{--------------------------------------------------------------} 
{ Divide Primary by TOS } 
 
procedure PopDiv; 
begin 
 EmitLn('MOVE (SP)+,D7'); 
 EmitLn('EXT.L D7'); 
 EmitLn('DIVS D0,D7'); 
 EmitLn('MOVE D7,D0'); 
end; 
{--------------------------------------------------------------} 
 
 
I admit, the division routine is a little busy, but there's no help for 
it.  Unfortunately, while the 68000 CPU allows a division using the top 
of stack (TOS), it wants the arguments in the wrong order, just as it 
does for subtraction.  So our only recourse is to pop the stack to a 
scratch register (D7), perform the division there, and then move the 
result back to our primary register, D0. Note the use of signed multiply 
and divide operations.  This follows an implied, but unstated, 
assumption, that all our variables will be signed 16-bit integers. This 
decision will come back to haunt us later, when we start looking at 
multiple data types, type conversions, etc. 
 
Our procedure Term is virtually a clone of Expression, and looks like 
this: 
 
 
{--------------------------------------------------------------} 
{ Parse and Translate a Term } 
 
procedure Term; 
begin 
 Factor; 
 while IsMulop(Look) do 
  case Look of 
   '*': Multiply; 
   '/': Divide; 
  end; 
end; 
{--------------------------------------------------------------} 
 
 
Our next step is to change some names.  SignedFactor now becomes 
SignedTerm, and the calls to Factor in Expression, Add, Subtract and 
SignedTerm get changed to call Term: 
 
 
{--------------------------------------------------------------} 
{ Parse and Translate a Term with Optional Leading Sign } 
 
procedure SignedTerm; 
var Sign: char; 
begin 
 Sign := Look; 
 if IsAddop(Look) then 
  GetChar; 
 Term; 
 if Sign = '-' then Negate; 
end; 
{--------------------------------------------------------------} 
... 
{--------------------------------------------------------------} 
{ Parse and Translate an Expression } 
 
procedure Expression; 
begin 
 SignedTerm; 
 while IsAddop(Look) do 
  case Look of 
   '+': Add; 
   '-': Subtract; 
  end; 
end; 
{--------------------------------------------------------------} 
 
 
If memory serves me correctly, we once had BOTH a procedure SignedFactor 
and a procedure SignedTerm. I had reasons for doing that at the time ... 
they had to do with the handling of Boolean algebra and, in particular, 
the Boolean "not" function.  But certainly, for arithmetic operations, 
that duplication isn't necessary.  In an expression like: 
 
 -x*y 
 
it's very apparent that the sign goes with the whole TERM, x*y, and not 
just the factor x, and that's the way Expression is coded.   
 
Test this new code by executing Main.  It still calls Expression, so you 
should now be able to deal with expressions containing any of the four 
arithmetic operators. 
 
Our last bit of business, as far as expressions goes, is to modify 
procedure Factor to allow for parenthetical expressions.  By using a 
recursive call to Expression, we can reduce the needed code to virtually 
nothing.  Five lines added to Factor do the job: 
 
 
{--------------------------------------------------------------} 
{ Parse and Translate a Factor } 
 
procedure Factor; 
begin 
 if Look ='(' then begin 
  Match('('); 
  Expression; 
  Match(')'); 
  end 
 else if IsDigit(Look) then 
  LoadConstant(GetNumber) 
 else if IsAlpha(Look)then 
  LoadVariable(GetName) 
 else 
  Error('Unrecognized character ' + Look); 
end; 
{--------------------------------------------------------------} 
 
 
At this point, your "compiler" should be able to handle any legal 
expression you can throw at it.  Better yet, it should reject all 
illegal ones! 
 
ASSIGNMENTS 
 
As long as we're this close, we might as well create the code to deal 
with an assignment statement.  This code needs only to remember the name 
of the target variable where we are to store the result of an 
expression, call Expression, then store the number.  The procedure is 
shown next: 
 
 
{--------------------------------------------------------------} 
{ Parse and Translate an Assignment Statement } 
 
procedure Assignment; 
var Name: string; 
begin 
 Name := GetName; 
 Match('='); 
 Expression; 
 StoreVariable(Name); 
end; 
{--------------------------------------------------------------} 
 
The assignment calls for yet another code generation routine: 
 


{--------------------------------------------------------------} 
{ Store the Primary Register to a Variable } 
 
procedure StoreVariable(Name: string); 
begin 
 EmitLn('LEA ' + Name + '(PC),A0'); 
 EmitLn('MOVE D0,(A0)'); 
end; 
{--------------------------------------------------------------} 
 
 
Now, change the call in Main to call Assignment, and you should see a 
full assignment statement being processed correctly.  Pretty neat, eh?  
And painless, too. 
 
In the past, we've always tried to show BNF relations to define the 
syntax we're developing. I haven't done that here, and it's high time I 
did.  Here's the BNF: 
 
 
<factor>      ::= <variable> | <constant> | '(' <expression> ')'  
<signed_term> ::= [<addop>] <term> 
<term>        ::= <factor> (<mulop> <factor>)*  
<expression>  ::= <signed_term> (<addop> <term>)* 
<assignment>  ::= <variable> '=' <expression> 
 
BOOLEANS 
 
The next step, as we've learned several times before, is to add Boolean 
algebra.  In the past, this step has at least doubled the amount of code 
we've had to write.  As I've gone over this step in my mind, I've found 
myself diverging more and more from what we did in previous 
installments.  To refresh your memory, I noted that Pascal treats the 
Boolean operators pretty much identically to the way it treats 
arithmetic ones.  A Boolean "and" has the same precedence level as 
multiplication, and the "or" as addition.  C, on the other hand, sets 
them at different precedence levels, and all told has a whopping 17 
levels.  In our earlier work, I chose something in between, with seven 
levels.  As a result, we ended up with things called Boolean 
expressions, paralleling in most details the arithmetic expressions, but 
at a different precedence level.  All of this, as it turned out, came 
about because I didn't like having to put parentheses around the Boolean 
expressions in statements like: 
 
      IF (c >= 'A') and (c <= 'Z') then ... 
 
In retrospect, that seems a pretty petty reason to add many layers of 
complexity to the parser.  Perhaps more to the point, I'm not sure I was 
even able to avoid the parens.   
 
For kicks, let's start anew, taking a more Pascal-ish approach, and just 
treat the Boolean operators at the same precedence level as the 
arithmetic ones. We'll see where it leads us.  If it seems to be down 
the garden path, we can always backtrack to the earlier approach. 
 
For starters, we'll add the "addition-level" operators to Expression. 
That's easily done; first, modify the function IsAddop in unit Scanner 
to include two extra operators: '|' for "or," and '~' for "exclusive 
or": 
 
 


{--------------------------------------------------------------} 
function IsAddop(c: char): boolean; 
begin 
 IsAddop := c in ['+','-', '|', '~']; 
end; 
{--------------------------------------------------------------} 
 
 
Next, we must include the parsing of the operators in procedure 
Expression: 
 
 
{--------------------------------------------------------------} 
procedure Expression; 
begin 
 SignedTerm; 
 while IsAddop(Look) do 
  case Look of 
   '+': Add; 
   '-': Subtract; 
   '|': _Or; 
   '~': _Xor; 
  end; 
{--------------------------------------------------------------} 
end; 
 
 
(The underscores are needed, of course, because "or" and "xor" are 
reserved words in Turbo Pascal.) 
 
Next, the procedures _Or and _Xor: 
 
 
{--------------------------------------------------------------} 
{ Parse and Translate a Subtraction Operation } 
 
procedure _Or; 
begin 
 Match('|'); 
 Push; 
 Term; 
 PopOr; 
end; 
 
{--------------------------------------------------------------} 
{ Parse and Translate a Subtraction Operation } 
 
procedure _Xor; 
begin 
 Match('~'); 
 Push; 
 Term; 
 PopXor; 
end; 
{--------------------------------------------------------------} 
 
And, finally, the new code generator procedures: 
 
 


{--------------------------------------------------------------} 
{ Or TOS with Primary } 
 
procedure PopOr; 
begin 
 EmitLn('OR (SP)+,D0'); 
end; 
 
{--------------------------------------------------------------} 
{ Exclusive-Or TOS with Primary } 
 
procedure PopXor; 
begin 
 EmitLn('EOR (SP)+,D0'); 
end; 
{--------------------------------------------------------------} 
 
Now, let's test the translator (you might want to change the call 
in Main back to a call to Expression, just to avoid having to type 
"x=" for an assignment every time). 
 
So far, so good.  The parser nicely handles expressions of the 
form: 
 
 x|y~z 
 
Unfortunately, it also does nothing to protect us from mixing 
Boolean and arithmetic algebra.  It will merrily generate code 
for: 
 
 (a+b)*(c~d) 
 
We've talked about this a bit, in the past.  In general the rules 
for what operations are legal or not cannot be enforced by the 
parser itself, because they are not part of the syntax of the 
language, but rather its semantics.  A compiler that doesn't allow 
mixed-mode expressions of this sort must recognize that c and d 
are Boolean variables, rather than numeric ones, and balk at 
multiplying them in the next step. But this "policing" can't be 
done by the parser; it must be handled somewhere between the 
parser and the code generator. We aren't in a position to enforce 
such rules yet, because we haven't got either a way of declaring 
types, or a symbol table to store the types in.  So, for what 
we've got to work with at the moment, the parser is doing 
precisely what it's supposed to do. 
 
Anyway, are we sure that we DON'T want to allow mixed-type 
operations?  We made the decision some time ago (or, at least, I 
did) to adopt the value 0000 as a Boolean "false," and -1, or 
FFFFh, as a Boolean "true."  The nice part about this choice is 
that bitwise operations work exactly the same way as logical ones.  
In other words, when we do an operation on one bit of a logical 
variable, we do it on all of them.  This means that we don't need 
to distinguish between logical and bitwise operations, as is done 
in C with the operators & and &&, and | and ||.  Reducing the 
number of operators by half certainly doesn't seem all bad. 
 
>From the point of view of the data in storage, of course, the 
computer and compiler couldn't care less whether the number FFFFh 
represents the logical TRUE, or the numeric -1.  Should we?  I 
sort of think not.  I can think of many examples (though they 
might be frowned upon as "tricky" code) where the ability to mix 
the types might come in handy.  Example, the Dirac delta function, 
which could be coded in one simple line: 
 
 -(x=0) 
 
or the absolute value function (DEFINITELY tricky code!): 
 
 x*(1+2*(x<0)) 
 
Please note, I'm not advocating coding like this as a way of life.  
I'd almost certainly write these functions in more readable form,  
using IFs, just to keep from confusing later maintainers.  Still, 
a moral question arises:  Do we have the right to ENFORCE our 
ideas of good coding practice on the programmer, but writing the 
language so he can't do anything else?  That's what Nicklaus Wirth 
did, in many places in Pascal, and Pascal has been criticized for 
it -- for not being as "forgiving" as C.   
 
An interesting parallel presents itself in the example of the 
Motorola 68000 design.  Though Motorola brags loudly about the 
orthogonality of their instruction set, the fact is that it's far 
from orthogonal.  For example, you can read a variable from its 
address:
 
 MOVE X,D0 (where X is the name of a variable) 
 
but you can't write in the same way.  To write, you must load an 
address register with the address of X.  The same is true for PC-
relative addressing:
 
 MOVE X(PC),DO (legal) 
 MOVE D0,X(PC) (illegal) 
 
When you begin asking how such non-orthogonal behavior came about, 
you find that someone in Motorola had some theories about how 
software should be written.  Specifically, in this case, they 
decided that self-modifying code, which you can implement using 
PC-relative writes, is a Bad Thing.  Therefore, they designed the 
processor to prohibit it.  Unfortunately, in the process they also 
prohibited _ALL_ writes of the forms shown above, however benign.  
Note that this was not something done by default.  Extra design 
work had to be done, and extra gates added, to destroy the natural 
orthogonality of the instruction set. 
 
One of the lessons I've learned from life: If you have two 
choices, and can't decide which one to take, sometimes the best 
thing to do is nothing.  Why add extra gates to a processor to 
enforce some stranger's idea of good programming practice?  Leave 
the instructions in, and let the programmers debate what good 
programming practice is.  Similarly, why should we add extra code 
to our parser, to test for and prevent conditions that the user 
might prefer to do, anyway?  I'd rather leave the compiler simple, 
and let the software experts debate whether the practices should 
be used or not. 
 
All of which serves as rationalization for my decision as to how 
to prevent mixed-type arithmetic:  I won't.  For a language 
intended for systems programming, the fewer rules, the better. If 
you don't agree, and want to test for such conditions, we can do 
it once we have a symbol table. 
 
BOOLEAN "AND" 
 
With that bit of philosophy out of the way, we can press on to the 
"and" operator, which goes into procedure Term. By now, you can 
probably do this without me, but here's the code, anyway: 
 
In Scanner, 
 
{--------------------------------------------------------------} 
function IsMulop(c: char): boolean; 
begin 
 IsMulop := c in ['*','/', '&']; 
end; 
{--------------------------------------------------------------} 
 
In Parser, 
 
 
{--------------------------------------------------------------} 
procedure Term; 
begin 
 Factor; 
 while IsMulop(Look) do 
  case Look of 
   '*': Multiply; 
   '/': Divide; 
   '&': _And; 
  end; 
end; 
 
{--------------------------------------------------------------} 
{ Parse and Translate a Boolean And Operation } 
 
procedure _And; 
begin 
 Match('&'); 
 Push; 
 Factor; 
 PopAnd; 
end; 
{--------------------------------------------------------------} 
 
and in CodeGen, 
 
 
{--------------------------------------------------------------} 
{ And Primary with TOS } 
 
procedure PopAnd; 
begin 
 EmitLn('AND (SP)+,D0'); 
end; 
{--------------------------------------------------------------} 
 
Your parser should now be able to process almost any sort of logical 
expression, and (should you be so inclined), mixed-mode expressions as 
well. 
 
Why not "all sorts of logical expressions"?  Because, so far, we haven't 
dealt with the logical "not" operator, and this is where it gets tricky.  
The logical "not" operator seems, at first glance, to be identical in 
its behavior to the unary minus, so my first thought was to let the 
exclusive or operator, '~', double as the unary "not."  That didn't 
work. In my first attempt, procedure SignedTerm simply ate my '~', 
because the character passed the test for an addop, but SignedTerm 
ignores all addops except '-'.  It would have been easy enough to add 
another line to SignedTerm, but that would still not solve the problem, 
because note that Expression only accepts a signed term for the _FIRST_ 
argument.   
 
Mathematically, an expression like: 
 
 -a * -b 
 
makes little or no sense, and the parser should flag it as an error.  
But the same expression, using a logical "not," makes perfect sense: 
 
 not a and not b 
 
In the case of these unary operators, choosing to make them act the same 
way seems an artificial force fit, sacrificing reasonable behavior on 
the altar of implementational ease.  While I'm all for keeping the 
implementation as simple as possible, I don't think we should do so at 
the expense of reasonableness.  Patching like this would be missing the 
main point, which is that the logical "not" is simply NOT the same kind 
of animal as the unary minus.  Consider the exclusive or, which is most 
naturally written as:   
 
 a~b ::= (a and not b) or (not a and b) 
 
If we allow the "not" to modify the whole term, the last term in 
parentheses would be interpreted as: 
 
 not(a and b) 
 
which is not the same thing at all.  So it's clear that the logical 
"not" must be thought of as connected to the FACTOR, not the term. 
 
The idea of overloading the '~' operator also makes no sense from a 
mathematical point of view.  The implication of the unary minus is that 
it's equivalent to a subtraction from zero: 
 
 -x <=> 0-x 
 
In fact, in one of my more simple-minded versions of Expression, I 
reacted to a leading addop by simply preloading a zero, then processing 
the operator as though it were a binary operator.  But a "not" is not 
equivalent to an exclusive or with zero ... that would just give back 
the original number.  Instead, it's an exclusive or with FFFFh, or -1. 
 
In short, the seeming parallel between the unary "not" and the unary 
minus falls apart under closer scrutiny. "not" modifies the factor, not 
the term, and it is not related to either the unary minus nor the 
exclusive or.  Therefore, it deserves a symbol to call its own. What 
better symbol than the obvious one, also used by C, the '!' character?  
Using the rules about the way we think the "not" should behave, we 
should be able to code the exclusive or (assuming we'd ever need to), in 
the very natural form: 
 
 a & !b | !a & b 
 
Note that no parentheses are required -- the precedence levels we've 
chosen automatically take care of things. 
 
If you're keeping score on the precedence levels, this definition puts 
the '!' at the top of the heap.  The levels become: 
 
1. ! 
2. - (unary) 
3. *, /, & 
4. +, -, |, ~ 
 
Looking at this list, it's certainly not hard to see why we had trouble 
using '~' as the "not" symbol! 
 
So how do we mechanize the rules?  In the same way as we did with 
SignedTerm, but at the factor level.  We'll define a procedure 
NotFactor: 
 
 
{--------------------------------------------------------------} 
{ Parse and Translate a Factor with Optional "Not" } 
 
procedure NotFactor; 
begin 
 if Look ='!' then begin 
  Match('!'); 
  Factor; 
  Notit; 
  end 
 else 
  Factor; 
end; 
{--------------------------------------------------------------} 
 
 
and call it from all the places where we formerly called Factor, i.e., 
from Term, Multiply, Divide, and _And.  Note the new code generation 
procedure: 
 
 
{--------------------------------------------------------------} 
{ Bitwise Not Primary } 
 
procedure NotIt; 
begin 
 EmitLn('EOR #-1,D0'); 
end; 
 
{--------------------------------------------------------------} 
 
 
Try this now, with a few simple cases. In fact, try that exclusive or 
example, 
 
 a&!b|!a&b 
 
 
You should get the code (without the comments, of course): 
 
 MOVE A(PC),DO    ; load a 
 MOVE D0,-(SP)  ; push it 
 MOVE B(PC),DO  ; load b 
 EOR #-1,D0  ; not it 
 AND (SP)+,D0  ; and with a 
 MOVE D0,-(SP)  ; push result 
 MOVE A(PC),DO  ; load a 
 EOR #-1,D0  ; not it 
 MOVE D0,-(SP)  ; push it 
 MOVE B(PC),DO  ; load b 
 AND (SP)+,D0  ; and with !a 
 OR (SP)+,D0  ; or with first term 
 
That's precisely what we'd like to get.  So, at least for both 
arithmetic and logical operators, our new precedence and new, slimmer 
syntax hang together.  Even the peculiar, but legal, expression with 
leading addop: 
 
 ~x 
 
makes sense.  SignedTerm ignores the leading '~', as it should, since 
the expression is equivalent to: 
 
 0~x, 
 
which is equal to x. 
 
When we look at the BNF we've created, we find that our boolean algebra 
now adds only one extra line: 
 
 
<not_factor>  ::= [!] <factor> 
<factor>       ::= <variable> | <constant> | '(' <expression> ')'  
<signed_term>  ::= [<addop>] <term> 
<term>         ::= <not_factor> (<mulop> <not_factor>)*  
<expression>   ::= <signed_term> (<addop> <term>)* 
<assignment>   ::= <variable> '=' <expression> 
 
 
That's a big improvement over earlier efforts.  Will our luck continue 
to hold when we get to relational operators?  We'll find out soon, but 
it will have to wait for the next installment. We're at a good stopping 
place, and I'm anxious to get this installment into your hands.  It's 
already been a year since the release of Installment 15.  I blush to 
admit that all of this current installment has been ready for almost as 
long, with the exception of relational operators.  But the information 
does you no good at all, sitting on my hard disk, and by holding it back 
until the relational operations were done, I've kept it out of your 
hands for that long.  It's time for me to let go of it and get it out 
where you can get value from it. Besides, there are quite a number of 
serious philosophical questions associated with the relational 
operators, as well, and I'd rather save them for a separate installment 
where I can do them justice. 
 
Have fun with the new, leaner arithmetic and logical parsing, and I'll 
see you soon with relationals. 
 


***************************************************************** 
*                                                               * 
*                        COPYRIGHT NOTICE                       * 
*                                                               * 
*   Copyright (C) 1995 Jack W. Crenshaw. All rights reserved.   *   
*                                                               * 
*                                                               * 
***************************************************************** 
 
--
--/*\ Jon A. Lambert - TychoMUD     Internet:jlsysinc at ix.netcom.com /*\--
--/*\ Mud Server Developer's Page <http://www.netcom.com/~jlsysinc> /*\--
--/*\   "Everything that deceives may be said to enchant" - Plato   /*\--





More information about the mud-dev-archive mailing list