Tutorial: Let's build a Compiler! - Part XVI: Unit Construction
Jon A. Lambert
jlsysinc at ix.netcom.com
Wed Mar 4 23:23:53 CET 1998
LET'S BUILD A COMPILER!
By
Jack W. Crenshaw, Ph.D.
29 May, 1995
Part 16: UNIT CONSTRUCTION
*****************************************************************
* *
* COPYRIGHT NOTICE *
* *
* Copyright (C) 1995 Jack W. Crenshaw. All rights reserved. *
* *
*****************************************************************
INTRODUCTION
This series of tutorials promises to be perhaps one of the longest-
running mini-series in history, rivalled only by the delay in Volume IV
of Knuth. Begun in 1988, the series ran into a four-year hiatus in 1990
when the "cares of this world," changes in priorities and interests, and
the need to make a living seemed to stall it out after Installment 14.
Those of you with loads of patience were finally rewarded, in the spring
of last year, with the long-awaited Installment 15. In it, I began to
try to steer the series back on track, and in the process, to make it
easier to continue on to the goal, which is to provide you with not only
enough understanding of the difficult subject of compiler theory, but
also enough tools, in the form of canned subroutines and concepts, so
that you would be able to continue on your own and become proficient
enough to build your own parsers and translators. Because of that long
hiatus, I thought it appropriate to go back and review the concepts we
have covered so far, and to redo some of the software, as well. In the
past, we've never concerned ourselves much with the development of
production-quality software tools ... after all, I was trying to teach
(and learn) concepts, not production practice. To do that, I tended to
give you, not complete compilers or parsers, but only those snippets of
code that illustrated the particular point we were considering at the
moment.
I still believe that's a good way to learn any subject; no one wants to
have to make changes to 100,000 line programs just to try out a new
idea. But the idea of just dealing with code snippets, rather than
complete programs, also has its drawbacks in that we often seemed to be
writing the same code fragments over and over. Although repetition has
been thoroughly proven to be a good way to learn new ideas, it's also
true that one can have too much of a good thing. By the time I had
completed Installment 14 I seemed to have reached the limits of my
abilities to juggle multiple files and multiple versions of the same
software functions. Who knows, perhaps that's one reason I seemed to
have run out of gas at that point.
Fortunately, the later versions of Borland's Turbo Pascal allow us to
have our cake and eat it too. By using their concept of separately
compilable units, we can still write small subroutines and functions,
and keep our main programs and test programs small and simple. But,
once written, the code in the Pascal units will always be there for us
to use, and linking them in is totally painless and transparent.
Since, by now, most of you are programming in either C or C++, I know
what you're thinking: Borland, with their Turbo Pascal (TP), certainly
didn't invent the concept of separately compilable modules. And of
course you're right. But if you've not used TP lately, or ever, you may
not realize just how painless the whole process is. Even in C or C++,
you still have to build a make file, either manually or by telling the
compiler how to do so. You must also list, using "extern" statements or
header files, the functions you want to import. In TP, you don't even
have to do that. You need only name the units you wish to use, and all
of their procedures automatically become available.
It's not my intention to get into a language-war debate here, so I won't
pursue the subject any further. Even I no longer use Pascal on my job
... I use C at work and C++ for my articles in Embedded Systems
Programming and other magazines. Believe me, when I set out to
resurrect this series, I thought long and hard about switching both
languages and target systems to the ones that we're all using these
days, C/C++ and PC architecture, and possibly object-oriented methods as
well. In the end, I felt it would cause more confusion than the hiatus
itself has. And after all, Pascal still remains one of the best possible
languages for teaching, not to mention production programming. Finally,
TP still compiles at the speed of light, much faster than competing
C/C++ compilers. And Borland's smart linker, used in TP but not in their
C++ products, is second to none. Aside from being much faster than
Microsoft-compatible linkers, the Borland smart linker will cull unused
procedures and data items, even to the extent of trimming them out of
defined objects if they're not needed. For one of the few times in our
lives, we don't have to compromise between completeness and efficiency.
When we're writing a TP unit, we can make it as complete as we like,
including any member functions and data items we may think we will ever
need, confident that doing so will not create unwanted bloat in the
compiled and linked executable.
The point, really, is simply this: By using TP's unit mechanism, we can
have all the advantages and convenience of writing small, seemingly
stand-alone test programs, without having to constantly rewrite the
support functions that we need. Once written, the TP units sit there,
quietly waiting to do their duty and give us the support we need, when
we need it.
Using this principle, in Installment 15 I set out to minimize our
tendency to re-invent the wheel by organizing our code into separate
Turbo Pascal units, each containing different parts of the compiler. We
ended up with the following units:
* Input
* Output
* Errors
* Scanner
* Parser
* CodeGen
Each of these units serves a different function, and encapsulates
specific areas of functionality. The Input and Output units, as their
name implies, provide character stream I/O and the all-important
lookahead character upon which our predictive parser is based. The
Errors unit, of course, provides standard error handling. The Scanner
unit contains all of our boolean functions such as IsAlpha, and the
routines GetName and GetNumber, which process multi-character tokens.
The two units we'll be working with the most, and the ones that most
represent the personality of our compiler, are Parser and CodeGen.
Theoretically, the Parser unit should encapsulate all aspects of the
compiler that depend on the syntax of the compiled language (though, as
we saw last time, a small amount of this syntax spills over into
Scanner). Similarly, the code generator unit, CodeGen, contains all of
the code dependent upon the target machine. In this installment, we'll
be continuing with the development of the functions in these two all-
important units.
JUST LIKE CLASSICAL?
Before we proceed, however, I think I should clarify the relationship
between, and the functionality of these units. Those of you who are
familiar with compiler theory as taught in universities will, of course,
recognize the names, Scanner, Parser, and CodeGen, all of which are
components of a classical compiler implementation. You may be thinking
that I've abandoned my commitment to the KISS philosophy, and drifted
towards a more conventional architecture than we once had. A closer
look, however, should convince you that, while the names are similar,
the functionalities are quite different.
Together, the scanner and parser of a classical implementation comprise
the so-called "front end," and the code generator, the back end. The
front end routines process the language-dependent, syntax-related
aspects of the source language, while the code generator, or back end,
deals with the target machine-dependent parts of the problem. In
classical compilers, the two ends communicate via a file of instructions
written in an intermediate language (IL).
Typically, a classical scanner is a single procedure, operating as a co-
procedure with the parser. It "tokenizes" the source file, reading it
character by character, recognizing language elements, translating them
into tokens, and passing them along to the parser. You can think of the
parser as an abstract machine, executing "op codes," which are the
tokens. Similarly, the parser generates op codes of a second abstract
machine, which mechanizes the IL. Typically, the IL file is written to
disk by the parser, and read back again by the code generator.
Our organization is quite different. We have no lexical scanner, in the
classical sense; our unit Scanner, though it has a similar name, is not
a single procedure or co-procedure, but merely a set of separate
subroutines which are called by the parser as needed.
Similarly, the classical code generator, the back end, is a translator
in its own right, reading an IL "source" file, and emitting an object
file. Our code generator doesn't work that way. In our compiler, there
IS no intermediate language; every construct in the source language
syntax is converted into assembly language as it is recognized by the
parser. Like Scanner, the unit CodeGen consists of individual
procedures which are called by the parser as needed.
This "code 'em as you find 'em" philosophy may not produce the world's
most efficient code -- for example, we haven't provided (yet!) a
convenient place for an optimizer to work its magic -- but it sure does
simplify the compiler, doesn't it?
And that observation prompts me to reflect, once again, on how we have
managed to reduce a compiler's functions to such comparatively simple
terms. I've waxed eloquent on this subject in past installments, so I
won't belabor the point too much here. However, because of the time
that's elapsed since those last soliloquies, I hope you'll grant me just
a little time to remind myself, as well as you, how we got here. We got
here by applying several principles that writers of commercial compilers
seldom have the luxury of using. These are:
o The KISS philosophy -- Never do things the hard way without a
reason
o Lazy coding -- Never put off until tomorrow what you can put
of forever (with credits to P.J. Plauger)
o Skepticism -- Stubborn refusal to do something just because
that's the way it's always been done.
o Acceptance of inefficient code
o Rejection of arbitrary constraints
As I've reviewed the history of compiler construction, I've learned that
virtually every production compiler in history has suffered from pre-
imposed conditions that strongly influenced its design. The original
FORTRAN compiler of John Backus, et al, had to compete with assembly
language, and therefore was constrained to produce extremely efficient
code. The IBM compilers for the minicomputers of the 70's had to run in
the very small RAM memories then available -- as small as 4k. The early
Ada compiler had to compile itself. Per Brinch Hansen decreed that his
Pascal compiler developed for the IBM PC must execute in a 64k machine.
Compilers developed in Computer Science courses had to compile the
widest variety of languages, and therefore required LALR parsers.
In each of these cases, these preconceived constraints literally
dominated the design of the compiler.
A good example is Brinch Hansen's compiler, described in his excellent
book, "Brinch Hansen on Pascal Compilers" (highly recommended). Though
his compiler is one of the most clear and un-obscure compiler
implementations I've seen, that one decision, to compile large files in
a small RAM, totally drives the design, and he ends up with not just
one, but many intermediate files, together with the drivers to write and
read them.
In time, the architectures resulting from such decisions have found
their way into computer science lore as articles of faith. In this one
man's opinion, it's time that they were re-examined critically. The
conditions, environments, and requirements that led to classical
architectures are not the same as the ones we have today. There's no
reason to believe the solutions should be the same, either.
In this tutorial, we've followed the leads of such pioneers in the world
of small compilers for Pcs as Leor Zolman, Ron Cain, and James Hendrix,
who didn't know enough compiler theory to know that they "couldn't do it
that way." We have resolutely refused to accept arbitrary constraints,
but rather have done whatever was easy. As a result, we have evolved an
architecture that, while quite different from the classical one, gets
the job done in very simple and straightforward fashion.
I'll end this philosophizing with an observation re the notion of an
intermediate language. While I've noted before that we don't have one
in our compiler, that's not exactly true; we _DO_ have one, or at least
are evolving one, in the sense that we are defining code generation
functions for the parser to call. In essence, every call to a code
generation procedure can be thought of as an instruction in an
intermediate language. Should we ever find it necessary to formalize an
intermediate language, this is the way we would do it: emit codes from
the parser, each representing a call to one of the code generator
procedures, and then process each code by calling those procedures in a
separate pass, implemented in a back end. Frankly, I don't see that
we'll ever find a need for this approach, but there is the connection,
if you choose to follow it, between the classical and the current
approaches.
FLESHING OUT THE PARSER
Though I promised you, somewhere along about Installment 14, that we'd
never again write every single function from scratch, I ended up
starting to do just that in Installment 15. One reason: that long
hiatus between the two installments made a review seem eminently
justified ... even imperative, both for you and for me. More
importantly, the decision to collect the procedures into modules
(units), forced us to look at each one yet again, whether we wanted to
or not. And, finally and frankly, I've had some new ideas in the last
four years that warranted a fresh look at some old friends. When I
first began this series, I was frankly amazed, and pleased, to learn
just how simple parsing routines can be made. But this last time
around, I've surprised myself yet again, and been able to make them just
that last little bit simpler, yet.
Still, because of this total rewrite of the parsing modules, I was only
able to include so much in the last installment. Because of this, our
hero, the parser, when last seen, was a shadow of his former self,
consisting of only enough code to parse and process a factor consisting
of either a variable or a constant. The main effort of this current
installment will be to help flesh out the parser to its former glory.
In the process, I hope you'll bear with me if we sometimes cover ground
we've long since been over and dealt with.
First, let's take care of a problem that we've addressed before: Our
current version of procedure Factor, as we left it in Installment 15,
can't handle negative arguments. To fix that, we'll introduce the
procedure SignedFactor:
{--------------------------------------------------------------}
{ Parse and Translate a Factor with Optional Sign }
procedure SignedFactor;
var Sign: char;
begin
Sign := Look;
if IsAddop(Look) then
GetChar;
Factor;
if Sign = '-' then Negate;
end;
{--------------------------------------------------------------}
Note that this procedure calls a new code generation routine, Negate:
{--------------------------------------------------------------}
{ Negate Primary }
procedure Negate;
begin
EmitLn('NEG D0');
end;
{--------------------------------------------------------------}
(Here, and elsewhere in this series, I'm only going to show you the new
routines. I'm counting on you to put them into the proper unit, which
you should normally have no trouble identifying. Don't forget to add
the procedure's prototype to the interface section of the unit.)
In the main program, simply change the procedure called from Factor to
SignedFactor, and give the code a test. Isn't it neat how the Turbo
linker and make facility handle all the details?
Yes, I know, the code isn't very efficient. If we input a number, -3,
the generated code is:
MOVE #3,D0
NEG D0
which is really, really dumb. We can do better, of course, by simply
pre-appending a minus sign to the string passed to LoadConstant, but it
adds a few lines of code to SignedFactor, and I'm applying the KISS
philosophy very aggressively here. What's more, to tell the truth, I
think I'm subconsciously enjoying generating "really, really dumb" code,
so I can have the pleasure of watching it get dramatically better when
we get into optimization methods.
Most of you have never heard of John Spray, so allow me to introduce him
to you here. John's from New Zealand, and used to teach computer
science at one of its universities. John wrote a compiler for the
Motorola 6809, based on a delightful, Pascal-like language of his own
design called "Whimsical." He later ported the compiler to the 68000,
and for awhile it was the only compiler I had for my homebrewed 68000
system.
For the record, one of my standard tests for any new compiler is to see
how the compiler deals with a null program like:
program main;
begin
end.
My test is to measure the time required to compile and link, and the
size of the object file generated. The undisputed _LOSER_ in the test
is the DEC C compiler for the VAX, which took 60 seconds to compile, on
a VAX 11/780, and generated a 50k object file. John's compiler is the
undisputed, once, future, and forever king in the code size department.
Given the null program, Whimsical generates precisely two bytes of code,
implementing the one instruction,
RET
By setting a compiler option to generate an include file rather than a
standalone program, John can even cut this size, from two bytes to zero!
Sort of hard to beat a null object file, wouldn't you say?
Needless to say, I consider John to be something of an expert on code
optimization, and I like what he has to say: "The best way to optimize
is not to have to optimize at all, but to produce good code in the first
place." Words to live by. When we get started on optimization, we'll
follow John's advice, and our first step will not be to add a peephole
optimizer or other after-the-fact device, but to improve the quality of
the code emitted before optimization. So make a note of SignedFactor as
a good first candidate for attention, and for now we'll leave it be.
TERMS AND EXPRESSIONS
I'm sure you know what's coming next: We must, yet again, create the
rest of the procedures that implement the recursive-descent parsing of
an expression. We all know that the hierarchy of procedures for
arithmetic expressions is:
expression
term
factor
However, for now let's continue to do things one step at a time,
and consider only expressions with additive terms in them. The
code to implement expressions, including a possibly signed first
term, is shown next:
{--------------------------------------------------------------}
{ Parse and Translate an Expression }
procedure Expression;
begin
SignedFactor;
while IsAddop(Look) do
case Look of
'+': Add;
'-': Subtract;
end;
end;
{--------------------------------------------------------------}
This procedure calls two other procedures to process the
operations:
{--------------------------------------------------------------}
{ Parse and Translate an Addition Operation }
procedure Add;
begin
Match('+');
Push;
Factor;
PopAdd;
end;
{--------------------------------------------------------------}
{ Parse and Translate a Subtraction Operation }
procedure Subtract;
begin
Match('-');
Push;
Factor;
PopSub;
end;
{--------------------------------------------------------------}
The three procedures Push, PopAdd, and PopSub are new code generation
routines. As the name implies, procedure Push generates code to push
the primary register (D0, in our 68000 implementation) to the stack.
PopAdd and PopSub pop the top of the stack again, and add it to, or
subtract it from, the primary register. The code is shown next:
{--------------------------------------------------------------}
{ Push Primary to Stack }
procedure Push;
begin
EmitLn('MOVE D0,-(SP)');
end;
{--------------------------------------------------------------}
{ Add TOS to Primary }
procedure PopAdd;
begin
EmitLn('ADD (SP)+,D0');
end;
{--------------------------------------------------------------}
{ Subtract TOS from Primary }
procedure PopSub;
begin
EmitLn('SUB (SP)+,D0');
Negate;
end;
{--------------------------------------------------------------}
Add these routines to Parser and CodeGen, and change the main program to
call Expression. Voila!
The next step, of course, is to add the capability for dealing with
multiplicative terms. To that end, we'll add a procedure Term, and code
generation procedures PopMul and PopDiv. These code generation
procedures are shown next:
{--------------------------------------------------------------}
{ Multiply TOS by Primary }
procedure PopMul;
begin
EmitLn('MULS (SP)+,D0');
end;
{--------------------------------------------------------------}
{ Divide Primary by TOS }
procedure PopDiv;
begin
EmitLn('MOVE (SP)+,D7');
EmitLn('EXT.L D7');
EmitLn('DIVS D0,D7');
EmitLn('MOVE D7,D0');
end;
{--------------------------------------------------------------}
I admit, the division routine is a little busy, but there's no help for
it. Unfortunately, while the 68000 CPU allows a division using the top
of stack (TOS), it wants the arguments in the wrong order, just as it
does for subtraction. So our only recourse is to pop the stack to a
scratch register (D7), perform the division there, and then move the
result back to our primary register, D0. Note the use of signed multiply
and divide operations. This follows an implied, but unstated,
assumption, that all our variables will be signed 16-bit integers. This
decision will come back to haunt us later, when we start looking at
multiple data types, type conversions, etc.
Our procedure Term is virtually a clone of Expression, and looks like
this:
{--------------------------------------------------------------}
{ Parse and Translate a Term }
procedure Term;
begin
Factor;
while IsMulop(Look) do
case Look of
'*': Multiply;
'/': Divide;
end;
end;
{--------------------------------------------------------------}
Our next step is to change some names. SignedFactor now becomes
SignedTerm, and the calls to Factor in Expression, Add, Subtract and
SignedTerm get changed to call Term:
{--------------------------------------------------------------}
{ Parse and Translate a Term with Optional Leading Sign }
procedure SignedTerm;
var Sign: char;
begin
Sign := Look;
if IsAddop(Look) then
GetChar;
Term;
if Sign = '-' then Negate;
end;
{--------------------------------------------------------------}
...
{--------------------------------------------------------------}
{ Parse and Translate an Expression }
procedure Expression;
begin
SignedTerm;
while IsAddop(Look) do
case Look of
'+': Add;
'-': Subtract;
end;
end;
{--------------------------------------------------------------}
If memory serves me correctly, we once had BOTH a procedure SignedFactor
and a procedure SignedTerm. I had reasons for doing that at the time ...
they had to do with the handling of Boolean algebra and, in particular,
the Boolean "not" function. But certainly, for arithmetic operations,
that duplication isn't necessary. In an expression like:
-x*y
it's very apparent that the sign goes with the whole TERM, x*y, and not
just the factor x, and that's the way Expression is coded.
Test this new code by executing Main. It still calls Expression, so you
should now be able to deal with expressions containing any of the four
arithmetic operators.
Our last bit of business, as far as expressions goes, is to modify
procedure Factor to allow for parenthetical expressions. By using a
recursive call to Expression, we can reduce the needed code to virtually
nothing. Five lines added to Factor do the job:
{--------------------------------------------------------------}
{ Parse and Translate a Factor }
procedure Factor;
begin
if Look ='(' then begin
Match('(');
Expression;
Match(')');
end
else if IsDigit(Look) then
LoadConstant(GetNumber)
else if IsAlpha(Look)then
LoadVariable(GetName)
else
Error('Unrecognized character ' + Look);
end;
{--------------------------------------------------------------}
At this point, your "compiler" should be able to handle any legal
expression you can throw at it. Better yet, it should reject all
illegal ones!
ASSIGNMENTS
As long as we're this close, we might as well create the code to deal
with an assignment statement. This code needs only to remember the name
of the target variable where we are to store the result of an
expression, call Expression, then store the number. The procedure is
shown next:
{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }
procedure Assignment;
var Name: string;
begin
Name := GetName;
Match('=');
Expression;
StoreVariable(Name);
end;
{--------------------------------------------------------------}
The assignment calls for yet another code generation routine:
{--------------------------------------------------------------}
{ Store the Primary Register to a Variable }
procedure StoreVariable(Name: string);
begin
EmitLn('LEA ' + Name + '(PC),A0');
EmitLn('MOVE D0,(A0)');
end;
{--------------------------------------------------------------}
Now, change the call in Main to call Assignment, and you should see a
full assignment statement being processed correctly. Pretty neat, eh?
And painless, too.
In the past, we've always tried to show BNF relations to define the
syntax we're developing. I haven't done that here, and it's high time I
did. Here's the BNF:
<factor> ::= <variable> | <constant> | '(' <expression> ')'
<signed_term> ::= [<addop>] <term>
<term> ::= <factor> (<mulop> <factor>)*
<expression> ::= <signed_term> (<addop> <term>)*
<assignment> ::= <variable> '=' <expression>
BOOLEANS
The next step, as we've learned several times before, is to add Boolean
algebra. In the past, this step has at least doubled the amount of code
we've had to write. As I've gone over this step in my mind, I've found
myself diverging more and more from what we did in previous
installments. To refresh your memory, I noted that Pascal treats the
Boolean operators pretty much identically to the way it treats
arithmetic ones. A Boolean "and" has the same precedence level as
multiplication, and the "or" as addition. C, on the other hand, sets
them at different precedence levels, and all told has a whopping 17
levels. In our earlier work, I chose something in between, with seven
levels. As a result, we ended up with things called Boolean
expressions, paralleling in most details the arithmetic expressions, but
at a different precedence level. All of this, as it turned out, came
about because I didn't like having to put parentheses around the Boolean
expressions in statements like:
IF (c >= 'A') and (c <= 'Z') then ...
In retrospect, that seems a pretty petty reason to add many layers of
complexity to the parser. Perhaps more to the point, I'm not sure I was
even able to avoid the parens.
For kicks, let's start anew, taking a more Pascal-ish approach, and just
treat the Boolean operators at the same precedence level as the
arithmetic ones. We'll see where it leads us. If it seems to be down
the garden path, we can always backtrack to the earlier approach.
For starters, we'll add the "addition-level" operators to Expression.
That's easily done; first, modify the function IsAddop in unit Scanner
to include two extra operators: '|' for "or," and '~' for "exclusive
or":
{--------------------------------------------------------------}
function IsAddop(c: char): boolean;
begin
IsAddop := c in ['+','-', '|', '~'];
end;
{--------------------------------------------------------------}
Next, we must include the parsing of the operators in procedure
Expression:
{--------------------------------------------------------------}
procedure Expression;
begin
SignedTerm;
while IsAddop(Look) do
case Look of
'+': Add;
'-': Subtract;
'|': _Or;
'~': _Xor;
end;
{--------------------------------------------------------------}
end;
(The underscores are needed, of course, because "or" and "xor" are
reserved words in Turbo Pascal.)
Next, the procedures _Or and _Xor:
{--------------------------------------------------------------}
{ Parse and Translate a Subtraction Operation }
procedure _Or;
begin
Match('|');
Push;
Term;
PopOr;
end;
{--------------------------------------------------------------}
{ Parse and Translate a Subtraction Operation }
procedure _Xor;
begin
Match('~');
Push;
Term;
PopXor;
end;
{--------------------------------------------------------------}
And, finally, the new code generator procedures:
{--------------------------------------------------------------}
{ Or TOS with Primary }
procedure PopOr;
begin
EmitLn('OR (SP)+,D0');
end;
{--------------------------------------------------------------}
{ Exclusive-Or TOS with Primary }
procedure PopXor;
begin
EmitLn('EOR (SP)+,D0');
end;
{--------------------------------------------------------------}
Now, let's test the translator (you might want to change the call
in Main back to a call to Expression, just to avoid having to type
"x=" for an assignment every time).
So far, so good. The parser nicely handles expressions of the
form:
x|y~z
Unfortunately, it also does nothing to protect us from mixing
Boolean and arithmetic algebra. It will merrily generate code
for:
(a+b)*(c~d)
We've talked about this a bit, in the past. In general the rules
for what operations are legal or not cannot be enforced by the
parser itself, because they are not part of the syntax of the
language, but rather its semantics. A compiler that doesn't allow
mixed-mode expressions of this sort must recognize that c and d
are Boolean variables, rather than numeric ones, and balk at
multiplying them in the next step. But this "policing" can't be
done by the parser; it must be handled somewhere between the
parser and the code generator. We aren't in a position to enforce
such rules yet, because we haven't got either a way of declaring
types, or a symbol table to store the types in. So, for what
we've got to work with at the moment, the parser is doing
precisely what it's supposed to do.
Anyway, are we sure that we DON'T want to allow mixed-type
operations? We made the decision some time ago (or, at least, I
did) to adopt the value 0000 as a Boolean "false," and -1, or
FFFFh, as a Boolean "true." The nice part about this choice is
that bitwise operations work exactly the same way as logical ones.
In other words, when we do an operation on one bit of a logical
variable, we do it on all of them. This means that we don't need
to distinguish between logical and bitwise operations, as is done
in C with the operators & and &&, and | and ||. Reducing the
number of operators by half certainly doesn't seem all bad.
>From the point of view of the data in storage, of course, the
computer and compiler couldn't care less whether the number FFFFh
represents the logical TRUE, or the numeric -1. Should we? I
sort of think not. I can think of many examples (though they
might be frowned upon as "tricky" code) where the ability to mix
the types might come in handy. Example, the Dirac delta function,
which could be coded in one simple line:
-(x=0)
or the absolute value function (DEFINITELY tricky code!):
x*(1+2*(x<0))
Please note, I'm not advocating coding like this as a way of life.
I'd almost certainly write these functions in more readable form,
using IFs, just to keep from confusing later maintainers. Still,
a moral question arises: Do we have the right to ENFORCE our
ideas of good coding practice on the programmer, but writing the
language so he can't do anything else? That's what Nicklaus Wirth
did, in many places in Pascal, and Pascal has been criticized for
it -- for not being as "forgiving" as C.
An interesting parallel presents itself in the example of the
Motorola 68000 design. Though Motorola brags loudly about the
orthogonality of their instruction set, the fact is that it's far
from orthogonal. For example, you can read a variable from its
address:
MOVE X,D0 (where X is the name of a variable)
but you can't write in the same way. To write, you must load an
address register with the address of X. The same is true for PC-
relative addressing:
MOVE X(PC),DO (legal)
MOVE D0,X(PC) (illegal)
When you begin asking how such non-orthogonal behavior came about,
you find that someone in Motorola had some theories about how
software should be written. Specifically, in this case, they
decided that self-modifying code, which you can implement using
PC-relative writes, is a Bad Thing. Therefore, they designed the
processor to prohibit it. Unfortunately, in the process they also
prohibited _ALL_ writes of the forms shown above, however benign.
Note that this was not something done by default. Extra design
work had to be done, and extra gates added, to destroy the natural
orthogonality of the instruction set.
One of the lessons I've learned from life: If you have two
choices, and can't decide which one to take, sometimes the best
thing to do is nothing. Why add extra gates to a processor to
enforce some stranger's idea of good programming practice? Leave
the instructions in, and let the programmers debate what good
programming practice is. Similarly, why should we add extra code
to our parser, to test for and prevent conditions that the user
might prefer to do, anyway? I'd rather leave the compiler simple,
and let the software experts debate whether the practices should
be used or not.
All of which serves as rationalization for my decision as to how
to prevent mixed-type arithmetic: I won't. For a language
intended for systems programming, the fewer rules, the better. If
you don't agree, and want to test for such conditions, we can do
it once we have a symbol table.
BOOLEAN "AND"
With that bit of philosophy out of the way, we can press on to the
"and" operator, which goes into procedure Term. By now, you can
probably do this without me, but here's the code, anyway:
In Scanner,
{--------------------------------------------------------------}
function IsMulop(c: char): boolean;
begin
IsMulop := c in ['*','/', '&'];
end;
{--------------------------------------------------------------}
In Parser,
{--------------------------------------------------------------}
procedure Term;
begin
Factor;
while IsMulop(Look) do
case Look of
'*': Multiply;
'/': Divide;
'&': _And;
end;
end;
{--------------------------------------------------------------}
{ Parse and Translate a Boolean And Operation }
procedure _And;
begin
Match('&');
Push;
Factor;
PopAnd;
end;
{--------------------------------------------------------------}
and in CodeGen,
{--------------------------------------------------------------}
{ And Primary with TOS }
procedure PopAnd;
begin
EmitLn('AND (SP)+,D0');
end;
{--------------------------------------------------------------}
Your parser should now be able to process almost any sort of logical
expression, and (should you be so inclined), mixed-mode expressions as
well.
Why not "all sorts of logical expressions"? Because, so far, we haven't
dealt with the logical "not" operator, and this is where it gets tricky.
The logical "not" operator seems, at first glance, to be identical in
its behavior to the unary minus, so my first thought was to let the
exclusive or operator, '~', double as the unary "not." That didn't
work. In my first attempt, procedure SignedTerm simply ate my '~',
because the character passed the test for an addop, but SignedTerm
ignores all addops except '-'. It would have been easy enough to add
another line to SignedTerm, but that would still not solve the problem,
because note that Expression only accepts a signed term for the _FIRST_
argument.
Mathematically, an expression like:
-a * -b
makes little or no sense, and the parser should flag it as an error.
But the same expression, using a logical "not," makes perfect sense:
not a and not b
In the case of these unary operators, choosing to make them act the same
way seems an artificial force fit, sacrificing reasonable behavior on
the altar of implementational ease. While I'm all for keeping the
implementation as simple as possible, I don't think we should do so at
the expense of reasonableness. Patching like this would be missing the
main point, which is that the logical "not" is simply NOT the same kind
of animal as the unary minus. Consider the exclusive or, which is most
naturally written as:
a~b ::= (a and not b) or (not a and b)
If we allow the "not" to modify the whole term, the last term in
parentheses would be interpreted as:
not(a and b)
which is not the same thing at all. So it's clear that the logical
"not" must be thought of as connected to the FACTOR, not the term.
The idea of overloading the '~' operator also makes no sense from a
mathematical point of view. The implication of the unary minus is that
it's equivalent to a subtraction from zero:
-x <=> 0-x
In fact, in one of my more simple-minded versions of Expression, I
reacted to a leading addop by simply preloading a zero, then processing
the operator as though it were a binary operator. But a "not" is not
equivalent to an exclusive or with zero ... that would just give back
the original number. Instead, it's an exclusive or with FFFFh, or -1.
In short, the seeming parallel between the unary "not" and the unary
minus falls apart under closer scrutiny. "not" modifies the factor, not
the term, and it is not related to either the unary minus nor the
exclusive or. Therefore, it deserves a symbol to call its own. What
better symbol than the obvious one, also used by C, the '!' character?
Using the rules about the way we think the "not" should behave, we
should be able to code the exclusive or (assuming we'd ever need to), in
the very natural form:
a & !b | !a & b
Note that no parentheses are required -- the precedence levels we've
chosen automatically take care of things.
If you're keeping score on the precedence levels, this definition puts
the '!' at the top of the heap. The levels become:
1. !
2. - (unary)
3. *, /, &
4. +, -, |, ~
Looking at this list, it's certainly not hard to see why we had trouble
using '~' as the "not" symbol!
So how do we mechanize the rules? In the same way as we did with
SignedTerm, but at the factor level. We'll define a procedure
NotFactor:
{--------------------------------------------------------------}
{ Parse and Translate a Factor with Optional "Not" }
procedure NotFactor;
begin
if Look ='!' then begin
Match('!');
Factor;
Notit;
end
else
Factor;
end;
{--------------------------------------------------------------}
and call it from all the places where we formerly called Factor, i.e.,
from Term, Multiply, Divide, and _And. Note the new code generation
procedure:
{--------------------------------------------------------------}
{ Bitwise Not Primary }
procedure NotIt;
begin
EmitLn('EOR #-1,D0');
end;
{--------------------------------------------------------------}
Try this now, with a few simple cases. In fact, try that exclusive or
example,
a&!b|!a&b
You should get the code (without the comments, of course):
MOVE A(PC),DO ; load a
MOVE D0,-(SP) ; push it
MOVE B(PC),DO ; load b
EOR #-1,D0 ; not it
AND (SP)+,D0 ; and with a
MOVE D0,-(SP) ; push result
MOVE A(PC),DO ; load a
EOR #-1,D0 ; not it
MOVE D0,-(SP) ; push it
MOVE B(PC),DO ; load b
AND (SP)+,D0 ; and with !a
OR (SP)+,D0 ; or with first term
That's precisely what we'd like to get. So, at least for both
arithmetic and logical operators, our new precedence and new, slimmer
syntax hang together. Even the peculiar, but legal, expression with
leading addop:
~x
makes sense. SignedTerm ignores the leading '~', as it should, since
the expression is equivalent to:
0~x,
which is equal to x.
When we look at the BNF we've created, we find that our boolean algebra
now adds only one extra line:
<not_factor> ::= [!] <factor>
<factor> ::= <variable> | <constant> | '(' <expression> ')'
<signed_term> ::= [<addop>] <term>
<term> ::= <not_factor> (<mulop> <not_factor>)*
<expression> ::= <signed_term> (<addop> <term>)*
<assignment> ::= <variable> '=' <expression>
That's a big improvement over earlier efforts. Will our luck continue
to hold when we get to relational operators? We'll find out soon, but
it will have to wait for the next installment. We're at a good stopping
place, and I'm anxious to get this installment into your hands. It's
already been a year since the release of Installment 15. I blush to
admit that all of this current installment has been ready for almost as
long, with the exception of relational operators. But the information
does you no good at all, sitting on my hard disk, and by holding it back
until the relational operations were done, I've kept it out of your
hands for that long. It's time for me to let go of it and get it out
where you can get value from it. Besides, there are quite a number of
serious philosophical questions associated with the relational
operators, as well, and I'd rather save them for a separate installment
where I can do them justice.
Have fun with the new, leaner arithmetic and logical parsing, and I'll
see you soon with relationals.
*****************************************************************
* *
* COPYRIGHT NOTICE *
* *
* Copyright (C) 1995 Jack W. Crenshaw. All rights reserved. *
* *
* *
*****************************************************************
--
--/*\ Jon A. Lambert - TychoMUD Internet:jlsysinc at ix.netcom.com /*\--
--/*\ Mud Server Developer's Page <http://www.netcom.com/~jlsysinc> /*\--
--/*\ "Everything that deceives may be said to enchant" - Plato /*\--
More information about the mud-dev-archive
mailing list