Tutorial: Let's build a Compiler! - Part XV: Back to the Future

Jon A. Lambert jlsysinc at ix.netcom.com
Wed Mar 4 23:22:25 CET 1998


                     LET'S BUILD A COMPILER!

                                By

                     Jack W. Crenshaw, Ph.D.

                           5 March 1994


Part 15: BACK TO THE FUTURE



*****************************************************************
*                                                               *
*                        COPYRIGHT NOTICE                       *
*                                                               *
*   Copyright (C) 1994 Jack W. Crenshaw. All rights reserved.   *
*                                                               *
*****************************************************************


INTRODUCTION

Can it really have been four years since I wrote installment 
fourteen of this series?  Is it really possible that six long 
years have passed since I began it?  Funny how time flies when 
you're having fun, isn't it?  

I won't spend a lot of time making excuses; only point out that 
things happen, and priorities change.  In the four years since 
installment fourteen, I've managed to get laid off, get divorced, 
have a nervous breakdown, begin a new career as a writer, begin 
another one as a consultant, move, work on two real-time systems, 
and raise fourteen baby birds, three pigeons, six possums, and a 
duck.  For awhile there, the parsing of source code was not high 
on my list of priorities.  Neither was writing stuff for free, 
instead of writing stuff for pay.  But I do try to be faithful, 
and I do recognize and feel my responsibility to you, the reader, 
to finish what I've started.  As the tortoise said in one of my 
son's old stories, I may be slow, but I'm sure.  I'm  sure that 
there are people out there anxious to see the last reel of this 
film, and I intend to give it to them.  So, if you're one of those 
who's been waiting, more or less patiently, to see how this thing 
comes out, thanks for your patience.  I apologize for the delay.  
Let's move on.


NEW STARTS, OLD DIRECTIONS

Like many other things, programming languages and programming 
styles change with time.  In 1994, it seems a little anachronistic 
to be programming in Turbo Pascal, when the rest of the world 
seems  to have gone bananas over C++.  It also seems a little 
strange to be programming in a classical style when the rest of 
the world has switched to object-oriented methods.  Still, in 
spite of the four-year hiatus, it would be entirely too wrenching 
a change, at this point, to switch to, say, C++ with object-
orientation .  Anyway, Pascal is still not only a powerful 
programming language (more than ever, in fact), but it's a 
wonderful medium for teaching.  C is a notoriously difficult 
language to read ... it's often been accused, along with Forth, of 
being a "write-only language."  When I program in C++, I find 
myself spending at least 50% of my time struggling with language 
syntax rather than with concepts.  A stray "&" or "*" can not only 
change the functioning of the program, but its correctness as 
well.  By contrast, Pascal code is usually quite transparent and 
easy to read, even if you don't know the language. What you see is 
almost always what you get, and we can concentrate on concepts 
rather than implementation details.  I've said from the beginning 
that the purpose of this tutorial series was not to generate the 
world's fastest compiler, but to teach the fundamentals of 
compiler technology, while spending the least amount of time 
wrestling with language syntax or other aspects of software 
implementation. Finally, since a lot of what we do in this course 
amounts to software experimentation, it's important to have a 
compiler and associated environment that compiles quickly and with 
no fuss.  In my opinion, by far the most significant time measure 
in software development is the speed of the edit/compile/test 
cycle.  In this department, Turbo Pascal is king.  The compilation 
speed is blazing fast, and continues to get faster in every 
release (how do they keep doing that?).  Despite vast improvements 
in C compilation speed over the years, even Borland's fastest 
C/C++ compiler is still no match for Turbo Pascal.  Further, the 
editor built into their IDE, the make facility, and even their 
superb smart linker, all complement each other to produce a 
wonderful environment for quick turnaround.  For all of these 
reasons, I intend to stick with Pascal for the duration of this 
series. We'll be using Turbo Pascal for Windows, one of the 
compilers provided Borland Pascal with Objects, version 7.0.  If 
you don't have this compiler, don't worry ... nothing we do here 
is going to count on your having the latest version. Using the 
Windows version helps me a lot, by allowing me to use the 
Clipboard to copy code from the compiler's editor into these 
documents.  It should also help you at least as much, copying the 
code in the other direction.  

I've thought long and hard about whether or not to introduce 
objects to our discussion.  I'm a big advocate of object-oriented 
methods for all uses, and such methods definitely have their place 
in compiler technology.  In fact, I've written papers on just this 
subject (Refs. 1-3).  But the architecture of a compiler which is 
based on object-oriented approaches is vastly different than that 
of the more classical compiler we've been building.  Again, it 
would seem to be entirely too much to change these horses in mid-
stream.  As I said, programming styles change.  Who knows, it may 
be another six years before we finish this thing, and if we keep 
changing the code every time programming style changes, we may 
NEVER finish.

So for now, at least, I've determined to continue the classical 
style in Pascal, though we might indeed discuss objects and object 
orientation as we go.  Likewise, the target machine will remain 
the Motorola 68000 family.  Of all the decisions to be made here, 
this one has been the easiest.  Though I know that many of you 
would like to see code for the 80x86, the 68000 has become, if 
anything, even more popular as a platform for embedded systems, 
and it's to that application that this whole effort began in the 
first place.  Compiling for the PC, MSDOS platform, we'd have to 
deal with all the issues of DOS system calls, DOS linker formats, 
the PC file system and hardware, and all those other complications 
of a DOS environment.  An embedded system, on the other hand, must 
run standalone, and it's for this kind of application, as an 
alternative to assembly language, that I've always imagined that a 
language like KISS would thrive. Anyway, who wants to deal with 
the 80x86 architecture if they don't have to?

The one feature of Turbo Pascal that I'm going to be making heavy 
use of is units.  In the past, we've had to make compromises 
between code size and complexity, and program functionality.  A 
lot of our work has been in the nature of computer 
experimentation, looking at only one aspect of compiler technology 
at a time. We did this to avoid to avoid having to carry around 
large programs, just to investigate simple concepts.  In the 
process, we've re-invented the wheel and re-programmed the same 
functions more times than I'd like to count.  Turbo units provide 
a wonderful way to get functionality and simplicity at the same 
time:  You write reusable code, and invoke it with a single line.  
Your test program stays small, but it can do powerful things.

One feature of Turbo Pascal units is their initialization block.  
As with an Ada package, any code in the main begin-end block of a 
unit gets executed as the program is initialized.  As you'll see 
later, this sometimes gives us neat simplifications in the code.  
Our procedure Init, which has been with us since Installment 1, 
goes away entirely when we use units.  The various routines in the 
Cradle, another key features of our approach, will get distributed 
among the units.

The concept of units, of course, is no different than that of C 
modules.  However, in C (and C++), the interface between modules 
comes via preprocessor include statements and header files.  As 
someone who's had to read a lot of other people's C programs, I've 
always found this rather bewildering.  It always seems that 
whatever data structure you'd like to know about is in some other 
file.  Turbo units are simpler for the very reason that they're 
criticized by some:  The function interfaces and their 
implementation are included in the same file.  While this 
organization may create problems with code security, it also 
reduces the number of files by half, which isn't half bad.  
Linking of the object files is also easy, because the Turbo 
compiler takes care of it without the need for make files or other 
mechanisms.


STARTING OVER?

Four years ago, in Installment 14, I promised you that our days of 
re-inventing the wheel, and recoding the same software over and 
over for each lesson, were over, and that from now on we'd stick 
to more complete programs that we would simply add new features 
to.  I still intend to keep that promise; that's one of the main 
purposes for using units.  However, because of the long time since 
Installment 14, it's natural to want to at least do some review, 
and anyhow, we're going to have to make rather sweeping changes in 
the code to make the transition to units.  Besides, frankly, after 
all this time I can't remember all the neat ideas I had in my head 
four years ago.  The best way for me to recall them is to retrace 
some of the steps we took to arrive at Installment 14.  So I hope 
you'll be understanding and bear with me as we go back to our 
roots, in a sense, and rebuild the core of the software, 
distributing the routines among the various units, and 
bootstrapping ourselves back up to the point we were at lo, those 
many moons ago. As has always been the case, you're going to get  
to see me make all the mistakes and execute changes of direction, 
in real time.  Please bear with me ... we'll start getting to the 
new stuff before you know it.

Since we're going to be using multiple modules in our new 
approach, we have to address the issue of file management.  If 
you've followed all the other sections of this tutorial, you know 
that, as our programs evolve, we're going to be replacing older, 
more simple-minded units with more capable ones. This brings us to 
an issue of version control. There will almost certainly be times 
when we will overlay a simple file (unit), but later wish we had 
the simple one again.  A case in point is embodied in our 
predilection for using single-character variable names, keywords, 
etc., to test concepts without getting bogged down in the details 
of a lexical scanner.  Thanks to the use of units, we will be 
doing much less of this in the future.  Still, I not only suspect, 
but am certain that we will need to save some older versions of 
files, for special purposes, even though they've been replaced by 
newer, more capable ones.

To deal with this problem, I suggest that you create different 
directories, with different versions of the units as needed.  If 
we do this properly, the code in each directory will remain self-
consistent.  I've tentatively created four directories:  SINGLE 
(for single-character experimentation), MULTI (for, of course, 
multi-character versions), TINY, and KISS.

Enough said about philosophy and details.  Let's get on with the 
resurrection of the software.


THE INPUT UNIT

A key concept that we've used since Day 1 has been the idea of an 
input stream with one lookahead character.  All the parsing 
routines examine this character, without changing it, to decide 
what they should do next.  (Compare this approach with the C/Unix 
approach using getchar and unget, and I think you'll agree that 
our approach is simpler). We'll begin our hike into the future by 
translating this concept into our new, unit-based organization.  
The first unit, appropriately called Input, is shown below:


{--------------------------------------------------------------}
unit Input;
{--------------------------------------------------------------}
interface
var Look: char;               { Lookahead character }
procedure GetChar;            { Read new character  }

{--------------------------------------------------------------}
implementation

{--------------------------------------------------------------}
{ Read New Character From Input Stream }

procedure GetChar;
begin
 Read(Look);
end;

{--------------------------------------------------------------}
{ Unit Initialization }
begin
 GetChar;
end.
{--------------------------------------------------------------}


As you can see, there's nothing very profound, and certainly 
nothing complicated, about this unit, since it consists of only a 
single procedure.  But already, we can see how the use of units 
gives us advantages.  Note the executable code in the 
initialization block.  This code "primes the pump" of the input 
stream for us, something we've always had to do before, by 
inserting the call to GetChar in line, or in procedure Init.  This 
time, the call happens without any special reference to it on our 
part, except within the unit itself. As I predicted earlier, this 
mechanism is going to make our lives much simpler as we proceed.
I consider it to be one of the most useful features of Turbo 
Pascal, and I lean on it heavily. 

Copy this unit into your compiler's IDE, and compile it. To test 
the software, of course, we always need a main program.  I used 
the following, really complex test program, which we'll later 
evolve into the Main for our compiler:


{--------------------------------------------------------------}
program Main;
uses WinCRT, Input;
begin
 WriteLn(Look);
end.
{--------------------------------------------------------------}


Note the use of the Borland-supplied unit, WinCRT.  This unit is 
necessary if you intend to use the standard Pascal I/O routines, 
Read, ReadLn, Write, and WriteLn, which of course we intend to do.
If you forget to include this unit in the "uses" clause, you will 
get a really bizarre and indecipherable error message at run time.

Note also that we can access the lookahead character, even though 
it's not declared in the main program.  All variables declared 
within the interface section of a unit are global, but they're 
hidden from prying eyes; to that extent, we get a modicum of 
information hiding.  Of course, if we were writing in an object-
oriented fashion, we should not allow outside modules to access 
the units internal variables.  But, although Turbo units have a 
lot in common with objects, we're not doing object-oriented design 
or code here, so our use of Look is appropriate.

Go ahead and save the test program as Main.pas.  To make life 
easier as we get more and more files, you might want to take this 
opportunity to declare this file as the compiler's Primary file.  
That way, you can execute the program from any file.  Otherwise, 
if you press Cntl-F9 to compile and run from one of the units, 
you'll get an error message.  You set the primary file using the 
main submenu, "Compile," in the Turbo IDE.

I hasten to point out, as I've done before, that the function of 
unit Input is, and always has been, considered to be a dummy 
version of the real thing.  In a production version of a compiler, 
the input stream will, of course, come from a file rather than 
from the keyboard.  And it will almost certainly include line 
buffering, at the very least, and more likely, a rather large text 
buffer to support efficient disk I/O.  The nice part about the 
unit approach is that, as with objects, we can modify the code in 
the unit to be as simple or as sophisticated as we like. As long 
as the interface, as embodied in the public procedures and the 
lookahead character, don't change, the rest of the program is 
totally unaffected.  And since units are compiled, rather than 
merely included, the time required to link with them is virtually 
nil.  Again, the result is that we can get all the benefits of 
sophisticated implementations, without having to carry the code 
around as so much baggage.

In later installments, I intend to provide a full-blown IDE for 
the KISS compiler, using a true Windows application generated by 
Borland's OWL applications framework.  For now, though, we'll obey 
my #1 rule to live by:  Keep It Simple.



THE OUTPUT UNIT

Of course, every decent program should have output, and ours is no 
exception.  Our output routines included the Emit functions.  The 
code for the corresponding output unit is shown next:


{--------------------------------------------------------------}
unit Output;
{--------------------------------------------------------------}
interface
procedure Emit(s: string);   { Emit an instruction  }
procedure EmitLn(s: string);  { Emit an instruction line }

{--------------------------------------------------------------}
implementation
const TAB = ^I;

{--------------------------------------------------------------}
{ Emit an Instruction }

procedure Emit(s: string);
begin
 Write(TAB, s);
end;

{--------------------------------------------------------------}
{ Emit an Instruction, Followed By a Newline }

procedure EmitLn(s: string);
begin
 Emit(s);
 WriteLn;
end;

end.
{--------------------------------------------------------------}


(Notice that this unit has no initialization clause, so it needs 
no begin-block.)
 
Test this unit with the following main program:

{--------------------------------------------------------------}
program Test;
uses WinCRT, Input, Output, Scanner, Parser;
begin
 WriteLn('MAIN:");
 EmitLn('Hello, world!');
end.
{--------------------------------------------------------------}

Did you see anything that surprised you?  You may have been 
surprised to see that you needed to type something, even though 
the main program requires no input.  That's because of the 
initialization in unit Input, which still requires something to 
put into the lookahead character.  Sorry, there's no way out of 
that box, or rather, we don't _WANT_ to get out. Except for simple 
test cases such as this, we will always want a valid lookahead 
character, so the right thing to do about this "problem" is ... 
nothing.

Perhaps more surprisingly, notice that the TAB character had no 
effect; our line of "instructions" begins at column 1, same as the 
fake label.  That's right:  WinCRT doesn't support tabs. We have a 
problem.

There are a few ways we can deal with this problem. The one thing 
we can't do is to simply ignore it.  Every assembler I've ever 
used reserves column 1 for labels, and will rebel to see 
instructions starting there.  So, at the very least, we must space 
the instructions over one column to keep the assembler happy.  .  
That's easy enough to do:  Simply change, in procedure Emit, the 
line:

 Write(TAB, s);
 
by:

 Write(' ', s);

I must admit that I've wrestled with this problem before, and find 
myself changing my mind as often as a chameleon changes color.  
For the purposes we're going to be using, 99% of which will be 
examining the output code as it's displayed on a CRT, it would be 
nice to see neatly blocked out "object" code.  The line:

SUB1:  MOVE #4,D0

just plain looks neater than the different, but functionally 
identical code,

SUB1:
 MOVE #4,D0

In test versions of my code, I included a more sophisticated 
version of the procedure PostLabel, that avoids having labels on 
separate lines, but rather defers the printing of a label so it 
can end up on the same line as the associated instruction.  As 
recently as an hour ago, my version of unit Output provided full 
support for tabs, using an internal column count variable and 
software to manage it.  I had, if I do say so myself, some rather 
elegant code to support the tab mechanism, with a minimum of code 
bloat. It was awfully tempting to show you the "prettyprint" 
version, if for no other reason than to show off the elegance.

Nevertheless, the code of the "elegant" version was considerably 
more complex and larger.  Since then, I've had second thoughts. In 
spite of our desire to see pretty output, the inescapable fact is 
that the two versions of the MAIN: code fragment shown above are 
functionally identical; the assembler, which is the ultimate 
destination of the code, couldn't care less which version it gets, 
except that the prettier version will contain more characters, 
therefore will use more disk space and take longer to assemble.   
but the prettier one not only takes more code to generate, but 
will create a larger output file, with many more space characters 
than the minimum needed.  When you look at it that way, it's not 
very hard to decide which approach to use, is it?

What finally clinched the issue for me was a reminder to consider 
my own first commandment: KISS.  Although I was pretty proud of 
all my elegant little tricks to implement tabbing, I had to remind 
myself that, to paraphrase Senator Barry Goldwater, elegance in 
the pursuit of complexity is no virtue.  Another wise man once 
wrote, "Any idiot can design a Rolls-Royce. It takes a genius to 
design a VW."  So the elegant, tab-friendly version of Output is 
history, and what you see is the simple, compact, VW version.


THE ERROR UNIT

Our next set of routines are those that handle errors.  To refresh 
your memory, we take the approach, pioneered by Borland in Turbo 
Pascal, of halting on the first error.  Not only does this greatly 
simplify our code, by completely avoiding the sticky issue of 
error recovery, but it also makes much more sense, in my opinion, 
in an interactive environment.  I know this may be an extreme 
position, but I consider the practice of reporting all errors in a 
program to be an anachronism, a holdover from the days of batch 
processing.  It's time to scuttle the practice.  So there.

In our original Cradle, we had two error-handling procedures: 
Error, which didn't halt, and Abort, which did.  But I don't think 
we ever found a use for the procedure that didn't halt, so in the 
new, lean and mean unit Errors, shown next, procedure Error takes 
the place of Abort.


{--------------------------------------------------------------}
unit Errors;
{--------------------------------------------------------------}
interface
procedure Error(s: string);
procedure Expected(s: string);

{--------------------------------------------------------------}
implementation

{--------------------------------------------------------------}
{ Write error Message and Halt }

procedure Error(s: string);
begin
 WriteLn;
 WriteLn(^G, 'Error: ', s, '.');
 Halt;
end;

{--------------------------------------------------------------}
{ Write "<something> Expected" }

procedure Expected(s: string);
begin
 Error(s + ' Expected');
end;

end.
{--------------------------------------------------------------}


As usual, here's a test program:




{--------------------------------------------------------------}
program Test;
uses WinCRT, Input, Output, Errors;

begin
 Expected('Integer');
end.
{--------------------------------------------------------------}

Have you noticed that the "uses" line in our main program keeps 
getting longer?  That's OK. In the final version, the main program 
will only call procedures in our parser, so its use clause will 
only have a couple of entries. But for now, it's probably best to 
include all the units so we can test procedures in them.


SCANNING AND PARSING

The classical compiler architecture consists of separate modules 
for the lexical scanner, which supplies tokens in the language, 
and the parser, which tries to make sense of the tokens as syntax 
elements.  If you can still remember what we did in earlier 
installments, you'll recall that we didn't do things that way.  
Because we're using a predictive parser, we can almost always tell 
what language element is coming next, just by examining the 
lookahead character.  Therefore, we found no need to prefetch 
tokens, as a scanner would do.

But, even though there is no functional procedure called 
"Scanner," it still makes sense to separate the scanning functions 
from the parsing functions.  So I've created two more units 
called, amazingly enough, Scanner and Parser.  The Scanner unit 
contains all of the routines known as recognizers.  Some of these, 
such as IsAlpha, are pure boolean routines which operate on the 
lookahead character only.  The other routines are those which 
collect tokens, such as identifiers and numeric constants. The 
Parser unit will contain all of the routines making up the 
recursive-descent parser.  The general rule should be that unit 
Parser contains all of the information that is language-specific; 
in other words, the syntax of the language should be wholly 
contained in Parser.  In an ideal world, this rule should be true 
to the extent that we can change the compiler to compile a 
different language, merely by replacing the single unit, Parser. 

In practice, things are almost never this pure.  There's always a 
small amount of "leakage" of syntax rules into the scanner as 
well.  For example, the rules concerning what makes up a legal 
identifier or constant may vary from language to language.  In 
some languages, the rules concerning comments permit them to be 
filtered by the scanner, while in others they do not. So in 
practice, both units are likely to end up having language-
dependent components, but the changes required to the scanner 
should be relatively trivial. 

Now, recall that we've used two versions of the scanner routines: 
One that handled only single-character tokens, which we used for a 
number of our tests, and another that provided full support for 
multi-character tokens.  Now that we have our software separated 
into units, I don't anticipate getting much use out of the single-
character version, but it doesn't cost us much to provide for 
both.  I've created two versions of the Scanner unit.  The first 
one, called Scanner1, contains the single-digit version of the 
recognizers:


{--------------------------------------------------------------}
unit Scanner1;
{--------------------------------------------------------------}
interface
uses Input, Errors;

function IsAlpha(c: char): boolean;
function IsDigit(c: char): boolean;
function IsAlNum(c: char): boolean;
function IsAddop(c: char): boolean;
function IsMulop(c: char): boolean;

procedure Match(x: char);
function GetName: char;
function GetNumber: char;

{--------------------------------------------------------------}
implementation

{--------------------------------------------------------------}
{ Recognize an Alpha Character }

function IsAlpha(c: char): boolean;
begin
 IsAlpha := UpCase(c) in ['A'..'Z'];
end;

{--------------------------------------------------------------}
{ Recognize a Numeric Character }

function IsDigit(c: char): boolean;
begin
 IsDigit := c in ['0'..'9'];
end;

{--------------------------------------------------------------}
{ Recognize an Alphanumeric Character }

function IsAlnum(c: char): boolean;
begin
 IsAlnum := IsAlpha(c) or IsDigit(c);
end;

{--------------------------------------------------------------}
{ Recognize an Addition Operator }

function IsAddop(c: char): boolean;
begin
 IsAddop := c in ['+','-'];
end;

{--------------------------------------------------------------}
{ Recognize a Multiplication Operator }

function IsMulop(c: char): boolean;
begin
 IsMulop := c in ['*','/'];
end;

{--------------------------------------------------------------}
{ Match One Character }

procedure Match(x: char);
begin
 if Look = x then GetChar
 else Expected('''' + x + '''');
end;

{--------------------------------------------------------------}
{ Get an Identifier }

function GetName: char;
begin
 if not IsAlpha(Look) then Expected('Name');
 GetName := UpCase(Look);
 GetChar;
end;

{--------------------------------------------------------------}
{ Get a Number }

function GetNumber: char;
begin
 if not IsDigit(Look) then Expected('Integer');
 GetNumber := Look;
 GetChar;
end;

end.
{--------------------------------------------------------------}


The following code fragment of the main program provides a good 
test of the scanner.  For brevity, I'll only include the 
executable code here; the rest remains the same.  Don't forget, 
though, to add the name Scanner1 to the "uses" clause.

 Write(GetName);
 Match('=');
 Write(GetNumber);
 Match('+');
 WriteLn(GetName);

This code will recognize all sentences of the form:

 x=0+y

where x and y can be any single-character variable names, and 0 
any digit.  The code should reject all other sentences, and give a 
meaningful error message. If it did, you're in good shape and we 
can proceed.


THE SCANNER UNIT

The next, and by far the most important, version of the scanner is 
the one that handles the multi-character tokens that all real 
languages must have.  Only the two functions, GetName and 
GetNumber, change between the two units, but just to be sure there 
are no mistakes, I've reproduced the entire unit here.  This is 
unit Scanner:


{--------------------------------------------------------------}
unit Scanner;
{--------------------------------------------------------------}
interface
uses Input, Errors;

function IsAlpha(c: char): boolean;
function IsDigit(c: char): boolean;
function IsAlNum(c: char): boolean;
function IsAddop(c: char): boolean;
function IsMulop(c: char): boolean;

procedure Match(x: char);
function GetName: string;
function GetNumber: longint;

{--------------------------------------------------------------}
implementation

{--------------------------------------------------------------}
{ Recognize an Alpha Character }

function IsAlpha(c: char): boolean;
begin
 IsAlpha := UpCase(c) in ['A'..'Z'];
end;

{--------------------------------------------------------------}
{ Recognize a Numeric Character }

function IsDigit(c: char): boolean;
begin
 IsDigit := c in ['0'..'9'];
end;

{--------------------------------------------------------------}
{ Recognize an Alphanumeric Character }

function IsAlnum(c: char): boolean;
begin
 IsAlnum := IsAlpha(c) or IsDigit(c);
end;

{--------------------------------------------------------------}
{ Recognize an Addition Operator }

function IsAddop(c: char): boolean;
begin
 IsAddop := c in ['+','-'];
end;

{--------------------------------------------------------------}
{ Recognize a Multiplication Operator }

function IsMulop(c: char): boolean;
begin
 IsMulop := c in ['*','/'];
end;

{--------------------------------------------------------------}
{ Match One Character }

procedure Match(x: char);
begin
 if Look = x then GetChar
 else Expected('''' + x + '''');
end;

{--------------------------------------------------------------}
{ Get an Identifier }

function GetName: string;
var n: string;
begin
 n := '';
 if not IsAlpha(Look) then Expected('Name');
 while IsAlnum(Look) do begin
  n := n + Look;
  GetChar;
 end;
 GetName := n;
end;

{--------------------------------------------------------------}
{ Get a Number }

function GetNumber: string;
var n: string;
begin
 n := '';
 if not IsDigit(Look) then Expected('Integer');
 while IsDigit(Look) do begin
  n := n + Look;
  GetChar;
 end;
 GetNumber := n;
end;

end.
{--------------------------------------------------------------}


The same test program will test this scanner, also. Simply change 
the "uses" clause to use Scanner instead of Scanner1.  Now you 
should be able to type multi-character names and numbers.


DECISIONS, DECISIONS

In spite of the relative simplicity of both scanners, a lot of 
thought has gone into them, and a lot of decisions had to be made.  
I'd like to share those thoughts with you now so you can make your 
own educated decision, appropriate for your application.  First, 
note that both versions of GetName translate the input characters 
to upper case.  Obviously, there was a design decision made here, 
and this is one of those cases where the language syntax splatters 
over into the scanner.  In the C language, the case of characters 
in identifiers is significant.  For such a language, we obviously 
can't map the characters to upper case.  The design I'm using 
assumes a language like Pascal, where the case of characters 
doesn't matter.  For such languages, it's easier to go ahead and 
map all identifiers to upper case in the scanner, so we don't have 
to worry later on when we're comparing strings for equality.

We could have even gone a step further, and map the characters to 
upper case right as they come in, in GetChar.  This approach works 
too, and I've used it in the past, but it's too confining. 
Specifically, it will also map characters that may be part of 
quoted strings, which is not a good idea.  So if you're going to 
map to upper case at all, GetName is the proper place to do it.

Note that the function GetNumber in this scanner returns a string, 
just as GetName does.  This is another one of those things I've 
oscillated about almost daily, and the last swing was all of ten 
minutes ago.  The alternative approach, and one I've used many 
times in past installments, returns an integer result.

Both approaches have their good points. Since we're fetching a 
number, the approach that immediately comes to mind is to return 
it as an integer.  But bear in mind that the eventual use of the 
number will be in a write statement that goes back to the outside 
world.  Someone -- either us or the code hidden inside the write 
statement -- is going to have to convert the number back to a 
string again.  Turbo Pascal includes such string conversion 
routines, but why use them if we don't have to?  Why convert a 
number from string to integer form, only to convert it right back 
again in the code generator, only a few statements later?

Furthermore, as you'll soon see, we're going to need a temporary 
storage spot for the value of the token we've fetched. If we treat 
the number in its string form, we can store the value of either a 
variable or a number in the same string.  Otherwise, we'll have to 
create a second, integer variable.

On the other hand, we'll find that carrying the number as a string 
virtually eliminates any chance of optimization later on.  As we 
get to the point where we are beginning to concern ourselves with 
code generation, we'll encounter cases in which we're doing 
arithmetic on constants.  For such cases, it's really foolish to 
generate code that performs the constant arithmetic at run time.  
Far better to let the parser do the arithmetic at compile time, 
and merely code the result.  To do that, we'll wish we had the 
constants stored as integers rather than strings.

What finally swung me back over to the string approach was an 
aggressive application of the KISS test, plus reminding myself 
that we've studiously avoided issues of code efficiency.  One of 
the things that makes our simple-minded parsing work, without the 
complexities of a "real" compiler, is that we've said up front 
that we aren't concerned about code efficiency.  That gives us a 
lot of freedom to do things the easy way rather than the efficient 
one, and it's a freedom we must be careful not to abandon 
voluntarily, in spite of the urges for efficiency shouting in our 
ear.  In addition to being a big believer in the KISS philosophy, 
I'm also an advocate of "lazy programming," which in this context 
means, don't program anything until you need it.  As P.J. Plauger 
says, "Never put off until tomorrow what you can put off 
indefinitely."  Over the years, much code has been written to 
provide for eventualities that never happened.  I've learned that 
lesson myself, from bitter experience.  So the bottom line is:  We 
won't convert to an integer here because we don't need to.  It's 
as simple as that.

For those of you who still think we may need the integer version 
(and indeed we may), here it is:


{--------------------------------------------------------------}
{ Get a Number (integer version) }

function GetNumber: longint;
var n: longint;
begin
 n := 0;
 if not IsDigit(Look) then Expected('Integer');
 while IsDigit(Look) do begin
  n := 10 * n + (Ord(Look) - Ord('0'));
  GetChar;
 end;
 GetNumber := n;
end;
{--------------------------------------------------------------}

You might file this one away, as I intend to, for a rainy day.


PARSING

At this point, we have distributed all the routines that made up 
our Cradle into units that we can draw upon as we need them.  
Obviously, they will evolve further as we continue the process of 
bootstrapping ourselves up again, but for the most part their 
content, and certainly the architecture that they imply, is 
defined.  What remains is to embody the language syntax into the 
parser unit.  We won't do much of that in this installment, but I 
do want to do a little, just to leave us with the good feeling 
that we still know what we're doing.  So before we go, let's 
generate just enough of a parser to process single factors in an 
expression.  In the process, we'll also, by necessity, find we 
have created a code generator unit, as well.

Remember the very first installment of this series?  We read an 
integer value, say n, and generated the code to load it into the 
D0 register via an immediate move:

 MOVE #n,D0

Shortly afterwards, we repeated the process for a variable, 

 MOVE X(PC),D0

and then for a factor that could be either constant or variable.
For old times sake, let's revisit that process.  Define the 
following new unit:


{--------------------------------------------------------------}
unit Parser;
{--------------------------------------------------------------}
interface
uses Input, Scanner, Errors, CodeGen;
procedure Factor;

{--------------------------------------------------------------}
implementation

{--------------------------------------------------------------}
{ Parse and Translate a Factor }

procedure Factor;
begin
 LoadConstant(GetNumber);
end;

end.
{--------------------------------------------------------------}


As you can see, this unit calls a procedure, LoadConstant, which 
actually effects the output of the assembly-language code.  The 
unit also uses a new unit, CodeGen.  This step represents the last 
major change in our architecture, from earlier installments: The 
removal of the machine-dependent code to a separate unit. If I 
have my way, there will not be a single line of code, outside of 
CodeGen, that betrays the fact that we're targeting the 68000 CPU.  
And this is one place I think that having my way is quite 
feasible.  

For those of you who wish I were using the 80x86 architecture (or 
any other one) instead of the 68000, here's your answer:  Merely 
replace CodeGen with one suitable for your CPU of choice.

So far, our code generator has only one procedure in it.  Here's 
the unit:


{--------------------------------------------------------------}
unit CodeGen;

{--------------------------------------------------------------}
interface
uses Output;
procedure LoadConstant(n: string);

{--------------------------------------------------------------}
implementation

{--------------------------------------------------------------}
{ Load the Primary Register with a Constant }

procedure LoadConstant(n: string);
begin
 EmitLn('MOVE #' + n + ',D0' );
end;

end.
{--------------------------------------------------------------}


Copy and compile this unit, and execute the following main 
program:

{--------------------------------------------------------------}
program Main;
uses WinCRT, Input, Output, Errors, Scanner, Parser;
begin
 Factor;
end.
{--------------------------------------------------------------}


There it is, the generated code, just as we hoped it would be.

Now, I hope you can begin to see the advantage of the unit-based 
architecture of our new design.  Here we have a main program 
that's all of five lines long. That's all of the program we need 
to see, unless we choose to see more.  And yet, all those units 
are sitting there, patiently waiting to serve us.  We can have our 
cake and eat it too, in that we have simple and short code, but 
powerful allies.  What remains to be done is to flesh out the 
units to match the capabilities of earlier installments.  We'll do 
that in the next installment, but before I close, let's finish out 
the parsing of a factor, just to satisfy ourselves that we still 
know how.  The final version of CodeGen includes the new 
procedure, LoadVariable:

{--------------------------------------------------------------}
unit CodeGen;

{--------------------------------------------------------------}
interface
uses Output;
procedure LoadConstant(n: string);
procedure LoadVariable(Name: string);

{--------------------------------------------------------------}
implementation

{--------------------------------------------------------------}
{ Load the Primary Register with a Constant }

procedure LoadConstant(n: string);
begin
 EmitLn('MOVE #' + n + ',D0' );
end;

{--------------------------------------------------------------}
{ Load a Variable to the Primary Register }

procedure LoadVariable(Name: string);
begin
 EmitLn('MOVE ' + Name + '(PC),D0');
end;


end.
{--------------------------------------------------------------}


The parser unit itself doesn't change, but we have a more complex 
version of procedure Factor:

{--------------------------------------------------------------}
{ Parse and Translate a Factor }

procedure Factor;
begin
 if IsDigit(Look) then
  LoadConstant(GetNumber)
 else if IsAlpha(Look)then
  LoadVariable(GetName)
 else
  Error('Unrecognized character ' + Look);
end;
{--------------------------------------------------------------}

 
Now, without altering the main program, you should find that our 
program will process either a variable or a constant factor.  At 
this point, our architecture is almost complete; we have units to 
do all the dirty work, and enough code in the parser and code 
generator to demonstrate that everything works.  What remains is 
to flesh out the units we've defined, particularly the parser and 
code generator, to support the more complex syntax elements that 
make up a real language.  Since we've done this many times before 
in earlier installments, it shouldn't take long to get us back to 
where we were before the long hiatus.  We'll continue this process 
in Installment 16, coming soon.  See you then.



REFERENCES

1. Crenshaw, J.W., "Object-Oriented Design of Assemblers and 
Compilers," Proc. Software Development '91 Conference, Miller 
Freeman, San Francisco, CA, February 1991, pp. 143-155.

2. Crenshaw, J.W., "A Perfect Marriage," Computer Language, Volume 
8, #6, June 1991, pp. 44-55.

3. Crenshaw, J.W., "Syntax-Driven Object-Oriented Design," Proc. 
1991 Embedded Systems Conference, Miller Freeman, San 
Francisco, CA, September 1991, pp. 45-60.


*****************************************************************
*                                                               *
*                        COPYRIGHT NOTICE                       *
*                                                               *
*   Copyright (C) 1994 Jack W. Crenshaw. All rights reserved.   *  
*                                                               *
*                                                               *
*****************************************************************

--
--/*\ Jon A. Lambert - TychoMUD     Internet:jlsysinc at ix.netcom.com /*\--
--/*\ Mud Server Developer's Page <http://www.netcom.com/~jlsysinc> /*\--
--/*\   "Everything that deceives may be said to enchant" - Plato   /*\--





More information about the mud-dev-archive mailing list