Tutorial: Let's build a Compiler! - Part III: More Expressions

Jon A. Lambert Jon.A.Lambert at ix.netcom.com
Fri Feb 27 18:27:27 CET 1998


                     LET'S BUILD A COMPILER!

                                By

                     Jack W. Crenshaw, Ph.D.

                            4 Aug 1988


                    Part III: MORE EXPRESSIONS


*****************************************************************
*                                                               *
*                        COPYRIGHT NOTICE                       *
*                                                               *
*   Copyright (C) 1988 Jack W. Crenshaw. All rights reserved.   *
*                                                               *
*****************************************************************


INTRODUCTION

In the last installment, we examined the techniques used to parse
and  translate a general math expression.  We  ended  up  with  a
simple parser that  could handle arbitrarily complex expressions,
with two restrictions:

  o No variables were allowed, only numeric factors

  o The numeric factors were limited to single digits

In this installment, we'll get  rid of those restrictions.  We'll
also extend what  we've  done  to  include  assignment statements
function  calls  and.    Remember,   though,   that   the  second
restriction was  mainly self-imposed  ... a choice of convenience
on our part, to make life easier and to let us concentrate on the
fundamental concepts.    As  you'll  see  in  a bit, it's an easy
restriction to get rid of, so don't get  too  hung  up  about it.
We'll use the trick when it serves us to do so, confident that we
can discard it when we're ready to.


VARIABLES

Most expressions  that we see in practice involve variables, such
as

               b * b + 4 * a * c

No  parser is much good without being able  to  deal  with  them.
Fortunately, it's also quite easy to do.

Remember that in our parser as it currently stands, there are two
kinds of  factors  allowed:  integer  constants  and  expressions
within parentheses.  In BNF notation,

     <factor> ::= <number> | (<expression>)

The '|' stands for "or", meaning of course that either form  is a
legal form for a factor.   Remember,  too, that we had no trouble
knowing which was which  ...  the  lookahead  character is a left
paren '(' in one case, and a digit in the other.
                              
It probably won't come as too much of a surprise that  a variable
is just another kind of factor.    So  we extend the BNF above to
read:


     <factor> ::= <number> | (<expression>) | <variable>


Again, there is no  ambiguity:  if  the  lookahead character is a
letter,  we  have  a variable; if a digit, we have a number. Back
when we translated the number, we just issued code  to  load  the
number,  as immediate data, into D0.  Now we do the same, only we
load a variable.

A minor complication in the  code generation arises from the fact
that most  68000 operating systems, including the SK*DOS that I'm
using, require the code to be  written  in "position-independent"
form, which  basically means that everything is PC-relative.  The
format for a load in this language is

               MOVE X(PC),D0

where X is, of course, the variable name.  Armed with that, let's
modify the current version of Factor to read:


{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }

procedure Expression; Forward;

procedure Factor;
begin
   if Look = '(' then begin
      Match('(');
      Expression;
      Match(')');
      end
   else if IsAlpha(Look) then
      EmitLn('MOVE ' + GetName + '(PC),D0')
   else
      EmitLn('MOVE #' + GetNum + ',D0');
end;
{--------------------------------------------------------------}


I've  remarked before how easy it is to  add  extensions  to  the
parser, because of  the  way  it's  structured.  You can see that
this  still  holds true here.  This time it cost us  all  of  two
extra lines of code.  Notice, too, how the if-else-else structure
exactly parallels the BNF syntax equation.

OK, compile and test this new version of the parser.  That didn't
hurt too badly, did it?
                              

FUNCTIONS

There is only one  other  common kind of factor supported by most
languages: the function call.  It's really too early  for  us  to
deal with functions well,  because  we  haven't yet addressed the
issue of parameter passing.  What's more, a "real" language would
include a mechanism to  support  more than one type, one of which
should be a function type.  We haven't gotten there  yet, either.
But I'd still like to deal with functions  now  for  a  couple of
reasons.    First,  it  lets  us  finally  wrap  up the parser in
something very close to its final form, and second, it  brings up
a new issue which is very much worth talking about.

Up  till  now,  we've  been  able  to  write  what  is  called  a
"predictive parser."  That  means  that at any point, we can know
by looking at the current  lookahead character exactly what to do
next.  That isn't the case when we add functions.  Every language
has some naming rules  for  what  constitutes a legal identifier.
For the present, ours is simply that it  is  one  of  the letters
'a'..'z'.  The  problem  is  that  a variable name and a function
name obey  the  same  rules.   So how can we tell which is which?
One way is to require that they each be declared before  they are
used.    Pascal  takes that approach.  The other is that we might
require a function to be followed by a (possibly empty) parameter
list.  That's the rule used in C.

Since  we  don't  yet have a mechanism for declaring types, let's
use the C  rule for now.  Since we also don't have a mechanism to
deal  with parameters, we can only handle  empty  lists,  so  our
function calls will have the form

                    x()  .

Since  we're  not  dealing  with  parameter lists yet,  there  is
nothing  to do but to call the function, so we need only to issue
a BSR (call) instead of a MOVE.

Now that there are two  possibilities for the "If IsAlpha" branch
of the test in Factor, let's treat them in a  separate procedure.
Modify Factor to read:


{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }

procedure Expression; Forward;

procedure Factor;
begin
   if Look = '(' then begin
      Match('(');
      Expression;
      Match(')');
      end
   else if IsAlpha(Look) then
      Ident
   else
      EmitLn('MOVE #' + GetNum + ',D0');
end;
{--------------------------------------------------------------}


and insert before it the new procedure


{---------------------------------------------------------------}
{ Parse and Translate an Identifier }

procedure Ident;
var Name: char;
begin
   Name := GetName;
   if Look = '(' then begin
      Match('(');
      Match(')');
      EmitLn('BSR ' + Name);
      end
   else
      EmitLn('MOVE ' + Name + '(PC),D0')
end;
{---------------------------------------------------------------}


OK, compile and  test  this  version.  Does  it  parse  all legal
expressions?  Does it correctly flag badly formed ones?

The important thing to notice is that even though  we  no  longer
have  a predictive parser, there is  little  or  no  complication
added with the recursive descent approach that we're  using.   At
the point where  Factor  finds an identifier (letter), it doesn't
know whether it's a variable name or a function name, nor does it
really care.  It simply passes it on to Ident and leaves it up to
that procedure to figure it out.  Ident, in  turn,  simply  tucks
away the identifier and then reads one more  character  to decide
which kind of identifier it's dealing with.

Keep this approach in mind.  It's a very powerful concept, and it
should be used  whenever  you  encounter  an  ambiguous situation
requiring further lookahead.   Even  if  you  had to look several
tokens ahead, the principle would still work.


MORE ON ERROR HANDLING

As long as we're talking  philosophy,  there's  another important
issue to point out:  error  handling.    Notice that although the
parser correctly rejects (almost)  every malformed  expression we
can  throw at it, with a meaningful  error  message,  we  haven't
really had to  do much work to make that happen.  In fact, in the
whole parser per se (from  Ident  through  Expression)  there are
only two calls to the error routine, Expected.  Even those aren't
necessary ... if you'll look again in Term and Expression, you'll
see that those statements can't be reached.  I put them  in early
on as a  bit  of  insurance,  but  they're no longer needed.  Why
don't you delete them now?

So how did we get this nice error handling  virtually  for  free?
It's simply  that  I've  carefully  avoided  reading  a character
directly  using  GetChar.  Instead,  I've  relied  on  the  error
handling in GetName,  GetNum,  and  Match  to  do  all  the error
checking for me.    Astute  readers  will notice that some of the
calls to Match (for example, the ones in Add  and  Subtract)  are
also unnecessary ... we already know what the character is by the
time  we get there ... but it maintains  a  certain  symmetry  to
leave them in, and  the  general rule to always use Match instead
of GetChar is a good one.

I mentioned an "almost" above.   There  is a case where our error
handling  leaves a bit to be desired.  So far we haven't told our
parser what and  end-of-line  looks  like,  or  what  to  do with
embedded  white  space.  So  a  space  character  (or  any  other
character not part of the recognized character set) simply causes
the parser to terminate, ignoring the unrecognized characters.

It  could  be  argued  that  this is reasonable behavior at  this
point.  In a "real"  compiler, there is usually another statement
following the one we're working on, so any characters not treated
as part of our expression will either be used for or  rejected as
part of the next one.

But  it's  also a very easy thing to fix up, even  if  it's  only
temporary.   All  we  have  to  do  is assert that the expression
should end with an end-of-line , i.e., a carriage return.

To see what I'm talking about, try the input line

               1+2 <space> 3+4

See  how the space was treated as a terminator?  Now, to make the
compiler properly flag this, add the line

               if Look <> CR then Expected('Newline');

in the main  program,  just  after  the call to Expression.  That
catches anything left over in the input stream.  Don't  forget to
define CR in the const statement:

               CR = ^M;

As usual, recompile the program and verify that it does what it's
supposed to.


ASSIGNMENT STATEMENTS

OK,  at  this  point we have a parser that works very nicely. I'd
like to  point  out  that  we  got  it  using  only  88  lines of
executable code, not  counting  what  was  in  the  cradle.   The
compiled  object  file  is  a  whopping  4752  bytes.   Not  bad,
considering we weren't trying very  hard  to  save  either source
code or object size.  We just stuck to the KISS principle.

Of course, parsing an expression  is not much good without having
something to do with it afterwards.  Expressions USUALLY (but not
always) appear in assignment statements, in the form

          <Ident> = <Expression>

We're only a breath  away  from being able to parse an assignment
statement, so let's take that  last  step.  Just  after procedure
Expression, add the following new procedure:


{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }

procedure Assignment;
var Name: char;
begin
   Name := GetName;
   Match('=');
   Expression;
   EmitLn('LEA ' + Name + '(PC),A0');
   EmitLn('MOVE D0,(A0)')
end;
{--------------------------------------------------------------}


Note again that the  code  exactly parallels the BNF.  And notice
further that  the error checking was painless, handled by GetName
and Match.

The reason for the two  lines  of  assembler  has  to  do  with a
peculiarity in the  68000,  which requires this kind of construct
for PC-relative code.

Now change the call to Expression, in the main program, to one to
Assignment.  That's all there is to it.

Son of a gun!  We are actually  compiling  assignment statements.
If those were the only kind of statements in a language, all we'd
have to  do  is  put  this in a loop and we'd have a full-fledged
compiler!

Well, of course they're not the only kind.  There are also little
items  like  control  statements  (IFs  and  loops),  procedures,
declarations, etc.  But cheer  up.    The  arithmetic expressions
that we've been dealing with are among the most challenging  in a
language.      Compared  to  what  we've  already  done,  control
statements  will be easy.  I'll be covering  them  in  the  fifth
installment.  And the other statements will all fall in  line, as
long as we remember to KISS.


MULTI-CHARACTER TOKENS

Throughout  this   series,   I've   been   carefully  restricting
everything  we  do  to  single-character  tokens,  all  the while
assuring  you  that  it wouldn't be difficult to extend to multi-
character ones.    I  don't  know if you believed me or not ... I
wouldn't  really blame you if you were a  bit  skeptical.    I'll
continue  to use  that approach in  the  sessions  which  follow,
because it helps keep complexity away.    But I'd like to back up
those  assurances, and wrap up this portion  of  the  parser,  by
showing you  just  how  easy  that  extension  really is.  In the
process, we'll also provide for embedded white space.  Before you
make  the  next  few changes, though, save the current version of
the parser away under another name.  I have some more uses for it
in  the  next  installment, and we'll be working with the single-
character version.

Most compilers separate out the handling of the input stream into
a separate module called  the  lexical scanner.  The idea is that
the scanner deals with all the character-by-character  input, and
returns the separate units  (tokens)  of  the  stream.  There may
come a time when we'll want  to  do something like that, too, but
for  now  there  is  no  need. We can handle the  multi-character
tokens that we need by very slight and  very  local modifications
to GetName and GetNum.

The usual definition of an identifier is that the first character
must be a letter, but the rest can be  alphanumeric  (letters  or
numbers).  To  deal  with  this,  we  need  one  other recognizer
function


{--------------------------------------------------------------}
{ Recognize an Alphanumeric }

function IsAlNum(c: char): boolean;
begin
   IsAlNum := IsAlpha(c) or IsDigit(c);
end;
{--------------------------------------------------------------}


Add this function to your parser.  I put mine just after IsDigit.
While you're  at  it,  might  as  well  include it as a permanent
member of Cradle, too.
                              
Now, we need  to  modify  function  GetName  to  return  a string
instead of a character:


{--------------------------------------------------------------}
{ Get an Identifier }

function GetName: string;
var Token: string;
begin
   Token := '';
   if not IsAlpha(Look) then Expected('Name');
   while IsAlNum(Look) do begin
      Token := Token + UpCase(Look);
      GetChar;
   end;
   GetName := Token;
end;
{--------------------------------------------------------------}


Similarly, modify GetNum to read:


{--------------------------------------------------------------}
{ Get a Number }

function GetNum: string;
var Value: string;
begin
   Value := '';
   if not IsDigit(Look) then Expected('Integer');
   while IsDigit(Look) do begin
      Value := Value + Look;
      GetChar;
   end;
   GetNum := Value;
end;
{--------------------------------------------------------------}


Amazingly enough, that  is  virtually all the changes required to
the  parser!  The local variable Name  in  procedures  Ident  and
Assignment was originally declared as  "char",  and  must  now be
declared string[8].  (Clearly,  we  could  make the string length
longer if we chose, but most assemblers limit the length anyhow.)
Make  this  change,  and  then  recompile and test. _NOW_ do  you
believe that it's a simple change?


WHITE SPACE

Before we leave this parser for awhile, let's  address  the issue
of  white  space.   As it stands now, the parser  will  barf  (or
simply terminate) on a single space  character  embedded anywhere
in  the input stream.  That's pretty  unfriendly  behavior.    So
let's "productionize" the thing  a  bit  by eliminating this last
restriction.

The  key  to easy handling of white space is to come  up  with  a
simple rule for how the parser should treat the input stream, and
to  enforce that rule everywhere.  Up  till  now,  because  white
space wasn't permitted, we've been able to assume that after each
parsing action, the lookahead character  Look  contains  the next
meaningful  character,  so  we could test it  immediately.    Our
design was based upon this principle.

It still sounds like a good rule to me, so  that's  the one we'll
use.    This  means  that  every routine that advances the  input
stream must skip over white space, and leave  the  next non-white
character in Look.   Fortunately,  because  we've been careful to
use GetName, GetNum, and Match  for most of our input processing,
it is  only  those  three  routines  (plus  Init) that we need to
modify.

Not  surprisingly,  we  start  with  yet  another  new recognizer
routine:


{--------------------------------------------------------------}
{ Recognize White Space }

function IsWhite(c: char): boolean;
begin
   IsWhite := c in [' ', TAB];
end;
{--------------------------------------------------------------}


We  also need a routine that  will  eat  white-space  characters,
until it finds a non-white one:


{--------------------------------------------------------------}
{ Skip Over Leading White Space }

procedure SkipWhite;
begin
   while IsWhite(Look) do
      GetChar;
end;
{--------------------------------------------------------------}


Now,  add calls to SkipWhite to Match,  GetName,  and  GetNum  as
shown below:


{--------------------------------------------------------------}
{ Match a Specific Input Character }

procedure Match(x: char);
begin
   if Look <> x then Expected('''' + x + '''')
   else begin
      GetChar;
      SkipWhite;
   end;
end;


{--------------------------------------------------------------}
{ Get an Identifier }

function GetName: string;
var Token: string;
begin
   Token := '';
   if not IsAlpha(Look) then Expected('Name');
   while IsAlNum(Look) do begin
      Token := Token + UpCase(Look);
      GetChar;
   end;
   GetName := Token;
   SkipWhite;
end;


{--------------------------------------------------------------}
{ Get a Number }

function GetNum: string;
var Value: string;
begin
   Value := '';
   if not IsDigit(Look) then Expected('Integer');
   while IsDigit(Look) do begin
      Value := Value + Look;
      GetChar;
   end;
   GetNum := Value;
   SkipWhite;
end;
{--------------------------------------------------------------}

(Note  that  I  rearranged  Match  a  bit,  without changing  the
functionality.)

Finally, we need to skip over leading blanks where we  "prime the
pump" in Init:
                             
{--------------------------------------------------------------}
{ Initialize }

procedure Init;
begin
   GetChar;
   SkipWhite;
end;
{--------------------------------------------------------------}


Make these changes and recompile the program.  You will find that
you will have to move Match below SkipWhite, to  avoid  an  error
message from the Pascal compiler.  Test the program as  always to
make sure it works properly.

Since we've made quite  a  few  changes  during this session, I'm
reproducing the entire parser below:


{--------------------------------------------------------------}
program parse;

{--------------------------------------------------------------}
{ Constant Declarations }

const TAB = ^I;
       CR = ^M;

{--------------------------------------------------------------}
{ Variable Declarations }

var Look: char;              { Lookahead Character }

{--------------------------------------------------------------}
{ Read New Character From Input Stream }

procedure GetChar;
begin
   Read(Look);
end;

{--------------------------------------------------------------}
{ Report an Error }

procedure Error(s: string);
begin
   WriteLn;
   WriteLn(^G, 'Error: ', s, '.');
end;


{--------------------------------------------------------------}
{ Report Error and Halt }
                             
procedure Abort(s: string);
begin
   Error(s);
   Halt;
end;


{--------------------------------------------------------------}
{ Report What Was Expected }

procedure Expected(s: string);
begin
   Abort(s + ' Expected');
end;


{--------------------------------------------------------------}
{ Recognize an Alpha Character }

function IsAlpha(c: char): boolean;
begin
   IsAlpha := UpCase(c) in ['A'..'Z'];
end;


{--------------------------------------------------------------}
{ Recognize a Decimal Digit }

function IsDigit(c: char): boolean;
begin
   IsDigit := c in ['0'..'9'];
end;


{--------------------------------------------------------------}
{ Recognize an Alphanumeric }

function IsAlNum(c: char): boolean;
begin
   IsAlNum := IsAlpha(c) or IsDigit(c);
end;


{--------------------------------------------------------------}
{ Recognize an Addop }

function IsAddop(c: char): boolean;
begin
   IsAddop := c in ['+', '-'];
end;


{--------------------------------------------------------------}
{ Recognize White Space }
                             
function IsWhite(c: char): boolean;
begin
   IsWhite := c in [' ', TAB];
end;


{--------------------------------------------------------------}
{ Skip Over Leading White Space }

procedure SkipWhite;
begin
   while IsWhite(Look) do
      GetChar;
end;


{--------------------------------------------------------------}
{ Match a Specific Input Character }

procedure Match(x: char);
begin
   if Look <> x then Expected('''' + x + '''')
   else begin
      GetChar;
      SkipWhite;
   end;
end;


{--------------------------------------------------------------}
{ Get an Identifier }

function GetName: string;
var Token: string;
begin
   Token := '';
   if not IsAlpha(Look) then Expected('Name');
   while IsAlNum(Look) do begin
      Token := Token + UpCase(Look);
      GetChar;
   end;
   GetName := Token;
   SkipWhite;
end;


{--------------------------------------------------------------}
{ Get a Number }

function GetNum: string;
var Value: string;
begin
   Value := '';
   if not IsDigit(Look) then Expected('Integer');
   while IsDigit(Look) do begin
      Value := Value + Look;
      GetChar;
   end;
   GetNum := Value;
   SkipWhite;
end;


{--------------------------------------------------------------}
{ Output a String with Tab }

procedure Emit(s: string);
begin
   Write(TAB, s);
end;


{--------------------------------------------------------------}
{ Output a String with Tab and CRLF }

procedure EmitLn(s: string);
begin
   Emit(s);
   WriteLn;
end;


{---------------------------------------------------------------}
{ Parse and Translate a Identifier }

procedure Ident;
var Name: string[8];
begin
   Name:= GetName;
   if Look = '(' then begin
      Match('(');
      Match(')');
      EmitLn('BSR ' + Name);
      end
   else
      EmitLn('MOVE ' + Name + '(PC),D0');
end;


{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }

procedure Expression; Forward;

procedure Factor;
begin
   if Look = '(' then begin
      Match('(');
      Expression;
      Match(')');
      end
   else if IsAlpha(Look) then
      Ident
   else
      EmitLn('MOVE #' + GetNum + ',D0');
end;


{--------------------------------------------------------------}
{ Recognize and Translate a Multiply }

procedure Multiply;
begin
   Match('*');
   Factor;
   EmitLn('MULS (SP)+,D0');
end;


{-------------------------------------------------------------}
{ Recognize and Translate a Divide }

procedure Divide;
begin
   Match('/');
   Factor;
   EmitLn('MOVE (SP)+,D1');
   EmitLn('EXS.L D0');
   EmitLn('DIVS D1,D0');
end;


{---------------------------------------------------------------}
{ Parse and Translate a Math Term }

procedure Term;
begin
   Factor;
   while Look in ['*', '/'] do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '*': Multiply;
       '/': Divide;
      end;
   end;
end;


{--------------------------------------------------------------}
{ Recognize and Translate an Add }

procedure Add;
begin
   Match('+');
   Term;
   EmitLn('ADD (SP)+,D0');
end;


{-------------------------------------------------------------}
{ Recognize and Translate a Subtract }

procedure Subtract;
begin
   Match('-');
   Term;
   EmitLn('SUB (SP)+,D0');
   EmitLn('NEG D0');
end;


{---------------------------------------------------------------}
{ Parse and Translate an Expression }

procedure Expression;
begin
   if IsAddop(Look) then
      EmitLn('CLR D0')
   else
      Term;
   while IsAddop(Look) do begin
      EmitLn('MOVE D0,-(SP)');
      case Look of
       '+': Add;
       '-': Subtract;
      end;
   end;
end;


{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }

procedure Assignment;
var Name: string[8];
begin
   Name := GetName;
   Match('=');
   Expression;
   EmitLn('LEA ' + Name + '(PC),A0');
   EmitLn('MOVE D0,(A0)')
end;


{--------------------------------------------------------------}
{ Initialize }
                             
procedure Init;
begin
   GetChar;
   SkipWhite;
end;


{--------------------------------------------------------------}
{ Main Program }

begin
   Init;
   Assignment;
   If Look <> CR then Expected('NewLine');
end.
{--------------------------------------------------------------}


Now the parser is complete.  It's got every feature we can put in
a  one-line "compiler."  Tuck it away in a safe place.  Next time
we'll move on to a new subject, but we'll still be  talking about
expressions for quite awhile.  Next installment, I plan to talk a
bit about interpreters as opposed  to compilers, and show you how
the structure of the parser changes a bit as we change  what sort
of action has to be taken.  The information we pick up there will
serve  us in good stead later on, even if you have no interest in
interpreters.  See you next time.


*****************************************************************
*                                                               *
*                        COPYRIGHT NOTICE                       *
*                                                               *
*   Copyright (C) 1988 Jack W. Crenshaw. All rights reserved.   *
*                                                               *
*****************************************************************




--
--/*\ Jon A. Lambert - TychoMUD     Internet:jlsysinc at ix.netcom.com /*\--
--/*\ Mud Server Developer's Page <http://www.netcom.com/~jlsysinc> /*\--
--/*\   "Everything that deceives may be said to enchant" - Plato   /*\--



More information about the mud-dev-archive mailing list