Tutorial: Let's build a Compiler! - Part XII: Miscellany

Jon A. Lambert jlsysinc at ix.netcom.com
Mon Mar 2 16:39:45 CET 1998


                     LET'S BUILD A COMPILER!

                                By

                     Jack W. Crenshaw, Ph.D.

                           5 June 1989


                       Part XII: MISCELLANY


*****************************************************************
*                                                               *
*                        COPYRIGHT NOTICE                       *
*                                                               *
*   Copyright (C) 1989 Jack W. Crenshaw. All rights reserved.   *
*                                                               *
*****************************************************************


INTRODUCTION

This installment is another one  of  those  excursions  into side
alleys  that  don't  seem to fit  into  the  mainstream  of  this
tutorial  series.    As I mentioned last time, it was while I was
writing this installment that I realized some changes  had  to be
made  to  the  compiler structure.  So I had to digress from this
digression long enough to develop the new structure  and  show it
to you.

Now that that's behind us, I can tell you what I  set  out  to in
the first place.  This shouldn't  take  long, and then we can get
back into the mainstream.

Several people have asked  me  about  things that other languages
provide, but so far I haven't addressed in this series.   The two
biggies are semicolons and  comments.    Perhaps  you've wondered
about them, too, and  wondered  how things would change if we had
to  deal with them.  Just so you can proceed with what's to come,
without being  bothered by that nagging feeling that something is
missing, we'll address such issues here.


SEMICOLONS

Ever since the introduction of Algol, semicolons have been a part
of  almost every modern language.  We've all  used  them  to  the
point that they are taken for  granted.   Yet I suspect that more
compilation errors have  occurred  due  to  misplaced  or missing
semicolons  than  any  other single cause.  And if we had a penny
for  every  extra  keystroke programmers have used  to  type  the
little rascals, we could pay off the national debt.

Having  been  brought  up with FORTRAN, it took me a long time to
get used to using semicolons, and to tell the  truth  I've  never
quite understood why they  were  necessary.    Since I program in
Pascal, and since the use of semicolons in Pascal is particularly
tricky,  that one little character is still  by  far  my  biggest
source of errors.

When  I  began  developing  KISS,  I resolved to  question  EVERY
construct in other languages, and to try to avoid the most common
problems that occur with them.  That puts the semicolon very high
on my hit list.

To  understand  the  role of the semicolon, you have to look at a
little history.

Early programming languages were line-oriented.  In  FORTRAN, for
example, various parts  of  the statement had specific columns or
fields that they had to appear in.  Since  some  statements  were
too  long for one line, the  "continuation  card"  mechanism  was
provided to let  the  compiler  know  that a given card was still
part of the previous  line.   The mechanism survives to this day,
even though punched cards are now things of the distant past.

When  other  languages  came  along,  they  also  adopted various
mechanisms for dealing with multiple-line statements.  BASIC is a
good  example.  It's important to  recognize,  though,  that  the
FORTRAN  mechanism  was   not   so  much  required  by  the  line
orientation of that  language,  as by the column-orientation.  In
those versions of FORTRAN  where  free-form  input  is permitted,
it's no longer needed.

When the fathers  of  Algol introduced that language, they wanted
to get away  from  line-oriented programs like FORTRAN and BASIC,
and allow for free-form input.   This included the possibility of
stringing multiple statements on a single line, as in


     a=b; c=d; e=e+1;


In cases like this,  the  semicolon is almost REQUIRED.  The same
line, without the semicolons, just looks "funny":


     a=b c= d e=e+1

I suspect that this is the major ... perhaps ONLY ...  reason for
semicolons: to keep programs from looking funny.

But  the  idea  of stringing multiple statements  together  on  a
single  line  is  a  dubious  one  at  best.  It's not very  good
programming  style,  and  harks back to  the  days  when  it  was
considered improtant to conserve cards.  In these  days  of CRT's
and indented code, the clarity of programs is  far  better served
by  keeping statements separate.  It's still  nice  to  have  the
OPTION  of  multiple  statements,  but  it seems a shame to  keep
programmers  in  slavery  to the semicolon, just to keep that one
rare case from "looking funny."

When I started in with KISS, I tried  to  keep  an  open mind.  I
decided that I would use  semicolons when it became necessary for
the parser, but not until then.  I figured this would happen just
about  the time I added the ability  to  spread  statements  over
multiple lines.  But, as you  can  see, that never happened.  The
TINY compiler is perfectly  happy  to  parse the most complicated
statement, spread over any number of lines, without semicolons.

Still, there are people  who  have  used  semicolons for so long,
they feel naked  without them.  I'm one of them.  Once I had KISS
defined sufficiently well, I began to write a few sample programs
in the language.    I  discovered,  somewhat to my horror, that I
kept  putting  semicolons  in anyway.   So  now  I'm  facing  the
prospect of a NEW  rash  of  compiler  errors, caused by UNWANTED
semicolons.  Phooey!

Perhaps more to the point, there are readers out  there  who  are
designing their own languages, which may  include  semicolons, or
who  want to use the techniques of  these  tutorials  to  compile
conventional languages like  C.    In  either case, we need to be
able to deal with semicolons.


SYNTACTIC SUGAR

This whole discussion brings  up  the  issue of "syntactic sugar"
... constructs that are added to a language, not because they are
needed, but because they help make the programs look right to the
programmer.    After  all, it's nice  to  have  a  small,  simple
compiler,    but  it  would  be  of  little  use if the resulting
language  were  cryptic  and hard to program.  The language FORTH
comes  to mind (a premature OUCH! for the  barrage  I  know  that
one's going to fetch me).  If we can add features to the language
that  make the programs easier to read  and  understand,  and  if
those features  help keep the programmer from making errors, then
we should do so.    Particularly if the constructs don't add much
to the complexity of the language or its compiler.

The  semicolon  could  be considered an example,  but  there  are
plenty of others, such as the 'THEN' in a IF-statement,  the 'DO'
in a WHILE-statement,  and  even the 'PROGRAM' statement, which I
came within a gnat's eyelash of leaving out  of  TINY.    None of
these tokens  add  much  to  the  syntax  of the language ... the
compiler can figure out  what's  going on without them.  But some
folks feel that they  DO  add to the readability of programs, and
that can be very important.

There are two schools of thought on this subject, which  are well
represented by two of our most popular languages, C and Pascal.

To  the minimalists, all such sugar should be  left  out.    They
argue that it clutters up the language and adds to the  number of
keystrokes  programmers  must type.   Perhaps  more  importantly,
every extra token or keyword represents a trap laying in wait for
the inattentive programmer.  If you leave out  a  token, misplace
it, or misspell it, the compiler  will  get you.  So these people
argue that the best approach is to get rid of such things.  These
folks tend to like C, which has a minimum of unnecessary keywords
and punctuation.

Those from the other school tend to like Pascal.  They argue that
having to type a few extra characters is a small price to pay for
legibility.    After  all, humans have to read the programs, too.
Their best argument is that each such construct is an opportunity
to tell the compiler that you really mean for it  to  do what you
said to.  The sugary tokens serve as useful landmarks to help you
find your way.

The differences are well represented by the two  languages.   The
most oft-heard complaint about  C  is  that  it is too forgiving.
When you make a mistake in C, the  erroneous  code  is  too often
another  legal  C  construct.    So  the  compiler  just  happily
continues to compile, and  leaves  you  to  find the error during
debug.    I guess that's why debuggers  are  so  popular  with  C
programmers.

On the  other  hand,  if  a  Pascal  program compiles, you can be
pretty  sure that the program will do what you told it.  If there
is an error at run time, it's probably a design error.

The  best  example  of  useful  sugar  is  the semicolon  itself.
Consider the code fragment:


     a=1+(2*b+c)   b...


Since there is no operator connecting the token 'b' with the rest
of the  statement, the compiler will conclude that the expression
ends  with  the  ')', and the 'b'  is  the  beginning  of  a  new
statement.    But  suppose  I  have simply left out the  intended
operator, and I really want to say:


     a=1+(2*b+c)*b...


In  this  case  the compiler will get an error, all right, but it
won't be very meaningful  since  it will be expecting an '=' sign
after the 'b' that really shouldn't be there.

If, on the other hand, I include a semicolon after the  'b', THEN
there  can  be no doubt where I  intend  the  statement  to  end.
Syntactic  sugar,  then,  can  serve  a  very  useful purpose  by
providing some additional insurance that we remain on track.

I find  myself  somewhere  in  the middle of all this.  I tend to
favor the Pascal-ers' view ... I'd much rather find  my  bugs  at
compile time rather than run time.  But I also hate to just throw
verbosity  in  for  no apparent reason, as in COBOL.  So far I've
consistently left most of the Pascal sugar out of KISS/TINY.  But
I certainly have no strong feelings either way, and  I  also  can
see the value of sprinkling a little sugar around  just  for  the
extra  insurance  that  it  brings.    If  you like  this  latter
approach, things like that are easy to add.  Just  remember that,
like  the semicolon, each item of sugar  is  something  that  can
potentially cause a compile error by its omission.


DEALING WITH SEMICOLONS

There  are  two  distinct  ways  in which semicolons are used  in
popular  languages.    In Pascal, the semicolon is regarded as an
statement SEPARATOR.  No semicolon  is  required  after  the last
statement in a block.  The syntax is:


     <block> ::= <statement> ( ';' <statement>)*

     <statement> ::= <assignment> | <if> | <while> ... | null


(The null statement is IMPORTANT!)

Pascal  also defines some semicolons in  other  places,  such  as
after the PROGRAM statement.

In  C  and  Ada, on the other hand, the semicolon is considered a
statement TERMINATOR,  and  follows  all  statements  (with  some
embarrassing and confusing  exceptions).   The syntax for this is
simply:


     <block> ::= ( <statement> ';')*


Of  the two syntaxes, the Pascal one seems on the face of it more
rational, but experience has shown  that it leads to some strange
difficulties.  People get  so  used  to  typing a semicolon after
every  statement  that  they tend to  type  one  after  the  last
statement in a block, also.  That usually doesn't cause  any harm
...  it  just gets treated as a  null  statement.    Many  Pascal
programmers, including yours truly,  do  just  that. But there is
one  place you absolutely CANNOT type  a  semicolon,  and  that's
right before an ELSE.  This little gotcha  has  cost  me  many an
extra  compilation,  particularly  when  the  ELSE  is  added  to
existing code.    So  the  C/Ada  choice  turns out to be better.
Apparently Nicklaus Wirth thinks so, too:  In his  Modula  2,  he
abandoned the Pascal approach.

Given either of these two syntaxes, it's an easy matter (now that
we've  reorganized  the  parser!) to add these  features  to  our
parser.  Let's take the last case first, since it's simpler.

To begin, I've made things easy by introducing a new recognizer:


{--------------------------------------------------------------}
{ Match a Semicolon }

procedure Semi;
begin
   MatchString(';');
end;
{--------------------------------------------------------------}


This procedure works very much like our old Match.  It insists on
finding a semicolon as the next token.  Having found it, it skips
to the next one.

Since a  semicolon follows a statement, procedure Block is almost
the only one we need to change:


{--------------------------------------------------------------}
{ Parse and Translate a Block of Statements }

procedure Block;
begin
   Scan;
   while not(Token in ['e', 'l']) do begin
      case Token of
       'i': DoIf;
       'w': DoWhile;
       'R': DoRead;
       'W': DoWrite;
       'x': Assignment;
      end;
      Semi;
      Scan;
   end;
end;
{--------------------------------------------------------------}


Note carefully the subtle change in the case statement.  The call
to  Assignment  is now guarded by a test on Token.   This  is  to
avoid calling Assignment when the  token  is  a  semicolon (which
could happen if the statement is null).

Since declarations are also  statements,  we  also  need to add a
call to Semi within procedure TopDecls:


{--------------------------------------------------------------}
{ Parse and Translate Global Declarations }

procedure TopDecls;
begin
   Scan;
   while Token = 'v' do begin
      Alloc;
      while Token = ',' do
         Alloc;
      Semi;
   end;
end;
{--------------------------------------------------------------}


Finally, we need one for the PROGRAM statement:


{--------------------------------------------------------------}
{ Main Program }

begin
   Init;
   MatchString('PROGRAM');
   Semi;
   Header;
   TopDecls;
   MatchString('BEGIN');
   Prolog;
   Block;
   MatchString('END');
   Epilog;
end.
{--------------------------------------------------------------}


It's as easy as that.  Try it with a copy of TINY and see how you
like it.

The Pascal version  is  a  little  trickier,  but  it  still only
requires  minor  changes,  and those only to procedure Block.  To
keep things as simple as possible, let's split the procedure into
two parts.  The following procedure handles just one statement:


{--------------------------------------------------------------}
{ Parse and Translate a Single Statement }

procedure Statement;
begin
   Scan;
   case Token of
    'i': DoIf;
    'w': DoWhile;
    'R': DoRead;
    'W': DoWrite;
    'x': Assignment;
   end;
end;
{--------------------------------------------------------------}


Using this procedure, we can now rewrite Block like this:


{--------------------------------------------------------------}
{ Parse and Translate a Block of Statements }

procedure Block;
begin
   Statement;
   while Token = ';' do begin
      Next;
      Statement;
   end;
end;
{--------------------------------------------------------------}


That  sure  didn't  hurt, did it?  We can now parse semicolons in
Pascal-like fashion.


A COMPROMISE

Now that we know how to deal with semicolons, does that mean that
I'm going to put them in KISS/TINY?  Well, yes and  no.    I like
the extra sugar and the security that comes with knowing for sure
where the  ends  of  statements  are.    But I haven't changed my
dislike for the compilation errors associated with semicolons.

So I have what I think is a nice compromise: Make them OPTIONAL!

Consider the following version of Semi:


{--------------------------------------------------------------}
{ Match a Semicolon }

procedure Semi;
begin
   if Token = ';' then Next;
end;
{--------------------------------------------------------------}


This procedure will ACCEPT a semicolon whenever it is called, but
it won't INSIST on one.  That means that when  you  choose to use
semicolons, the compiler  will  use the extra information to help
keep itself on track.  But if you omit one (or omit them all) the
compiler won't complain.  The best of both worlds.

Put this procedure in place in the first version of  your program
(the  one for C/Ada syntax), and you have  the  makings  of  TINY
Version 1.2.


COMMENTS

Up  until  now  I have carefully avoided the subject of comments.
You would think that this would be an easy subject ... after all,
the compiler doesn't have to deal with comments at all; it should
just ignore them.  Well, sometimes that's true.

Comments can be just about as easy or as difficult as  you choose
to make them.    At  one  extreme,  we can arrange things so that
comments  are  intercepted  almost  the  instant  they  enter the
compiler.  At the  other,  we can treat them as lexical elements.
Things  tend to get interesting when  you  consider  things  like
comment delimiters contained in quoted strings.


SINGLE-CHARACTER DELIMITERS

Here's an example.  Suppose we assume the  Turbo  Pascal standard
and use curly braces for comments.  In this case we  have single-
character delimiters, so our parsing is a little easier.

One  approach  is  to  strip  the  comments  out the  instant  we
encounter them in the input stream; that is,  right  in procedure
GetChar.    To  do  this,  first  change  the  name of GetChar to
something else, say GetCharX.  (For the record, this is  going to
be a TEMPORARY change, so best not do this with your only copy of
TINY.  I assume you understand that you should  always  do  these
experiments with a working copy.)

Now, we're going to need a  procedure  to skip over comments.  So
key in the following one:


{--------------------------------------------------------------}
{ Skip A Comment Field }

procedure SkipComment;
begin
   while Look <> '}' do
      GetCharX;
   GetCharX;
end;
{--------------------------------------------------------------}


Clearly, what this procedure is going to do is to simply read and
discard characters from the input  stream, until it finds a right
curly brace.  Then it reads one more character and returns  it in
Look.

Now we can  write  a  new  version of GetChar that SkipComment to
strip out comments:


{--------------------------------------------------------------}
{ Get Character from Input Stream }
{ Skip Any Comments }

procedure GetChar;
begin
   GetCharX;
   if Look = '{' then SkipComment;
end;
{--------------------------------------------------------------}


Code this up  and  give  it  a  try.    You'll find that you can,
indeed, bury comments anywhere you like.  The comments never even
get into the parser proper ... every call to GetChar just returns
any character that's NOT part of a comment.

As a matter of fact, while  this  approach gets the job done, and
may even be  perfectly  satisfactory  for  you, it does its job a
little  TOO  well.    First  of all, most  programming  languages
specify that a comment should be treated like a  space,  so  that
comments aren't allowed  to  be embedded in, say, variable names.
This current version doesn't care WHERE you put comments.

Second, since the  rest  of  the  parser can't even receive a '{'
character, you will not be allowed to put one in a quoted string.

Before you turn up your nose at this simplistic solution, though,
I should point out  that  as respected a compiler as Turbo Pascal
also won't allow  a  '{' in a quoted string.  Try it.  And as for
embedding a comment in an  identifier, I can't imagine why anyone
would want to do such a  thing,  anyway, so the question is moot.
For 99% of all  applications,  what I've just shown you will work
just fine.

But,  if  you  want  to  be  picky  about it  and  stick  to  the
conventional treatment, then we  need  to  move  the interception
point downstream a little further.

To  do  this,  first change GetChar back to the way  it  was  and
change the name called in SkipComment.  Then, let's add  the left
brace as a possible whitespace character:


{--------------------------------------------------------------}
{ Recognize White Space }

function IsWhite(c: char): boolean;
begin
   IsWhite := c in [' ', TAB, CR, LF, '{'];
end;
{--------------------------------------------------------------}


Now, we can deal with comments in procedure SkipWhite:


{--------------------------------------------------------------}
{ Skip Over Leading White Space }

procedure SkipWhite;
begin
   while IsWhite(Look) do begin
      if Look = '{' then
         SkipComment
      else
         GetChar;
   end;
end;
{--------------------------------------------------------------}


Note  that SkipWhite is written so that we  will  skip  over  any
combination of whitespace characters and comments, in one call.

OK, give this one a try, too.   You'll  find  that  it will let a
comment serve to delimit tokens.  It's worth mentioning that this
approach also gives us the  ability to handle curly braces within
quoted strings, since within such  strings we will not be testing
for or skipping over whitespace.

There's one last  item  to  deal  with:  Nested  comments.   Some
programmers like the idea  of  nesting  comments, since it allows
you to comment out code during debugging.  The  code  I've  given
here won't allow that and, again, neither will Turbo Pascal.

But the fix is incredibly easy.  All  we  need  to  do is to make
SkipComment recursive:


{--------------------------------------------------------------}
{ Skip A Comment Field }

procedure SkipComment;
begin
   while Look <> '}' do begin
      GetChar;
      if Look = '{' then SkipComment;
   end;
   GetChar;
end;
{--------------------------------------------------------------}


That does it.  As  sophisticated a comment-handler as you'll ever
need.


MULTI-CHARACTER DELIMITERS

That's all well and  good  for cases where a comment is delimited
by single  characters,  but  what  about  the  cases such as C or
standard Pascal, where two  characters  are  required?  Well, the
principles are still the same, but we have to change our approach
quite a bit.  I'm sure it won't surprise you to learn that things
get harder in this case.

For the multi-character situation, the  easiest thing to do is to
intercept the left delimiter  back  at the GetChar stage.  We can
"tokenize" it right there, replacing it by a single character.

Let's assume we're using the C delimiters '/*' and '*/'.   First,
we  need  to  go back to the "GetCharX' approach.  In yet another
copy of your compiler, rename  GetChar to GetCharX and then enter
the following new procedure GetChar:


{--------------------------------------------------------------}
{ Read New Character.  Intercept '/*' }

procedure GetChar;
begin
   if TempChar <> ' ' then begin
      Look := TempChar;
      TempChar := ' ';
      end
   else begin
      GetCharX;
      if Look = '/' then begin
         Read(TempChar);
         if TempChar = '*' then begin
            Look := '{';
            TempChar := ' ';
         end;
      end;
   end;
end;
{--------------------------------------------------------------}


As you can see, what this procedure does is  to  intercept  every
occurrence of '/'.  It then examines the NEXT  character  in  the
stream.  If the character  is  a  '*',  then  we  have  found the
beginning  of  a  comment,  and  GetChar  will  return  a  single
character replacement for it.   (For  simplicity,  I'm  using the
same '{' character  as I did for Pascal.  If you were writing a C
compiler, you'd no doubt want to pick some other character that's
not  used  elsewhere  in C.  Pick anything you like ... even $FF,
anything that's unique.)

If the character  following  the  '/'  is NOT a '*', then GetChar
tucks it away in the new global TempChar, and  returns  the  '/'.

Note that you need to declare this new variable and initialize it
to ' '.  I like to do  things  like  that  using the Turbo "typed
constant" construct:


     const TempChar: char = ' ';


Now we need a new version of SkipComment:


{--------------------------------------------------------------}
{ Skip A Comment Field }

procedure SkipComment;
begin
   repeat
      repeat
         GetCharX;
      until Look = '*';
      GetCharX;
   until Look = '/';
   GetChar;
end;
{--------------------------------------------------------------}


A  few  things  to  note:  first  of  all, function  IsWhite  and
procedure SkipWhite  don't  need  to  be  changed,  since GetChar
returns the '{' token.  If you change that token  character, then
of  course you also need to change the  character  in  those  two
routines.

Second, note that  SkipComment  doesn't call GetChar in its loop,
but  GetCharX.    That  means   that  the  trailing  '/'  is  not
intercepted and  is seen by SkipComment.  Third, although GetChar
is the  procedure  doing  the  work,  we  can still deal with the
comment  characters  embedded  in  a  quoted  string,  by calling
GetCharX  instead  of  GetChar  while  we're  within  the string.
Finally,  note  that  we can again provide for nested comments by
adding a single statement to SkipComment, just as we did before.


ONE-SIDED COMMENTS

So far I've shown you  how  to  deal  with  any  kind  of comment
delimited on the left and the  right.   That only leaves the one-
sided comments like those in assembler language or  in  Ada, that
are terminated by the end of the line.  In a  way,  that  case is
easier.   The only procedure that would need  to  be  changed  is
SkipComment, which must now terminate at the newline characters:


{--------------------------------------------------------------}
{ Skip A Comment Field }

procedure SkipComment;
begin
   repeat
      GetCharX;
   until Look = CR;
   GetChar;
end;
{--------------------------------------------------------------}


If the leading character is  a  single  one,  as  in  the  ';' of
assembly language, then we're essentially done.  If  it's  a two-
character token, as in the '--'  of  Ada, we need only modify the
tests  within  GetChar.   Either way, it's an easier problem than
the balanced case.


CONCLUSION

At this point we now have the ability to deal with  both comments
and semicolons, as well as other kinds of syntactic sugar.   I've
shown  you several ways to deal with  each,  depending  upon  the
convention  desired.    The  only  issue left is: which of  these
conventions should we use in KISS/TINY?

For the reasons that I've given as we went  along,  I'm  choosing
the following:


 (1) Semicolons are TERMINATORS, not separators

 (2) Semicolons are OPTIONAL

 (3) Comments are delimited by curly braces

 (4) Comments MAY be nested


Put the code corresponding to these cases into your copy of TINY.
You now have TINY Version 1.2.

Now that we  have  disposed  of  these  sideline  issues,  we can
finally get back into the mainstream.  In  the  next installment,
we'll talk  about procedures and parameter passing, and we'll add
these important features to TINY.  See you then.


*****************************************************************
*                                                               *
*                        COPYRIGHT NOTICE                       *
*                                                               *
*   Copyright (C) 1989 Jack W. Crenshaw. All rights reserved.   *
*                                                               *
*****************************************************************

--
--/*\ Jon A. Lambert - TychoMUD     Internet:jlsysinc at ix.netcom.com /*\--
--/*\ Mud Server Developer's Page <http://www.netcom.com/~jlsysinc> /*\--
--/*\   "Everything that deceives may be said to enchant" - Plato   /*\--






More information about the mud-dev-archive mailing list