string parsing

Wed Oct 29 17:38:08 CET 1997

What follows is a description of something I am currently implementing
for my server.  I am providing it here for others to comment on.  Also,
I'd like to hear of other approaches to the same problem.

The notation for datatypes and code is from LPC.  Note that `({ ... })'
is used to describe an array in LPC.  Thus,

    ({ "foo", "bar", 47 })

describes an array of 3 items, the last of which is the number 47.

---

The problem: string parsing.

Rather than concentrating on a specific instance, such as parsing natural
language for the purpose of interpreting player commands, I have chosen
to provide a tool that is useful for implementing a natural language
parser.  Another design goal was that it should be equally suitable for
parsing a programming language, and powerful enough to parse LPC itself.

Function:

    parse_string(string grammar, string str)

Description:

    Parse the string `str' according to the grammar.  Return the parse tree
    (or an array of parse trees).

The grammar consists of rules.  There are two types of rules: token rules
and production rules.  A token rule describes a token with a regular
expression.  A production rule describes a production in the context-free
grammar for the language to be parsed.  Thus, the parsing function can be
considered to incorporate the unix tools lex and yacc.

A token rule looks like this:

    token = /regexp/

`token' is a symbol associated with a particular token in the grammar,
and the regular expression, delimited by `/' characters, describes the
form it takes in the input.  More than one token rule may be specified
with the same token symbol.  The special token symbol `whitespace' can
be used to describe token-separating whitespace.

In a regular expression, the characters `[ ]' enclose a subrange.  For
instance, `[abc]' matches either of the characters a, b or c.  A dash
can be used to indicate a subrange: `[0-9]' matches any digit.

The following characters have a special meaning in regular expressions:

    ?	the preceding item is optional
    *	the preceding item matches zero or more times
    +	the preceding item matches at least once
    |	either the preceding or the following item is matched

A production rule looks like any of the following:

    prod: word
    prod: foo
    prod: ( bar )
    prod: 'gnu'
    prod: token prod foo ( bar 'gnu' )
    prod:
    prod: foo bar ? func

That is, on the right-hand side of a production rule can be any
combination of token symbols, production symbols, strings enclosed in
single quotes, and matched parenthesis enclosing at least one other
item.  A production rule may optionally be followed by a question mark
and the name of an LPC function.

The first production rule defines the start symbol.  Alternative
productions for the same symbol must be specified in different rules.

Henceforth, a token symbol will be called a `terminal' and a production
symbol will be called a `nonterminal'.

Parenthesis in production rules is used to structure the resulting
parse tree.  For example, given the following grammar:

    whitespace = / /
    word = /[a-z]+/

    Thing: Noun
    Thing: ( Adjs ) Noun
    Noun: word
    Adjs: word Adjs
    Adjs: word

parse_string(grammar, "bottle") will return

    ({ "bottle" }),

whereas parse_string(grammar, "big blue bottle") will return

    ({ ({ "big", "blue" }), "bottle" })

That is, the resulting parse tree is not structured according to
the production rules of the grammar, but as specified by parenthesis
only.  Without parenthesis in any rules, a single flat array of
tokens is returned.

If a production rule is followed by a question mark and an LPC function
name, the function will be called with the intermediate parse tree
for the current rule, and the return value will replace the intermediate
parse tree in the overall return value.

The grammar may be ambiguous.  Parsing is done in stages: first, all
possible parse trees for the input are generated, using Tomita's
breadth-first LR(k) parsing algorithm.  Next, any LPC functions specified
for rules are called (bottom-up), and the intermediate parse trees for
those rules are replaced by the return values of those functions.  As
a special case, if the LPC function returns zero, the entire parse tree
that this intermediate parse tree is part of is discarded.  Finally,
the resulting parse tree(s) are returned.

NOTE: I have not yet decided how to handle more than one parse tree in
the result. When parsing a programming language, only one single valid
parse tree is desired.  On the other hand, when parsing natural language,
ambiguities cannot always be resolved during syntax analysis.  For
example, the sentence "kill the troll in the forest" can mean, "go to
the forest and kill the troll you find there" or "take this captive
troll to the forest and kill him".  Even with a captive troll on hand,
the player should probably be notified about the ambiguity.

Example: command parsing

Grammar:

    whitespace = / /
    word =       /[a-z]+/

    Sentence:   'give' Object 'to' Living
    Object:     ( Article ) ( Adjs ) Noun	? find_obj
    Living:     ( Article ) ( Adjs ) Noun	? find_liv
    Article:    'the'
    Article:    'a'
    Article:    'an'
    Article:    'any'
    Article:
    Noun:       word
    Adjectives: word Adjectives
    Adjectives:

Note that due to the structuring of the rule for the nonterminal
`Object', the LPC function find_obj() is always called with an
array of 3 items, the first of which is an array holding at most one
article, the second of which is an array of adjectives, and the third
of which is a noun.

If find_obj() and find_liv() both return matching objects, then parsing
the following sentence

    "give long blue sword to the small dwarf"

will cause find_obj() to be called with

    ({ ({ }), ({ "long", "blue" }), "sword" })

and find_liv() with

    ({ ({ "the" }), ({ "small" }), "dwarf" })

and parse_string() will return the result

    ({ "give", OBJ(sword), "to", OBJ(dwarf) })

Felix Croes