[DGD]Parse String

Thu Jun 7 04:51:43 CEST 2001

On Wed, Jun 06, 2001 at 09:49:13PM -0400, S. Foley wrote:
> I apologize in advance if I am working off of an outdated parse_string
> help file.
> 
> My first question relates to the type of 'operators' (I don't know what
> else to call them) usable in token rules.  The help file for parse
> string indicates the following such operators are available:

I think they may be called 'quantifiers', not sure though.

> >and, with regular expressions "a" and "b":
> >
> >   a*      zero or more occurrences of a	(highest precedence)
> >   a+      one or more occurrences of a
> >   ab      the concatenation of a and b
> >   a|b     a or b
> >   (a)     a					(lowest precedence)
> 
> Yet looking at an example grammar Mr. Croes wrote I see the following:
> 
> >FLOAT_CONST = /[0-9]+\\.[0-9]*([eE][-+]?[0-9]+)/ \
> >FLOAT_CONST = /[0-9]*\\.[0-9]+([eE][-+]?[0-9]+)/ \
> 
> Now from what I understand '?' is frequently used in regular expressions
> to indicate 0 or 1 occurrences of what precedes it.  Is that what it is
> being used for here?

Yes. :)

>                       If so, are there any other 'operators' like this
> that are frequently used in regular expressions that are useable in
> token rules that are undocumented (assuming I have an up to date help
> file for parse_string)?

Not that I'm aware of, assuming you're referring to such things as
\d and \s to represent [0-9] and [\t\n\r ] for instance?  I _think_
Dworkin considers those unnecessary syntactic sugar, convenient but
not required in the essential functionality.  For instance, there is
as far as I know no <regexp>{n,m} support, since those can be
rewritten with the already available features.
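
To make the "can be rewritten" point concrete, here is a minimal sketch
in Python (not LPC) that expands a bounded repetition into plain
concatenation plus '?'.  The helper name expand_bounded is made up for
this illustration, and Python's re module is only used to check the
result; it is not how parse_string works internally.

```python
import re

def expand_bounded(atom: str, n: int, m: int) -> str:
    """Rewrite atom{n,m} using only grouping, concatenation and '?'.

    Hypothetical helper for illustration: n mandatory copies of the
    atom followed by (m - n) optional copies.
    """
    if not 0 <= n <= m:
        raise ValueError("need 0 <= n <= m")
    required = f"({atom})" * n
    optional = f"({atom})?" * (m - n)
    return required + optional

# [0-9]{2,4} written without the {n,m} operator
pattern = expand_bounded("[0-9]", 2, 4)
print(pattern)
for count in range(7):
    text = "7" * count
    print(count, bool(re.fullmatch(pattern, text)))
```

The same idea covers \d and \s: they are just shorthand for the
character classes [0-9] and [\t\n\r ] written out in full.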

> I also have another question relating to precedence.  The parse_string
> help file states the following:
> 
> >For any regular expression, the longest possible token will be
> >matched.  The name "whitespace" is reserved for defining a special
> >token, which is simply skipped.  More than one rule may be specified
> >for each token, including whitespace.
> 
> >If a string matches more than one token, the token for which the rule
> >appears first in the grammar is selected.  If a string does not match
> >any token, it is rejected and parsing fails.
> 
> My question is if I define some tokens:
> 
>   token1 = /[a-z]+/
>   token2 = /.*/
> 
> Despite the fact that the rule for token1 precedes the rule for token2,
> nothing is ever going to be matched up to token1 because the longest
> match precedence rule takes precedence over all?  My experiments with
> the function and my reading up on the subject of regexp's seem to
> indicate that this would be the case.

Yes.  The order becomes important when more than one rule matches, for
instance:

  token1 = /[a-zA-Z0-9]+/
  token2 = /foo/

In this case, the sequence 'foo' would be seen as a token1, not
token2, so that any rule with token2 in it will never be matched in
practice.
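
The two rules quoted from the help file (longest match first, then
earliest rule on a tie) can be modelled with a small Python sketch.
This is only a model built from that description, not DGD's actual
implementation; the function names are invented, and Python's regexp
dialect stands in for the one parse_string accepts.

```python
import re

def longest_prefix(pattern: str, text: str) -> int:
    """Length of the longest non-empty prefix of text that the pattern
    matches completely, or -1 if there is none."""
    compiled = re.compile(pattern)
    for end in range(len(text), 0, -1):
        if compiled.fullmatch(text, 0, end):
            return end
    return -1

def next_token(rules, text):
    """rules is a list of (name, pattern) pairs in grammar order.
    Returns (name, lexeme) for the first token of text."""
    best_name, best_len = None, 0
    for name, pattern in rules:
        length = longest_prefix(pattern, text)
        if length > best_len:  # strict '>' keeps the earlier rule on a tie
            best_name, best_len = name, length
    if best_name is None:
        raise ValueError("no token matches; parsing would fail")
    return best_name, text[:best_len]

print(next_token([("token1", "[a-z]+"), ("token2", ".*")], "hello world"))
print(next_token([("token1", "[a-zA-Z0-9]+"), ("token2", "foo")], "foo"))
```

In this model the first call picks token2 because /.*/ swallows the
whole input, and the second picks token1 because both rules match three
characters and token1 comes first.  The only way token1 = /[a-z]+/ can
beat /.*/ here is when the entire remaining input consists of lowercase
letters, so that both matches have the same length.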

There is another catch, if you define something like this:

  word = /[a-z]+/

  rule : word 'foo'

Then the 'word' regexp will never match 'foo' because it is used in
the grammar.  A workaround here is to do something like this:

  word = /[a-z]+/

  rule : _word 'foo'

  _word : word
  _word : 'foo'
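
A rough Python model of why the workaround is needed, assuming (as
described above) that a literal used in the grammar always wins over a
regexp token for the same text.  The names classify and matches_rule
are made up for this sketch, and the "parser" only handles the one
two-token rule from the example.

```python
import re

def classify(lexeme, literals, regex_rules):
    """Name the token for one complete lexeme.  Assumption: grammar
    literals take priority over regexp token rules."""
    if lexeme in literals:
        return repr(lexeme)          # e.g. "'foo'"
    for name, pattern in regex_rules:
        if re.fullmatch(pattern, lexeme):
            return name
    return None

def matches_rule(tokens, first_alternatives):
    """rule : X 'foo', where X may be any name in first_alternatives."""
    return (len(tokens) == 2
            and tokens[0] in first_alternatives
            and tokens[1] == "'foo'")

literals = {"foo"}
regex_rules = [("word", "[a-z]+")]

for text in ("bar foo", "foo foo"):
    tokens = [classify(w, literals, regex_rules) for w in text.split()]
    plain = matches_rule(tokens, {"word"})               # rule : word 'foo'
    patched = matches_rule(tokens, {"word", "'foo'"})    # rule : _word 'foo'
    print(text, tokens, plain, patched)
```

With the plain rule, "foo foo" is rejected because its first token is
the literal rather than a word; adding the literal as an alternative
for _word accepts it.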

> I have one last theoretical question.  I've been told that for every
> regular expression it is possible to write an inverse regular expression.  
> That is to say, if I have a regular expression A, I can
> write a regular expression B such that everything that A matches, B will
> not match, and everything A does not match, B will match.  How difficult
> would it be to implement an 'operator' (like * or +) for token rules
> that could be used to mean any string that does not match the regular
> expression that precedes it?  Would it be possible to write a front
> end to parse_string using parse_string itself to generate B from A?
> 
> I have almost no background in the computer sciences, so I apologize
> if any of these questions were trivial.

For individual cases it _sounds_ straightforward, but it wouldn't
surprise me at all if there were some very nasty catches lurking
right around the corner there... anyone else feel like tackling this
one? :-)
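
For what it's worth, the textbook route to an "inverse" does not work
on the regexp text directly.  Regular languages are closed under
complement: turn A into a complete deterministic automaton over a
fixed alphabet, then swap accepting and non-accepting states.  Below is
a minimal Python sketch of that one step, with a hand-built automaton
for /a+/ over the assumed alphabet {a, b}.  The DFA class and its names
are invented for this example, and nothing here touches parse_string.

```python
import itertools
from dataclasses import dataclass

@dataclass
class DFA:
    alphabet: frozenset
    states: frozenset
    start: str
    accepting: frozenset
    delta: dict  # (state, symbol) -> state; must cover every pair

    def accepts(self, text: str) -> bool:
        state = self.start
        for symbol in text:
            if symbol not in self.alphabet:
                raise ValueError(f"symbol {symbol!r} is outside the alphabet")
            state = self.delta[(state, symbol)]
        return state in self.accepting

    def complement(self) -> "DFA":
        # Same automaton, but accept exactly where the original rejects.
        return DFA(self.alphabet, self.states, self.start,
                   self.states - self.accepting, self.delta)

# Hand-built complete DFA for /a+/ over the alphabet {a, b}
A = DFA(
    alphabet=frozenset("ab"),
    states=frozenset({"start", "some_a", "dead"}),
    start="start",
    accepting=frozenset({"some_a"}),
    delta={
        ("start", "a"): "some_a", ("start", "b"): "dead",
        ("some_a", "a"): "some_a", ("some_a", "b"): "dead",
        ("dead", "a"): "dead", ("dead", "b"): "dead",
    },
)
B = A.complement()

def all_strings(alphabet, max_len):
    """Every string over the alphabet up to max_len symbols, shortest first."""
    for length in range(max_len + 1):
        for combo in itertools.product(sorted(alphabet), repeat=length):
            yield "".join(combo)

for text in all_strings(A.alphabet, 2):
    print(repr(text), A.accepts(text), B.accepts(text))
```

The catches are probably where you'd expect them: the alphabet has to
be pinned down, the automaton has to be complete, turning B back into
regexp text can produce something far larger than A, and it is not
obvious how "does not match" should interact with the longest-match
rule for tokens.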

Erwin.
-- 
Erwin Harte <harte at xs4all.nl>
