[DGD] parse_string()

Felix A. Croes felix at dworkin.nl
Fri Jun 15 14:18:18 CEST 2001


"S. Foley" <s_d_foley at hotmail.com> wrote:

> More parse_string stuff from me I'm afraid.
>
> Here's how I understand how parse_string works.  Please correct me if I'm 
> wrong.  I suspect I have some fundamental misconceptions.
>
> You have a string to parse, some token rules, and some production rules.
> Is the string tokenized immediately with respect to the token rules prior to 
> the production rules?  The concept in my head is that the string is 
> translated into some sequence of tokens, and then the production rules get 
> used to try to recreate the same token sequence.  If the production rules 
> can create a match, then the string may be parsed according to the grammar.  
> I'm starting to feel as if I have it totally backwards though.

This is correct.  The string is tokenized first, and the tokens are then
parsed.
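To illustrate the two-phase behaviour (not DGD code; a small Python sketch with made-up token rules), the tokenizer runs over the whole string first, taking the longest possible match at each position, and only the resulting token sequence is ever seen by the production rules:

```python
import re

# Illustrative token rules only; DGD's actual rules come from the grammar string.
token_rules = [("word", re.compile(r"[a-z]+")),
               ("num",  re.compile(r"[0-9]+"))]

def tokenize(s):
    """One pass over the string, longest match wins at each position."""
    tokens, i = [], 0
    while i < len(s):
        best = None
        for name, rx in token_rules:
            m = rx.match(s, i)
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m)
        if best is None:
            raise ValueError("no token matches at position %d" % i)
        name, m = best
        tokens.append((name, m.group()))
        i = m.end()
    return tokens

# tokenize("abc123") yields [("word", "abc"), ("num", "123")];
# the parser then works on that token list, never on the raw string.
```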


> I have a more practical question that I've tried to solve on and off for a 
> few months now.  I'm trying to write a function that will explode a string 
> with a delimiter specified by a regular expression.  My initial thought was 
> to simply take the regular expression I wanted to use as a delimiter, use a 
> catchall token to catch everything else (like /.+/) and it would be cake.  
> Initially I assumed that there was a way to write the production rules to 
> get around the longest match rule, but since the tokenizing of the string 
> happens prior to the workings of the production rules (err...right?) this is 
> impossible.

Quite.  You can, of course, make a grammar for a regular expression,
but using regular expressions for this is much more efficient.


> Next, I figured I could use /./ instead of /.+/, and just use some LPC 
> functions in the production rule to craft the array.  But that's insane of 
> course, because you are generating an ungodly number of tokens, since the 
> tokenizing occurs prior to the production rules (err... right?).  So while 
> it would kinda work, it's incredibly inefficient.

Correct.


> So I started wondering if every regular expression A had some inverse 
> regular expression B, such that everything A matched, B didn't, and everything 
> A didn't, B did.  If I could generate B from A, then I could easily write 
> the production rules and limit the number of tokens to the bare minimum 
> required...
>
> Where I'm currently stuck is how to express the inverse of the concatenation 
> of two regular expressions, a and b.  Is there some trick to it?  You can 
> assume for this that I am able to express the inverse of a (!a) and the 
> inverse of b (!b).

A very good question.

Each regular expression has an inverse.  This is easy to prove: each
regular expression has a corresponding deterministic finite automaton,
and vice versa.  So the inverse regular expression is the one which
expresses the same DFA, but with the final and non-final states
transposed.
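The transposition step can be sketched in a few lines of Python (an illustration of the proof, not anything parse_string() exposes).  The DFA below is hand-built to accept strings over {a, b} that end in "b"; because its transition function is total, swapping the accepting and non-accepting states yields exactly the complement language:

```python
# A complete DFA over the alphabet {a, b} accepting strings that end in "b".
dfa = {
    "states": {"s0", "s1"},
    "start":  "s0",
    "accept": {"s1"},
    "delta":  {("s0", "a"): "s0", ("s0", "b"): "s1",
               ("s1", "a"): "s0", ("s1", "b"): "s1"},
}

def accepts(d, s):
    state = d["start"]
    for ch in s:
        state = d["delta"][(state, ch)]
    return state in d["accept"]

def complement(d):
    # Transpose final and non-final states.  This inverts the accepted
    # language only because the transition function is total (every
    # state has a transition on every input symbol).
    inv = dict(d)
    inv["accept"] = d["states"] - d["accept"]
    return inv
```

So `accepts(dfa, "ab")` is true while `accepts(complement(dfa), "ab")` is false, and the complement accepts the empty string, which the original rejects.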

Building a DFA for a regular expression is exactly what parse_string()
does.  Unfortunately, the optimal way to compute the inverse of a
regular expression is very close to actually constructing the DFA,
transposing final and non-final states, etc.  This is of course
ridiculous.  If you actually have to build your own DFA to be able
to implement regexplode(), there isn't much point in using
parse_string() in the first place.

So the question becomes, should there be an easier way to use
parse_string() for implementing regexplode()?  The answer to that
question is yes.  I am going to think this over for a bit, to see
what I can do about it.
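For reference, the behaviour a regexplode() is after — splitting a string on every match of a regular-expression delimiter — is what Python provides natively as re.split(); the function name below is just the hypothetical one from this thread:

```python
import re

def regexplode(delimiter, text):
    """Split text on every match of the regular expression delimiter."""
    return re.split(delimiter, text)

# regexplode(r"[0-9]+", "foo12bar345baz") gives ['foo', 'bar', 'baz'].
```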

Regards,
Dworkin

List config page:  http://list.imaginary.com/mailman/listinfo/dgd


