[DGD] parse_string()
Felix A. Croes
felix at dworkin.nl
Fri Jun 15 14:18:18 CEST 2001
"S. Foley" <s_d_foley at hotmail.com> wrote:
> More parse_string stuff from me I'm afraid.
>
> Here's how I understand how parse_string works. Please correct me if I'm
> wrong. I suspect I have some fundamental misconceptions.
>
> You have a string to parse, some token rules, and some production rules.
> Is the string tokenized immediately, according to the token rules, before
> the production rules are applied? The concept in my head is that the string is
> translated into some sequence of tokens, and then the production rules get
> used to try to recreate the same token sequence. If the production rules
> can create a match, then the string may be parsed according to the grammar.
> I'm starting to feel as if I have it totally backwards though.

This is correct. The string is tokenized first, and the tokens are then
parsed.
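
To make that concrete, here is a minimal sketch (a hypothetical grammar
of mine, not one from the mail): the token rules turn "12 + 7" into the
token sequence number plus number, and only then does the production
rule try to derive that sequence.

    static mixed *parse_sum(string str)
    {
        string grammar;

        /* token rules first; the special whitespace token is skipped */
        grammar = "whitespace = /[ ]+/\n" +
                  "number = /[0-9]+/\n" +
                  "plus = /\\+/\n" +
                  "Sum : number plus number\n";   /* production rule */

        /* the matched tokens, ({ "12", "+", "7" }), or nil if the
           string cannot be parsed according to the grammar */
        return parse_string(grammar, str);
    }
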
> I have a more practical question that I've tried to solve on and off for a
> few months now. I'm trying to write a function that will explode a string
> with a delimiter specified by a regular expression. My initial thought was
> to simply take the regular expression I wanted to use as a delimiter, use a
> catchall token to catch everything else (like /.+/) and it would be cake.
> Initially I assumed that there was a way to write the production rules to
> get around the longest-match rule, but since the tokenizing of the string
> happens before the production rules are applied (err... right?), this is
> impossible.

Quite. You can, of course, express a regular expression as production
rules in a grammar, but using regular expressions as token rules is
much more efficient.
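
To illustrate the longest-match problem from the question above
(hypothetical code, assuming a comma delimiter):

    static void demo()
    {
        string grammar;

        /* the naive approach: a delimiter token plus a catchall token */
        grammar = "delim = /,/\n" +
                  "rest = /.+/\n" +
                  "list : rest\n" +
                  "list : list delim rest\n";

        /* Because the tokenizer always takes the longest match, /.+/
           swallows the commas too: "foo,bar" tokenizes as one single
           rest token instead of rest delim rest, so nothing is split;
           the result is just ({ "foo,bar" }). */
        parse_string(grammar, "foo,bar");
    }
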
> Next, I figured I could use /./ instead of /.+/, and just use some LPC
> functions in the production rule to craft the array. But that's insane, of
> course, because you are generating an ungodly number of tokens, since the
> tokenizing occurs before the production rules are applied (err... right?). So while
> it would kinda work, it's incredibly inefficient.

Correct.
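
For concreteness, a sketch of that /./ approach (hypothetical code; as
noted, it generates one token per input character):

    /* glue the accumulated chunk and the next character together */
    static mixed *concat(mixed *parsed)
    {
        return ({ parsed[0] + parsed[1] });
    }

    string *regexplode_slow(string str)
    {
        string grammar;

        grammar = "delim = /,/\n" +    /* listed first, so it wins ties */
                  "char = /./\n" +     /* catchall: one token per character */
                  "list : list delim chunk\n" +
                  "list : chunk\n" +
                  "chunk : chunk char ? concat\n" +
                  "chunk : char\n";

        /* the delimiters still appear in the result; filtering them
           out is left out of this sketch */
        return parse_string(grammar, str);
    }
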
> So I started wondering if every regular expression A had some inverse
> regular expression B, such that everything A matched, B didn't, and everything
> A didn't, B did. If I could generate B from A, then I could easily write
> the production rules and limit the number of tokens to the bare minimum
> required...
>
> Where I'm currently stuck is how to express the inverse of the concatenation
> of two regular expressions, a and b. Is there some trick to it? You can
> assume for this that I am able to express the inverse of a (!a) and the
> inverse of b (!b).

A very good question.

Each regular expression has an inverse. This is easy to prove: each
regular expression has a corresponding deterministic finite automaton,
and vice versa. So the inverse regular expression is the one which
expresses the same DFA, but with the final and non-final states
transposed. (Strictly speaking, the DFA must first be made complete,
with a transition defined for every input symbol in every state,
before the final and non-final states are swapped.)
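
As a toy illustration (my own example, not from the mail): a complete
DFA over the alphabet { 'a', 'b' } that accepts exactly "ab", and its
complement, obtained by swapping the final and non-final states.

    /* states: 0 = start, 1 = seen "a", 2 = seen "ab", 3 = dead state */
    private mapping trans;
    private int *final, *cofinal;

    static void create()
    {
        trans = ([ 0 : ([ 'a' : 1, 'b' : 3 ]),
                   1 : ([ 'a' : 3, 'b' : 2 ]),
                   2 : ([ 'a' : 3, 'b' : 3 ]),
                   3 : ([ 'a' : 3, 'b' : 3 ]) ]);
        final   = ({ 2 });        /* this DFA accepts "ab" and nothing else */
        cofinal = ({ 0, 1, 3 }); /* transposed: accepts everything else */
    }

    /* run the DFA on str from state 0; 1 if it ends in an accepting state */
    int accepts(int *accepting, string str)
    {
        int i, state;

        for (i = 0, state = 0; i < strlen(str); i++) {
            state = trans[state][str[i]];
        }
        return sizeof(({ state }) & accepting) != 0;
    }

Here accepts(final, str) and accepts(cofinal, str) disagree on every
possible string, which is all the proof requires.
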
Building a DFA for a regular expression is exactly what parse_string()
does. Unfortunately, the optimal way to compute the inverse of a
regular expression is very close to actually constructing the DFA,
transposing final and non-final states, etc. This is of course
ridiculous. If you actually have to build your own DFA to be able
to implement regexplode(), there isn't much point in using
parse_string() in the first place.
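
That said, when the delimiter is a plain character class, its inverse
is easy to write by hand, and parse_string() already suffices. A
minimal sketch under that assumption (regexplode_simple and drop_delim
are hypothetical names):

    string *regexplode_simple(string str)
    {
        string grammar;

        grammar = "delim = /[,;]+/\n" +   /* the delimiter */
                  "chunk = /[^,;]+/\n" +  /* its hand-written inverse */
                  "list : list delim chunk ? drop_delim\n" +
                  "list : chunk\n";

        /* strings that begin or end with a delimiter fail to parse;
           handling them is left out of this sketch */
        return parse_string(grammar, str);
    }

    /* cut the delimiter token out of the middle of the value list */
    static mixed *drop_delim(mixed *parsed)
    {
        return parsed[.. sizeof(parsed) - 3] + parsed[sizeof(parsed) - 1 ..];
    }
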
So the question becomes, should there be an easier way to use
parse_string() for implementing regexplode()? The answer to that
question is yes. I am going to think this over for a bit, to see
what I can do about it.

Regards,
Dworkin