[DGD]parse_string

Ludger Merkens balduin at uni-paderborn.de
Fri Jan 28 13:31:05 CET 2000


Hello,

I currently try to use parse_string(a,b) to do some preprocessing of
html files. Have a look at my first version, it should separate
tags "<asd dsa>" from text. As well as stripping some chars like 
!,-,[] and so on.


    grammar = "                     \
whitespace=/[ \t\n\r\v\f]+/         \ /* strip some chars */
whitespace=/[!-\\[\\]|]+/           \ /* some more        */
tag=/<[^<]*>/                       \ /* tags start with <, end with >
term =/[a-zA-Z]+'s/                 \ /* e.g. GNU's */
term =/\\(*[a-zA-Z`']+\\)*/         \ /* Hello or 'Hello' */
eol =/<[^<>]*/                      \ /* < but neither < nor > follows. */
s : tag s                           \
s : term s                          \
s : term                            \
s : eol                             \ /* broken tag at end of chunk */
s : tag                             \
";

First of all, I want to know how the whitespace rules are used, e.g. I 
stumbled over a tag containing \n in spite of the first whitespace rule.
Second I need some details about efficiency of parsing. If I use this
grammar on a string containing html with length of 1000 chars, I have
a cost of about 25240 ticks. Whereas I get for 2000 chars 181664 ticks
and for 3000 even 591632. 

- Which execution costs do I have to expect O(n^2), O(n^3) ... (be n the
  length of the string)

- In how far is the cost dependend on the grammar, are there rules which
  are parsed faster than others? Are there some good books, url's how to
  build a well formed (fast parsed) grammar (kfg)

Do you have a better solution for this problem ?


Ludger Merkens
(balduin at uni-paderborn.de)


List config page:  http://list.imaginary.com/mailman/listinfo/dgd



More information about the DGD mailing list