Minimisation question

Adrian Thurston thurs... at cs.queensu.ca
Thu Sep 14 04:44:24 UTC 2006


Colin, nice work with the XML grammar!

When the kleene star on line 161 is run the machine has an unreachable 
final state with < in its transition list. It's causing a state 
explosion, a rather unnecessary one.

Unreachable states are cleaned up on the fly but some slip through now 
and then. These are reaped at the end of the machine compile and during 
the after-op minimize. It's a mark and sweep operation. Usually 
unreachable states don't cause problems, but you've just found a case 
which proves they do. So in your case just run with -e.

Unfortunately we can't do a full prune of unreachables after every 
operation by default. When compiling machines with many unions or 
concatenations (like test/strings2.rl), constant pruning causes Ragel to 
grind to a halt. We need some kind of selective pruning. Either that or 
I need to fix the on-the-fly cleanup. Needless to say I'll work on it.

Thanks for your investigation!

Cheers,
  Adrian

Colin Fleming wrote:
> Hi Adrian,
> 
> Ok, great that there's no downside to it. Here's some stats (I've
> attached the grammar in case you're interested, it's not finished but
> it's close):
> 
> The part that gives the problem is the doctype. The smallest part I
> could get to compile was doctypedecl:
> 
> No minimisation:
> 
> time ragel asciixml.rl -s -n -M doctypedecl | rlcodegen -V > test.dot
> fsm name  : AsciiXml
> num states: 1361
> 
> real    1m58.401s
> user    1m56.850s
> sys     0m1.481s
> 
> Memory use peaks at about 765MB. If I try and use the next largest
> production it allocates up to 1.8GB and then malloc fails.
> 
> With minimisation:
> 
> time ragel asciixml.rl -s -M doctypedecl | rlcodegen -V > test.dot
> fsm name  : AsciiXml
> num states: 269
> 
> real    1m58.358s
> user    1m56.792s
> sys     0m1.453s
> 
> More or less the same time and memory usage.
> 
> However with incremental minimisation:
> 
> time ragel asciixml.rl -s -e -M doctypedecl | rlcodegen -V > test.dot
> fsm name  : AsciiXml
> num states: 269
> 
> real    0m0.076s
> user    0m0.069s
> sys     0m0.010s
> 
> It's practically instantaneous and works a charm. It also easily
> allows me to compile the whole grammar, which is significantly more
> complex:
> 
> time ragel asciixml.rl -s -e | rlcodegen -V > test.dot
> fsm name  : AsciiXml
> num states: 445
> 
> real    0m0.124s
> user    0m0.119s
> sys     0m0.010s
> 
> This is a cut-down grammar that only uses ASCII characters, the full
> XML spec requires Unicode, this makes the machine much more complex
> because all the character ranges are treated properly (i.e. the same
> number of states but a lot more transitions). Using incremental
> minimisation allows that machine to be generated in 2.199s.
> 
> Assuming it's reliable, I can't see a reason not to use it, or to have
> it turned on by default.
> 
> Cheers,
> Colin
> 
> 
> 



More information about the ragel-users mailing list