speed vs. re2c?

Thu Oct 5 23:32:15 UTC 2006

Cool -- I'm glad to know it's competitive with re2c.  When I went to
look up what '-G' does, I was also happy to see that there are lots of
options for how the code is generated.

Let me explain why I was interested in re2c, and why I'm now interested
in Ragel.  Many people I've talked to think this idea is crap, so I
won't be offended if you do too, but I really believe in it.

Text processing is one of the most common bottlenecks in high-level
languages.  The regular expression engines that are built into
languages like Perl, Ruby, Python, etc. are useful for pattern matching
on isolated strings, but aren't optimal for the case where you want to
parse a file in a known format, beginning to end.  If you
designed a library specifically for this use case, you could get lots
of nice benefits like:

* its API could be more along the lines of what you want: set up a
bunch of patterns and rules, then set the library in motion on an input
stream.

* you could write an optimized buffering layer that keeps a
configurable number of trailing tokens in memory at once.

* you could use a library like Ragel to generate goto-based scanners at
runtime, so that you could get the performance improvements over the
table-based scanners that existing regex engines use.  Basically I am
proposing to use Ragel as the backend for a regex JIT.
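
To make the table-based vs. goto-based distinction concrete, here is a
rough hand-written sketch in C of the same tiny scanner (the pattern
[0-9]+) done both ways.  This is not Ragel output, just an illustration
of the two styles; the names scan_table and scan_goto and the state
layout are made up for this example.

```c
#include <stddef.h>

/* Table-driven version: the DFA lives in data, and one generic loop
 * interprets it.  States and character classes are illustrative. */
enum { START = 0, DIGITS = 1, DEAD = 2, NSTATES = 3 };
enum { CLS_DIGIT = 0, CLS_OTHER = 1, NCLASSES = 2 };

static const unsigned char transitions[NSTATES][NCLASSES] = {
    /* START  */ { DIGITS, DEAD },
    /* DIGITS */ { DIGITS, DEAD },
    /* DEAD   */ { DEAD,   DEAD },
};

static int char_class(char c)
{
    return (c >= '0' && c <= '9') ? CLS_DIGIT : CLS_OTHER;
}

/* Returns the length of the longest prefix matching [0-9]+, or 0. */
size_t scan_table(const char *p, size_t len)
{
    int state = START;
    size_t i;

    for (i = 0; i < len; i++) {
        int next = transitions[state][char_class(p[i])];
        if (next == DEAD)
            break;
        state = next;
    }
    return state == DIGITS ? i : 0;
}

/* Goto-driven version: each state is a label, so the current state is
 * encoded in the program counter and no table lookup is needed. */
size_t scan_goto(const char *p, size_t len)
{
    const char *cur = p;
    const char *end = p + len;

    /* start state: need at least one digit */
    if (cur == end)
        goto fail;
    if (*cur >= '0' && *cur <= '9') { cur++; goto st_digits; }
    goto fail;

st_digits:
    /* accepting state: consume further digits while they last */
    if (cur == end)
        goto accept;
    if (*cur >= '0' && *cur <= '9') { cur++; goto st_digits; }
    goto accept;

accept:
    return (size_t)(cur - p);
fail:
    return 0;
}
```

Both functions return the same results; the difference is only in how
the state machine is represented.  A regex JIT along the lines I'm
describing would emit something shaped like the second function for
each set of patterns.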

You could compile to C and then use an embedded C compiler (like
libtcc) to compile to machine code.  Personally I would be more
interested in generating assembly code directly, since it wouldn't be
as heavyweight a process and would give you the opportunity to optimize
better than the C compiler, since you are working within a very narrow
problem domain.

I don't know when I'd actually get to this, but I'm very interested in
seeing it done, and will probably try to use Ragel in this way at some
point.  What do you think?

Josh