[ragel-users] Maintaining char & line counts in a scanner

Adrian Thurston adrian.thurston at esentire.com
Fri Apr 23 18:31:12 UTC 2010


Hi Joe,

There are a few approaches to this problem. The simplest is to count 
newlines in the matched text in every match action. The downside is 
that you pass over every token twice.
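
As a rough sketch of that first approach, assuming a C host program: 
each match action hands the matched text to a helper that walks it once 
more and updates the counts. The struct and function names here are 
hypothetical, they are not part of Ragel.

```c
#include <assert.h>

/* Hypothetical position state kept by the host program. */
struct position {
    int line;    /* 1-based line of the next unread character */
    int column;  /* 1-based column of the next unread character */
};

/* Advance pos over the matched text [ts, te).
 * This is the second pass over the token mentioned above. */
static void advance_position(struct position *pos,
                             const char *ts, const char *te)
{
    assert(ts <= te);
    for (const char *c = ts; c < te; c++) {
        if (*c == '\n') {
            pos->line++;
            pos->column = 1;   /* a new line starts after the newline */
        } else {
            pos->column++;
        }
    }
}
```

In a scanner, every match action would call something like 
advance_position(&pos, ts, te), including the actions for whitespace 
and other discarded input, so that nothing is skipped.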

If a second pass over each token is something you'd like to avoid, then 
you can go down the sub-scanner road. Basically, any pattern that can 
contain a newline, such as multi-line comments, or strings, can be 
implemented with a sub-scanner. In the main scanner you write a pattern 
for whatever sequence of characters takes you into comments, for 
example, then jump into a separate scanner for comments. The cost is 
that a comment arrives in pieces, rather than as one whole match.
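
To illustrate the shape of the sub-scanner idea without any Ragel 
syntax, here is a hand-written C stand-in. It is only an analogy for 
what the generated scanners would do, and every name in it is made up 
for the example. The main loop recognises the comment opener and hands 
over to a second loop, which reports the comment in pieces and sees 
each newline exactly once.

```c
#include <assert.h>

/* Hypothetical result type for this sketch. */
struct scan_result {
    int lines;           /* newlines seen, each counted exactly once */
    int comment_pieces;  /* fragments reported by the comment scanner */
};

/* Comment sub-scanner: entered just after the opener. Returns a pointer
 * just past the closer, or pe if the comment is unterminated. */
static const char *scan_comment(const char *p, const char *pe,
                                struct scan_result *r)
{
    while (p < pe) {
        if (*p == '\n') {
            r->lines++;            /* a newline is its own piece here */
            r->comment_pieces++;
            p++;
        } else if (*p == '*' && p + 1 < pe && p[1] == '/') {
            return p + 2;          /* closer: back to the main scanner */
        } else {
            /* one piece: text up to the next newline or closer */
            while (p < pe && *p != '\n' &&
                   !(*p == '*' && p + 1 < pe && p[1] == '/'))
                p++;
            r->comment_pieces++;
        }
    }
    return pe;
}

/* Main scanner: only looks for the opener and for top-level newlines. */
static struct scan_result scan(const char *p, const char *pe)
{
    struct scan_result r = { 0, 0 };
    assert(p <= pe);
    while (p < pe) {
        if (*p == '/' && p + 1 < pe && p[1] == '*') {
            p = scan_comment(p + 2, pe, &r);
        } else {
            if (*p == '\n')
                r.lines++;
            p++;
        }
    }
    return r;
}
```

The comment text is never matched as a single token, which is the 
trade-off described above.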

A third approach is to write patterns that count newlines as they go. 
This is my favourite approach. The only worry is backtracking. If your 
scanner patterns backtrack over newlines, then you've got double 
counting happening. With a well-designed scanner, this isn't normally a 
problem though. Try something like this:

counter = ( any | '\n' @inc )*;
comment = ( '/*' any* :>> '*/' ) & counter;

Or embed the counting directly in the pattern, in which case the 
intersection with counter is no longer needed:

comment = '/*' ( any | '\n' @inc )* :>> '*/';
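
The patterns above leave the body of the inc action open. One possible 
body, sketched in C under the assumption that the whole input sits in a 
single buffer, bumps the line number and remembers where the new line 
starts, so that a column can be derived from any later pointer. The 
struct and function names are hypothetical.

```c
#include <assert.h>

/* Hypothetical tracker kept alongside the scanner state. */
struct line_tracker {
    int line;                /* 1-based number of the current line */
    const char *line_start;  /* first character of the current line */
};

/* Possible body for an inc-style action: p points at the '\n'
 * the machine has just consumed. */
static void on_newline(struct line_tracker *t, const char *p)
{
    t->line++;
    t->line_start = p + 1;
}

/* 1-based column of a position at or after the current line start. */
static int column_of(const struct line_tracker *t, const char *at)
{
    assert(at >= t->line_start);
    return (int)(at - t->line_start) + 1;
}
```

The action could then read action inc { on_newline(&tracker, fpc); }. 
Note that a token spanning several lines has already advanced the 
tracker by the time its match action runs, so the token's starting line 
and column would need to be captured before its newlines are counted.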

-Adrian

> Hi All,
> 
> I'm using ragel as a scanner to tokenise input for parsing of a database query language. I'd like to maintain a line number and character offset in the struct that represents a matched token but I'm having a little difficulty.
> 
> My idea would be to have two expressions - one that matches a newline and one that matches any other character. Clearly there would be an associated action with these expressions to maintain variables for the line and char count. Currently I have various expressions, some of which can potentially match multiple newlines (think multi-line comments), and some of which consume dead input (whitespace). I have played around with keeping a tally of the counts on each successful match of a token (outside of the machine exec), but as in some cases I am discarding input completely within the state machine and not creating a token, it becomes difficult to track... Ideally, I'd like to keep it all within the machine, but can't see the best way to proceed.
> 
> Any help or pointers would be much appreciated.
> 
> Cheers,
> -Joe
> _______________________________________________
> ragel-users mailing list
> ragel-users at complang.org
> http://www.complang.org/mailman/listinfo/ragel-users
> 
