Primitive lookahead question

Wincent Colaiuta w... at
Thu Sep 20 18:01:04 UTC 2007


I'm trying to parse the output of "git diff" and in particular lines  
which look like this:

   diff --git a/my file b/my file

Where "a/my file" is the "from file" and "b/my file" is the "to  
file". This is slightly tricky because as you can see there are no  
delimiters between the two paths other than a space, but spaces are  
also allowed inside the paths (and Git only uses quotation marks here  
when the filenames contain embedded tabs, newlines, double-quotes or  
backslash charcters).

This means that the only sign that the "from file" has ended and the  
"to file" has begun is when you hit " b/", but by the time you see  
that you're already inside the "to file" part. So I made rules to  
capture the "from file" and the "to file", but my initial attempt at  
a "from file" rule was broken:

   from_file = "a/" (any+ -- " b/") ;

The resulting state machine (quite correctly) takes input like:

   a/hello b/world

And identifies the "from file" as:

   a/hello b

Which is not what we want. One tactic is mash the "from_file" and  
"to_file" rules into a single rule:

   from_to_files = "a/" (any - linefeed)+ " b/" (any - linefeed)+ ;

But that produces a fairly ugly DFA (especially when you add in rules  
for parsing quotes filenames with embedded escape sequences). So I  
tried to implement a primitive form of manual lookahead as follows:

   from_file = "a/" (any - linefeed)+ %store " b/" @jumpback;

Where "store" is an action which records the recognized file and  
"jumpback" is just:

   action jumpback { p -= 3; }

The idea being that I have to "lookahead" and see the " b/" to know  
that the "from file" has been scanned, but then bump the current  
character pointer back by three so that the machine can resume  
scanning and looking for the "to file".

The generated DFA for the rule looks correct to me and isn't too ugly  
(7 states, about 14 transitions). Is my approach ok, or is there a  
better way?

Apart from that the format I am trying to parse is totally regular,  
unambiguous, and can be parsed without backtracking, which is nice  
for a change!


More information about the ragel-users mailing list