[ragel-users] Primitive lookahead question

Adrian Thurston thurs... at cs.queensu.ca
Thu Sep 20 19:30:43 UTC 2007


Hi Wincent,

I would suggesting using your first solution, then manually shorten the
from_file by 2 characters. It's a simple solution which avoids more
elaborate backtracking/scanning approaches.

What does git produce if the from file contains " b/"?

-Adrian

Wincent Colaiuta wrote:
> Hi,
> 
> I'm trying to parse the output of "git diff" and in particular lines  
> which look like this:
> 
>    diff --git a/my file b/my file
> 
> Where "a/my file" is the "from file" and "b/my file" is the "to  
> file". This is slightly tricky because as you can see there are no  
> delimiters between the two paths other than a space, but spaces are  
> also allowed inside the paths (and Git only uses quotation marks here  
> when the filenames contain embedded tabs, newlines, double-quotes or  
> backslash charcters).
> 
> This means that the only sign that the "from file" has ended and the  
> "to file" has begun is when you hit " b/", but by the time you see  
> that you're already inside the "to file" part. So I made rules to  
> capture the "from file" and the "to file", but my initial attempt at  
> a "from file" rule was broken:
> 
>    from_file = "a/" (any+ -- " b/") ;
> 
> The resulting state machine (quite correctly) takes input like:
> 
>    a/hello b/world
> 
> And identifies the "from file" as:
> 
>    a/hello b
> 
> Which is not what we want. One tactic is mash the "from_file" and  
> "to_file" rules into a single rule:
> 
>    from_to_files = "a/" (any - linefeed)+ " b/" (any - linefeed)+ ;
> 
> But that produces a fairly ugly DFA (especially when you add in rules  
> for parsing quotes filenames with embedded escape sequences). So I  
> tried to implement a primitive form of manual lookahead as follows:
> 
>    from_file = "a/" (any - linefeed)+ %store " b/" @jumpback;
> 
> Where "store" is an action which records the recognized file and  
> "jumpback" is just:
> 
>    action jumpback { p -= 3; }
> 
> The idea being that I have to "lookahead" and see the " b/" to know  
> that the "from file" has been scanned, but then bump the current  
> character pointer back by three so that the machine can resume  
> scanning and looking for the "to file".
> 
> The generated DFA for the rule looks correct to me and isn't too ugly  
> (7 states, about 14 transitions). Is my approach ok, or is there a  
> better way?
> 
> Apart from that the format I am trying to parse is totally regular,  
> unambiguous, and can be parsed without backtracking, which is nice  
> for a change!
> 
> Cheers,
> Wincent
> 
> 
> --~--~---------~--~----~------------~-------~--~----~
> You received this message because you are subscribed to the Google Groups "ragel-users" group.
> To post to this group, send email to ragel-users at googlegroups.com
> To unsubscribe from this group, send email to ragel-users-unsubscribe at googlegroups.com
> For more options, visit this group at http://groups.google.com/group/ragel-users?hl=en
> -~----------~----~----~----~------~----~------~--~---
> 
> 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 252 bytes
Desc: OpenPGP digital signature
URL: <http://www.colm.net/pipermail/ragel-users/attachments/20070920/0b139318/attachment-0001.sig>


More information about the ragel-users mailing list