[ragel] Handling non-determinism with backtracking

Adrian Thurston thurston at colm.net
Wed Oct 18 21:41:47 UTC 2017


Hi Rick,

The last time I wrote a mime parser with ragel all the parser had to do 
was to find the boundaries and possibly emit the parsed sections to a 
chained to parser. When a boundary match failed the content matched so 
far became an emitted text line and I could just use the prefix of the 
boundary that matched so far.

If you need to backtrack to parse see the section in the manual on 
buffer boundaries and backtracking. It's not straightforward, but if the 
common case is that you will match the text and only occasionally need 
to backtrack it can be very fast.

Glad you like the tool!

Best regards,
  Adrian

On 2017-10-07 16:32, Rick van Rein wrote:
> Hello,
> 
> I finally have a good application of Ragel :-D and really enjoy
> capturing aspects of email grammar into orthogonal regular expressions!
> 
> I am working on a Lenient DKIM extension, and need to detect email
> headers and MIME structure and body parts for that purpose.  Since MIME
> bodies are the only recursive bits in email syntax, Ragel appears to be
> an excellent tool.
> 
> MIME body parts are separated with --boundary, which I have expressed 
> as
> the following form, where the boundary_string will become a comparison
> with a previously collected string.  At any point in a message, there 
> is
> no more than one candidate boundary_string to compare to.
> 
> mime_multipart_boundary =
>     crlf
>     '-'{2}
>     boundary_string
>     lin_wsp*
>     crlf ;
> 
> I would like to backtrack when the boundary_string or other parts of
> this rule mismatch.  In my current syntax there is a
> non-deterministically combining alternative for the last boundary
> string, which I expect to merge with the one above without further
> problems, and otherwise I might do it manually,
> 
> mime_multipart_terminal =
>     crlf
>     '-'{2}
>     boundary_string
>     '-'{2}
>     lin_wsp*
>     crlf ;
> 
> But when these get together with normal text (that might follow the 
> same
> format) I get uncertain about how to arrange backtracking when the
> boundary_string or other parts of the mime_multipart_boundary or
> _terminal mismatch:
> 
> mime_multipart_content =
>     mime_illegal_text?
>     ( mime_multipart_boundary . mime_bodypart ) +
>     mime_multipart_terminal
>     mime_illegal_text? ;
> 
> Ideally, at least if Ragel is intended as I understand it, the
> alternative recongition as mime_bodypart should be orthogonal, and so I
> would like to have a way of backtracking into it.  My reading of the
> manual says this is not possible in an orthogonal way, but only through
> in-situ cross-links and structural changes.
> 
> If this is indeed not orthogonal, then I would suggest a feature to 
> make
> this possible -- such as an action to execute on mismatches, and a way
> to rerun a parallel expression without involving the current one,
> perhaps.  That might bring about a state explosion, so it may need some
> cleverness :)
> 
> 
> Other than this, I found Ragel to be really, really helpful by allowing
> me to add constraints orthogonally, like in
> 
> whatever_content_wants & utf8char* ;
> 
> where utf8char bans illegal and longer-than-usual sequences,
> 
> utf8char =
>     0x00..0x7f |
>     0xc2..0xdf . 0x80..0xbf |
>     0xe0       . 0xa0..0xbf . 0x80..0xbf |
>     0xe1..0xef . 0x80..0xbf . 0x80..0xbf |
>     0xf4..0xff . 0x80..0xbf . 0x80..0xbf | 0x80..0xbf ;
> 
> Thank you for that bright approach!
> 
> 
> Cheers,
>  -Rick
> 
> _______________________________________________
> ragel mailing list
> ragel at colm.net
> http://www.colm.net/cgi-bin/mailman/listinfo/ragel




More information about the ragel-users mailing list