[ragel] Handling non-determinism with backtracking

Rick van Rein rick at openfortress.nl
Sat Oct 7 19:32:34 UTC 2017


Hello,

I finally have a good application of Ragel :-D and really enjoy
capturing aspects of email grammar into orthogonal regular expressions!

I am working on a Lenient DKIM extension, and need to detect email
headers and MIME structure and body parts for that purpose.  Since MIME
bodies are the only recursive bits in email syntax, Ragel appears to be
an excellent tool.

MIME body parts are separated with --boundary, which I have expressed as
the following form, where the boundary_string will become a comparison
with a previously collected string.  At any point in a message, there is
no more than one candidate boundary_string to compare to.

mime_multipart_boundary =
    crlf
    '-'{2}
    boundary_string
    lin_wsp*
    crlf ;

I would like to backtrack when the boundary_string or other parts of
this rule mismatch.  In my current syntax there is a
non-deterministically combining alternative for the last boundary
string, which I expect to merge with the one above without further
problems, and otherwise I might do it manually,

mime_multipart_terminal =
    crlf
    '-'{2}
    boundary_string
    '-'{2}
    lin_wsp*
    crlf ;

But when these get together with normal text (that might follow the same
format) I get uncertain about how to arrange backtracking when the
boundary_string or other parts of the mime_multipart_boundary or
_terminal mismatch:

mime_multipart_content =
    mime_illegal_text?
    ( mime_multipart_boundary . mime_bodypart ) +
    mime_multipart_terminal
    mime_illegal_text? ;

Ideally, at least if Ragel is intended as I understand it, the
alternative recongition as mime_bodypart should be orthogonal, and so I
would like to have a way of backtracking into it.  My reading of the
manual says this is not possible in an orthogonal way, but only through
in-situ cross-links and structural changes.

If this is indeed not orthogonal, then I would suggest a feature to make
this possible -- such as an action to execute on mismatches, and a way
to rerun a parallel expression without involving the current one,
perhaps.  That might bring about a state explosion, so it may need some
cleverness :)


Other than this, I found Ragel to be really, really helpful by allowing
me to add constraints orthogonally, like in

whatever_content_wants & utf8char* ;

where utf8char bans illegal and longer-than-usual sequences,

utf8char =
    0x00..0x7f |
    0xc2..0xdf . 0x80..0xbf |
    0xe0       . 0xa0..0xbf . 0x80..0xbf |
    0xe1..0xef . 0x80..0xbf . 0x80..0xbf |
    0xf4..0xff . 0x80..0xbf . 0x80..0xbf | 0x80..0xbf ;

Thank you for that bright approach!


Cheers,
 -Rick




More information about the ragel-users mailing list