[ragel-users] simple parser for #include statements

Adrian Thurston thurston at colm.net
Wed Apr 25 13:45:02 UTC 2018


Hi Mark,

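The thing to remember here is that a scanner always tries for the 
longest match possible, and only when matches are of equal length does 
it choose the pattern that appears first. In this case a line that 
begins with '/*' is also matched by the dnl rule at the end, and that 
match runs through to the newline, so it is longer than the 
two-character '/*' match. The dnl rule therefore takes precedence over 
the '/*' rule and the fgoto is not reached on such a line. It doesn't 
interfere with the include matching rule because that rule also has a 
dnl at the end, so both match the same length and the include rule, 
which appears first, wins.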

For the catch-all, use just the any machine instead of dnl. It goes one 
char at a time, which may seem less efficient, but ragel does its best 
to optimize this.
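
To make the two points above concrete, here is a small C++ sketch of the 
selection rule only. It is not ragel and not what ragel generates; the 
names Rule, pick, rulesWithDnlCatchAll and rulesWithAnyCatchAll are made 
up for illustration, and each pattern is hand-coded as a function that 
returns the length of its match at the start of the input.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical model of one scanner alternative: a name plus a function
// returning the length matched at the start of the input (0 = no match).
struct Rule {
    std::string name;
    std::function<std::size_t(const std::string&)> match;
};

// dnl: everything up to and including the first newline.
std::size_t matchDnl(const std::string& s) {
    std::size_t nl = s.find('\n');
    return nl == std::string::npos ? 0 : nl + 1;
}

// '/*': the two-character comment opener.
std::size_t matchCommentOpen(const std::string& s) {
    return s.compare(0, 2, "/*") == 0 ? 2 : 0;
}

// '//' dnl: a one-line comment through to the newline.
std::size_t matchLineComment(const std::string& s) {
    return s.compare(0, 2, "//") == 0 ? matchDnl(s) : 0;
}

// any: exactly one character.
std::size_t matchAny(const std::string& s) {
    return s.empty() ? 0 : 1;
}

// Longest match wins; the strict '>' keeps the earlier rule on a tie.
std::string pick(const std::vector<Rule>& rules, const std::string& input) {
    std::string winner = "(none)";
    std::size_t best = 0;
    for (const Rule& r : rules) {
        std::size_t len = r.match(input);
        if (len > best) {
            best = len;
            winner = r.name;
        }
    }
    return winner;
}

// Rule order as in the first grammar: dnl is the catch-all.
std::vector<Rule> rulesWithDnlCatchAll() {
    return {{"// dnl", matchLineComment},
            {"/*", matchCommentOpen},
            {"dnl", matchDnl}};
}

// Same order, but with any as the catch-all.
std::vector<Rule> rulesWithAnyCatchAll() {
    return {{"// dnl", matchLineComment},
            {"/*", matchCommentOpen},
            {"any", matchAny}};
}
```

With dnl as the catch-all, pick() on a line such as "/* x */\n" selects 
dnl, since the whole line is longer than '/*'. With any as the catch-all 
the longest competing match is one character, so '/*' is selected. A 
"// x\n" line is a tie between '//' dnl and dnl, and the earlier rule is 
kept.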

With regard to the slightly tighter machine that you mentioned, it would 
be interesting to see the full before and after grammars to see what's 
going on. On their own the two dnl definitions produce the same machine, 
but in the context of something larger there might be something 
preventing that, or it could be a missed opportunity for optimization.

-Adrian

On 2018-04-23 02:29, Mark Olesen wrote:
> Background:
> In OpenFOAM (www.OpenFOAM.com) we have a flex-based dependency parser.
> It simply goes through the file, finds all the #include "file..."
> statements and in turn processes each of them. It has some internal
> hashing and a few other bits that make it faster than 'cpp -M'.
> However, this flex solution has its own problems, one of which is
> that its internal buffer switching means that we can quickly exceed
> 1024 open file descriptors and there doesn't seem to be a way to close
> them after processing a file.
> 
> I thus had a run at writing a ragel-based version that executes about
> 60% faster than the flex-based version and also does a better job of
> closing file descriptors. I was pleased to have found an example to
> work from
> (https://github.com/danmiley/ragel/blob/master/examples/cppscan.rl).
> 
> Problem at hand:
> In a stripped down version, I have the following grammar snippet:
> 
> %%{
>     machine wmkdep;
> 
>     action  buffer  { tok = p; /* Local token start */ }
>     action  process { processFile(std::string(tok, (p - tok))); }
> 
>     white   = [ \t\f\r];        # Horizontal whitespace
>     nl      = white* '\n';      # Newline
>     dnl     = [^\n]* '\n';      # Discard up to and including newline
> 
>     comment := any* :>> '*/' @{ fgoto main; };
> 
>     main := |*
> 
>         space*; # Discard whitespace, empty lines
> 
>         white* '#' white* 'include' white*
>             ('"' [^\"]+ >buffer %process '"') dnl;
> 
>         '//' dnl;                # 1-line comment
>         '/*' { fgoto comment; }; # Multi-line comment
> 
>         dnl;                            # Discard all other lines
> 
>     *|;
> }%%
> 
> However, the stripping of multi-line C-comments was failing and any
> #include ... mentioned in a comment was also being seen.
> 
> I figured that the example that I'd found with fgoto must be the right
> way, but maybe it wasn't switching at the correct parse point so I
> experimented with this instead:
> 
>    comment := any* :>> '*/' %{ fgoto main; };
> 
> But it was still parsing (not stripping) the c-comment.
> Finally, I did away with the fgoto and coded it straight up:
> 
> 
> 
> %%{
>     machine wmkdep;
> 
>     action  buffer  { tok = p; /* Local token start */ }
>     action  process { processFile(std::string(tok, (p - tok))); }
> 
>     white   = [ \t\f\r];        # Horizontal whitespace
>     nl      = white* '\n';      # Newline (allow trailing whitespace)
>     dnl     = (any* -- '\n') '\n';  # Discard up to and including newline
> 
>     dquot   = '"';              # double quote
>     dqarg   = (any+ -- dquot);  # double quoted argument
> 
>     main := |*
> 
>         space*;      # Discard whitespace, empty lines
> 
>         white* '#' white* 'include' white*
>             (dquot dqarg >buffer %process dquot) dnl;
> 
>         '//' dnl;               # 1-line comment
>         '/*' any* :>> '*/';     # Multi-line comment
> 
>         dnl;                    # Discard all other lines
> 
>     *|;
> }%%
> 
> 
> I'm fine with this solution. It strips the c-comments as I wanted, but
> I'd like to understand why the first attempt failed.
> 
> Additionally, I found the behaviour of 'dnl' construction (same name
> and behaviour as m4 dnl) rather intriguing. Since the purpose is to
> delete through to and including the newline, I'd expressed it like
> this:
> 
>     dnl = [^\n]* '\n';
> 
> However, I found that the following version
> 
>     dnl = (any* -- '\n') '\n';
> 
> produced a machine that was slightly tighter. I'd have thought that
> the matching would be identical, but the first 'dnl' variant had an
> additional intermediate stage in the machine. All machines were
> generated with ragel 6.9 (since that's what opensuse leap 42.3 ships
> with).
> 
> /mark
> 
