[ragel-users] simple parser for #include statements

Mon Apr 23 06:29:36 UTC 2018

Background:
In OpenFOAM (www.OpenFOAM.com) we have a flex-based dependency parser. 
It simply goes through the file, finds all the #include "file..." 
statements and in turn processes each of them. It has some internal 
hashing and a few other bits that make it faster than 'cpp -M'.
However, this flex solution has its own problems, one of which is that 
its internal buffer switching means that we can quickly exceed 1024 open 
file descriptors and there doesn't seem to be a way to close them after 
processing a file.

I thus had a run at writing a ragel-based version that executes about 
60% faster than the flex-based version and also does a better job of 
closing file descriptors. I was pleased to have found an example to work 
from (https://github.com/danmiley/ragel/blob/master/examples/cppscan.rl).
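
To make the descriptor issue concrete, here is a minimal sketch of the kind of traversal I mean. It is not the actual OpenFOAM code: the names readWholeFile and visitFile are made up for this mail, and the include search is deliberately naive (no comment handling). The point is only that each file is read and closed before any nested file is opened, and that a hash set stops repeated visits.

```cpp
#include <fstream>
#include <sstream>
#include <string>
#include <unordered_set>
#include <vector>

// Read a whole file into memory. The std::ifstream is destroyed on return,
// so its descriptor is released before any nested file is opened.
bool readWholeFile(const std::string& path, std::string& content)
{
    std::ifstream is(path, std::ios::binary);
    if (!is) return false;
    std::ostringstream buf;
    buf << is.rdbuf();
    content = buf.str();
    return true;
}

// Visit 'path' and, recursively, every file it names in #include "...".
// 'visited' is the hash set that stops repeats and include cycles;
// 'order' records each readable file once, in the order it was first read.
void visitFile
(
    const std::string& path,
    std::unordered_set<std::string>& visited,
    std::vector<std::string>& order
)
{
    if (!visited.insert(path).second) return;   // Already handled

    std::string content;
    if (!readWholeFile(path, content)) return;  // Unreadable file: skip it
    order.push_back(path);

    // Deliberately naive search; a real scanner must also skip comments.
    const std::string key = "#include \"";
    std::string::size_type pos = 0;
    while ((pos = content.find(key, pos)) != std::string::npos)
    {
        const std::string::size_type beg = pos + key.size();
        const std::string::size_type end = content.find('"', beg);
        if (end == std::string::npos) break;    // Unterminated quote
        visitFile(content.substr(beg, end - beg), visited, order);
        pos = end + 1;
    }
}
```

With this layout at most one descriptor is open at a time, at the cost of holding one file's content in memory per recursion level.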

Problem at hand:
In a stripped-down version, I have the following grammar snippet:

%%{
     machine wmkdep;

     action  buffer  { tok = p; /* Local token start */ }
     action  process { processFile(std::string(tok, (p - tok))); }

     white   = [ \t\f\r];        # Horizontal whitespace
     nl      = white* '\n';      # Newline
     dnl     = [^\n]* '\n';      # Discard up to and including newline

     comment := any* :>> '*/' @{ fgoto main; };

     main := |*

         space*; # Discard whitespace, empty lines

         white* '#' white* 'include' white*
             ('"' [^\"]+ >buffer %process '"') dnl;

         '//' dnl;                # 1-line comment
         '/*' { fgoto comment; }; # Multi-line comment

         dnl;                            # Discard all other lines

     *|;
}%%

However, the stripping of multi-line C-comments was failing and any 
#include ... mentioned in a comment was also being seen.

I figured that the example that I'd found with fgoto must be the right 
way, but maybe it wasn't switching at the correct parse point so I 
experimented with this instead:

    comment := any* :>> '*/' %{ fgoto main; };

But it was still parsing (not stripping) the C comment.
Finally, I did away with the fgoto and coded it straight up:

%%{
     machine wmkdep;

     action  buffer  { tok = p; /* Local token start */ }
     action  process { processFile(std::string(tok, (p - tok))); }

     white   = [ \t\f\r];        # Horizontal whitespace
     nl      = white* '\n';      # Newline (allow trailing whitespace)
     dnl     = (any* -- '\n') '\n';  # Discard up to and including newline

     dquot   = '"';              # double quote
     dqarg   = (any+ -- dquot);  # double quoted argument

     main := |*

         space*;      # Discard whitespace, empty lines

         white* '#' white* 'include' white*
             (dquot dqarg >buffer %process dquot) dnl;

         '//' dnl;               # 1-line comment
         '/*' any* :>> '*/';     # Multi-line comment

         dnl;                    # Discard all other lines

     *|;
}%%

I'm fine with this solution. It strips the C comments as I wanted, but 
I'd like to understand why the first attempt failed.
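
For comparison, the behaviour I am after can be written out by hand in plain C++. This is only a sketch of the intent, not ragel output and not equivalent to the scanner in every corner case (for instance, it requires the quoted name to end on the same line). The function name scanIncludes is made up for this mail.

```cpp
#include <string>
#include <vector>

// Hand-written approximation of what the scanner is meant to do:
// collect the names in #include "..." directives, ignoring anything
// inside '//' or '/* ... */' comments that start a token.
std::vector<std::string> scanIncludes(const std::string& src)
{
    typedef std::string::size_type size_type;

    std::vector<std::string> found;
    const size_type n = src.size();
    size_type i = 0;

    auto isWhite = [](char c)
    {
        return c == ' ' || c == '\t' || c == '\f' || c == '\r';
    };

    while (i < n)
    {
        // Discard whitespace and empty lines
        while (i < n && (isWhite(src[i]) || src[i] == '\n' || src[i] == '\v'))
        {
            ++i;
        }
        if (i >= n) break;

        if (src.compare(i, 2, "/*") == 0)
        {
            // Multi-line comment: resume right after the first "*/"
            const size_type end = src.find("*/", i + 2);
            if (end == std::string::npos) break;  // Unterminated comment
            i = end + 2;
            continue;
        }

        // Everything else is handled one line at a time
        size_type eol = src.find('\n', i);
        if (eol == std::string::npos) eol = n;

        if (src[i] == '#')
        {
            size_type j = i + 1;
            while (j < eol && isWhite(src[j])) ++j;
            if (src.compare(j, 7, "include") == 0)
            {
                j += 7;
                while (j < eol && isWhite(src[j])) ++j;
                if (j < eol && src[j] == '"')
                {
                    const size_type close = src.find('"', j + 1);
                    if (close != std::string::npos && close < eol && close > j + 1)
                    {
                        found.push_back(src.substr(j + 1, close - j - 1));
                    }
                }
            }
        }

        // '//' comments and all other lines: discard through the newline
        i = (eol < n) ? eol + 1 : n;
    }

    return found;
}
```

Feeding it a buffer in which an #include "..." line sits inside a multi-line comment should return only the names that appear outside comments.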

Additionally, I found the behaviour of the 'dnl' construction (same name and 
behaviour as m4 dnl) rather intriguing. Since the purpose is to delete 
through to and including the newline, I'd expressed it like this:

     dnl = [^\n]* '\n';

However, I found that the following version

     dnl = (any* -- '\n') '\n';

produced a machine that was slightly tighter. I'd have thought that the 
matching would be identical, but the first 'dnl' variant had an 
additional intermediate stage in the machine. All machines were 
generated with ragel 6.9 (since that's what openSUSE Leap 42.3 ships with).

/mark