# [ragel-users] Ruby buffer code for streaming scanner

Seamus Abshere seamus at abshere.net
Mon Jun 13 15:52:30 UTC 2011

Hi,

The Ragel Guide has an excellent set of guidelines for how to "take on
some buffer management functions" when using the longest-match operator
(for scanners):

> \begin{itemize}
> \setlength{\parskip}{0pt}
> \item Read a block of input data.
> \item Run the execute code.
> \item If \verb|ts| is set, the execute code will expect the incomplete
> token to be preserved ahead of the buffer on the next invocation of the execute
> code.
> \begin{itemize}
> \item Shift the data beginning at \verb|ts| and ending at \verb|pe| to the
> beginning of the input buffer.
> \item Reset \verb|ts| to the beginning of the buffer.
> \item Shift \verb|te| by the distance from the old value of \verb|ts|
> to the new value. The \verb|te| variable may or may not be valid.  There is
> no way to know if it holds a meaningful value because it is not kept at null
> when it is not in use. It can be shifted regardless.
> \end{itemize}
> \item Read another block of data into the buffer, immediately following any
> preserved data.
> \item Run the scanner on the new data.
> \end{itemize}
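
The shift step in the inner list can be sketched in plain Ruby, independent of Ragel's generated code. This is only an illustration: `shift_buffer` is a hypothetical helper name (not part of Ragel), and `data` is assumed to be an array of byte values as produced by `String#unpack("c*")`.

```ruby
# Hypothetical helper (not part of Ragel): preserve the incomplete token
# that starts at ts and rebase ts/te so they index into the preserved prefix.
def shift_buffer(data, ts, te, pe)
  return [[], nil, te] if ts.nil?   # no token in progress, nothing to keep

  prefix = data[ts...pe]            # bytes from ts up to, but not including, pe
  new_te = te.nil? ? nil : te - ts  # te may be stale; shifting it is harmless
  [prefix, 0, new_te]               # ts now points at the start of the buffer
end

data = "xxSTART_FOO ab".unpack("c*")
prefix, ts, te = shift_buffer(data, 2, 11, data.length)
puts prefix.pack("c*")
```

The next block of input would then be appended to `prefix` before the execute code runs again.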

I believe the following is a correct implementation in Ruby (see the
#scan! method for the buffering):

> =begin
> %%{
>   machine foo_scanner;
>
>   foo_open = 'START_FOO';
>   foo_close = 'STOP_FOO';
>   foo = foo_open any* :>> foo_close;
>
>   main := |*
>     foo => { emit data[ts...te].pack('c*') };
>     any;
>   *|;
> }%%
> =end
>
> class FooScanner
>   # read stuff in 1 meg at a time
>   CHUNK_SIZE = 1_048_576
>
>
>   def initialize(target)
>     @target = target
>     %% write data;
>   end
>
>   def emit(foo_entity)
>     puts "I found a foo entity!"
>     puts foo_entity
>   end
>
>   def scan!
>     # Set pe so that ragel doesn't try to get it from data.length
>     pe = -1
>     eof = File.size(@target)
>
>     %% write init;
>
>     prefix = []
>     File.open(@target) do |f|
>       while chunk = f.read(CHUNK_SIZE)
>         # \item Read a block of input data.
>         data = prefix + chunk.unpack("c*")
>
>         # \item Run the execute code.
>         p = 0
>         pe = data.length
>         %% write exec;
>
>         # \item If \verb|ts| is set, the execute code will expect the incomplete token to be preserved ahead of the buffer on the next invocation of the execute code.
>         unless ts.nil?
>           # \begin{itemize}
>           # \item Shift the data beginning at \verb|ts| and ending at \verb|pe| to the beginning of the input buffer.
>           prefix = data[ts...pe]
>           # \item Shift \verb|te| by the distance from the old value of \verb|ts| to the new value. The \verb|te| variable may or may not be valid.  There is no way to know if it holds a meaningful value because it is not kept at null when it is not in use. It can be shifted regardless. [SWAPPED ORDER]
>           if te
>             te = te - ts
>           end
>           # \item Reset \verb|ts| to the beginning of the buffer. [SWAPPED ORDER]
>           ts = 0
>           # \end{itemize}
>         else
>           prefix = []
>         end
>         # \item Read another block of data into the buffer, immediately following any preserved data.
>         # \item Run the scanner on the new data.
>       end
>     end
>   end
> end

You can run it with

> foo_scanner = FooScanner.new 'foo.txt'
> foo_scanner.scan!
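
To exercise the buffering path, the input needs a token that straddles a chunk boundary. Here is a minimal sketch for generating such a file, assuming the 1_048_576-byte chunk size used above. The helper name `write_straddling_file`, the filler byte and the payload are all arbitrary choices for illustration.

```ruby
# Hypothetical helper: write a file in which a START_FOO ... STOP_FOO token
# begins 5 bytes before the first chunk boundary, so it spans two reads.
def write_straddling_file(path, chunk_size = 1_048_576)
  token = "START_FOO hello STOP_FOO"

  File.open(path, "wb") do |f|
    f.write("." * (chunk_size - 5))  # filler up to 5 bytes before the boundary
    f.write(token)                   # token crosses the boundary
    f.write("." * 100)               # trailing filler after the token
  end

  token
end
```

Calling `write_straddling_file("foo.txt")` before running the scanner should produce an input where the first block ends in the middle of the token.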

If that is good code, then perhaps it could be added as another example
to the Ragel website?

Thanks,
Seamus

--
Seamus Abshere
123 N Blount St Apt 403