Multi-char terminators

Colin Fleming colin.flem... at coreproc.com
Fri Oct 6 17:06:03 UTC 2006


Hi Adrian,

Thanks for the response! I need to think about it a bit more.
Obviously in this case it's not a huge problem, but it might be if I
move to marking strings rather than copying and a buffer boundary
happens to break up the terminator. The problem with constructing the
machine manually is that I can't really do any better than Ragel does
- if you have no look-ahead, you never know if you're on a terminator
until the end of it.
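
A minimal sketch of the delayed-buffering idea, in Python rather than
Ragel so that it can be run on its own. It is not code from this thread:
the class CDataCollector and its methods are hypothetical names, and the
terminator is assumed to be the fixed string ']]>'. The pending counter
plays the role of the start/one/two states in the hand-built machine
quoted below, and because it lives in the object it survives a buffer
boundary that splits the terminator.

```python
class CDataCollector:
    """Collect CDATA content, holding back ']' characters until they are
    known not to belong to the ']]>' terminator (hypothetical helper)."""

    def __init__(self):
        self.out = []      # characters confirmed as content
        self.pending = 0   # held-back ']' count: 0 = start, 1 = one, 2 = two
        self.done = False  # set once the full terminator has been seen

    def feed(self, chunk):
        """Consume one chunk; return the unconsumed tail once ']]>' is seen,
        or '' if the terminator has not been completed yet."""
        for i, ch in enumerate(chunk):
            if ch == ']':
                if self.pending == 2:
                    # Third ']' in a row: the oldest held ']' must be content.
                    self.out.append(']')
                else:
                    self.pending += 1
            elif ch == '>' and self.pending == 2:
                # Terminator complete; the held ']' pair is never buffered.
                self.done = True
                return chunk[i + 1:]
            else:
                # False alarm: flush the held ']' characters, then this one.
                self.out.append(']' * self.pending)
                self.pending = 0
                self.out.append(ch)
        return ''

    def content(self):
        return ''.join(self.out)


if __name__ == '__main__':
    c = CDataCollector()
    # The terminator is deliberately split across three chunks.
    for piece in ('hello]', ']', '>tail'):
        rest = c.feed(piece)
    print(repr(c.content()), repr(rest))
```

Nothing is ever removed from the buffer here; the cost is that up to two
']' characters are held back at any time, which is the same limit the
no-look-ahead machine has.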

I'll read up a bit about scanners, too; they sound interesting.

Cheers,
Colin

On 10/5/06, Adrian Thurston <thurs... at cs.queensu.ca> wrote:
>
> Hello,
>
> If you wanted to remove buffered items when the termination sequence was
> variable length, you might be able to record the length of the buffer when
> you start the termination sequence. This might not always work properly though.
>
> But if you want to avoid undoing work you've done, then you need to delay
> buffering. At the moment I can't think of a general way to express the
> delayed buffering of ']' using pure regular languages with embedded actions.
>
> The local error action embedding operators are related to this problem, but
> not a good fit in this case.
>
> So, some options:
>
> 1. You could build a machine manually. Basically draw out the state machine
> you want and use the , and -> operators to construct it. Note that you can
> still embed actions anywhere you want. In places where you go back to start,
> buffer the necessary number of ']' characters.
>
> main :=
>      start: (
>          (any-']') -> start |
>          ']' -> one
>      ),
>      one: (
>          ']' -> two |
>          [^\]] -> start
>      ),
>      two: (
>          '>' -> final |
>          ']' -> two |
>          [^>\]] -> start
>      );
>
>
> 2. Use a mini scanner. This is the kind of thing a scanner does really well,
> but it does not give you a machine definition you can embed elsewhere. You
> have to call it. This gives me an idea though. Some scanners can be
> optimized into a pure state machine with no backtracking. Perhaps we can
> allow these to be embedded elsewhere.
>
> 3. Take ']' out of CData and add in some patterns like ']' [^\]] that
> accept only strings that look like they could start a termination sequence
> but never go all the way. When they fail, they can write out the necessary
> number of ']' symbols.
>
> Hope this helps.
>
> -Adrian
>
> Colin Fleming wrote:
> > Hi all,
> >
> > As part of parsing XML, I have the following rules for CData sections:
> >
> > CDStart = '<![CDATA[';
> >
> > CDEnd = ']]>';
> >
> > CData = (Char* -- CDEnd) $each_char;
> >
> > CDSect = CDStart CData CDEnd;
> >
> > where each_char is a simple action that stores fc in a buffer. The
> > problem is that the last two characters in the buffer are always ]],
> > because the machine doesn't know until it encounters the > if it
> > should exit the CData machine. I work around this with the following:
> >
> > CDSect = CDStart CData CDEnd %trim_content;
> >
> > where trim_content strips the last two characters of the buffer, but
> > it's a bit ugly. It also wouldn't work if the terminator were some
> > variable-length production. Is there any general way to handle this
> > case?
> >
> > Cheers,
> > Colin



More information about the ragel-users mailing list