How can I capture a big block of data

Wed Feb 21 17:03:43 UTC 2007

On Feb 19, 9:59 pm, "Jason" <jason2... at jasonjobe.com> wrote:

> Now I'm wanting to grab a block of XML data but since the main focus
> of my scan is not XML I'm wondering if it wouldn't be easier to
> delimit any xml with some tags <xml> .... </xml>, and pass the whole
> block of data off to some other code and simply return a TK_XML token.
>
> I note in many examples, zipping through comments is pretty straight
> forward. Could / Should I do the same (or similar) thing to grab my
> XML?

My first thought is that this should be pretty easy. You need to
recognise it in three stages...

1.  The <xml> marker, allowing for whitespace etc.
2.  The content, using a regular expression that refuses to accept
'</xml>'. Ragel's strong difference operator does this: "any* --
[expression for </xml>]" matches any string that does not contain
the closing tag.
3.  The expression for </xml>.

The one possible problem is that this would stop early, at the first
'</xml>', in any XML that happened to have an 'xml' tag of its own.
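
As a rough illustration of the three stages, here is a minimal sketch
in Python rather than Ragel. The lazy ".*?" only approximates the
strong difference expression, and the names XML_BLOCK and grab_xml are
made up for the example.

```python
import re

# Stage 1: the opening marker, allowing whitespace inside the tag.
# Stage 2: the content, matched lazily so it stops at the first close tag.
# Stage 3: the closing marker.
XML_BLOCK = re.compile(r"<xml\s*>(.*?)</xml\s*>", re.DOTALL)


def grab_xml(text, pos=0):
    """Return (content, end_offset) for the first <xml> block at or
    after pos, or None if there is no complete block."""
    m = XML_BLOCK.search(text, pos)
    if m is None:
        return None
    return m.group(1), m.end()


source = "before <xml ><a>1</a></xml> after"
content, end = grab_xml(source)

# The early-stop problem: an inner 'xml' tag ends the block too soon.
nested = "<xml><xml>inner</xml>tail</xml>"
early_content, early_end = grab_xml(nested)
```

For the first input the content is '<a>1</a>' and scanning would resume
at ' after'. For the nested input the content comes back as
'<xml>inner', which is exactly the early stop described above.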

There is the possibility of recognising all start and end tags, so you
can keep track of the nesting. Regular expression parsing cannot
handle recursive nesting, so handling this sounds like a job for a
context-free parsing tool such as Yacc, Bison or Kelbt, but in this
case, Ragel can probably cheat.

The trick is to use actions to maintain a counter that keeps track of
nesting depth, and to use semantic conditions to check the count. With
some care, this could even give some confidence that the XML is well
formed.

That said, this also means that badly formed embedded XML will
probably prevent you from finding the real end of the XML.
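
The counter idea can be sketched outside Ragel too. The following is
Python, not Ragel actions and semantic conditions, and it assumes that
only the 'xml' tags need counting (not every tag). XML_TAG and
grab_nested_xml are hypothetical names.

```python
import re

# Matches an opening or closing 'xml' tag; group 1 is "/" for a close tag.
XML_TAG = re.compile(r"<(/?)xml\s*>")


def grab_nested_xml(text, pos=0):
    """Return (content, end_offset) for the balanced <xml> block that
    starts at or after pos. Return None if no block starts, or if the
    nesting depth never gets back to zero."""
    depth = 0
    content_start = None
    for m in XML_TAG.finditer(text, pos):
        if m.group(1) == "":
            # Opening tag: go one level deeper.
            depth += 1
            if depth == 1:
                content_start = m.end()
        else:
            if depth == 0:
                continue  # stray close tag before any open tag: skip it
            depth -= 1
            if depth == 0:
                return text[content_start:m.start()], m.end()
    return None


nested = "<xml><xml>inner</xml>tail</xml> rest"
result = grab_nested_xml(nested)

# Badly formed input: the inner block is closed but the outer one is not.
broken = grab_nested_xml("<xml><xml>oops</xml>")
```

With the nested input the whole of '<xml>inner</xml>tail' is returned.
With the unbalanced input the depth never returns to zero, so the
function returns None and the real end is never found.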

Different markers could be a good idea. For instance, you should never
see a string of multiple '<' characters such as "<<<<<" in well-formed
XML (outside comments and CDATA sections), so this could be exploited
to give a safe end marker. It might be
necessary to exclude strings within quotes, but that's a relatively
simple thing to handle using the strong difference again.
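
A minimal sketch of the end-marker idea, again in Python as a stand-in
for a Ragel machine. It assumes the marker is exactly "<<<<<" and that
quoted strings are the only place the marker should be ignored. TOKEN
and grab_until_marker are invented names.

```python
import re

# Either a double-quoted string, a single-quoted string, or the end marker.
# A quoted string starts before any marker inside it, so the scanner
# consumes the whole string and never sees that marker on its own.
TOKEN = re.compile(r'"[^"]*"|\'[^\']*\'|(<<<<<)')


def grab_until_marker(text, pos=0):
    """Return (content, end_offset) up to the first unquoted end marker
    at or after pos, or None if there is no such marker."""
    for m in TOKEN.finditer(text, pos):
        if m.group(1) is not None:
            return text[pos:m.start()], m.end()
    return None


text = '<a t="<<<<<">x</a><<<<< rest'
result = grab_until_marker(text)
missing = grab_until_marker('<a t="<<<<<"/>')
```

Here the marker inside the attribute value is skipped and the block
ends at the unquoted one. One caveat: a lone apostrophe or quote in
ordinary text can pair up with a later one and hide the marker, so the
quote handling needs more care than this sketch gives it.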