[ragel] Fixing issues with ragel HTML grammar.

Michael Laing mlaing at post.harvard.edu
Thu Jan 19 12:54:15 UTC 2017

Try changing the definition of ‘content’ to:

    content = (
      any - (space | '<')
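
To see why excluding '<' from content matters, here is a minimal hand-written sketch in C. It is not Ragel output and not the attached grammar. It assumes the original rule behaved roughly like (any - space)+ and that the suggested rule is completed as a repetition such as (any - (space | '<'))+. The function count_tags and its flag are hypothetical names used only for illustration.

```c
#include <ctype.h>

/* Hypothetical helper: counts the tags recognized in `input`.
 * content_stops_at_lt == 0 models an assumed content = (any - space)+
 * content_stops_at_lt != 0 models an assumed content = (any - (space | '<'))+ */
int count_tags(const char *input, int content_stops_at_lt)
{
    int tags = 0;
    const char *p = input;

    while (*p != '\0') {
        if (isspace((unsigned char)*p)) {
            p++;                              /* whitespace separates tokens */
        } else if (*p == '<') {
            /* A token starting with '<' is treated as a tag up to '>'. */
            while (*p != '\0' && *p != '>')
                p++;
            if (*p == '>') {
                p++;
                tags++;
            }
        } else {
            /* Content token: consume greedily until the rule says stop. */
            while (*p != '\0' && !isspace((unsigned char)*p)
                   && !(content_stops_at_lt && *p == '<'))
                p++;
        }
    }
    return tags;
}
```

Called on the input line from the original message, count_tags(input, 0) lets the content run swallow "cccc<a", so only the first tag is seen, while count_tags(input, 1) stops the content run at '<' and both tags are seen.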


> On Jan 18, 2017, at 13:38 , Andrey Kulikov <amdeich at gmail.com> wrote:
> Hello,
> In my project I need to extract links from an HTML document.
> For this purpose I've prepared a Ragel HTML grammar, primarily based on this work:
> https://github.com/brianpane/jitify-core/blob/master/src/core/jitify_html_lexer.rl
> (mentioned here: http://ragel-users.complang.narkive.com/qhjr33zj/ragel-grammars-for-html-css-and-javascript )
> Almost everything works well (thanks for the great tool!), except for one issue I have not been able to overcome so far:
> If I specify this text as input:
> bbbb <a href="first_link.aspx">  cccc<a href="/second_link.aspx">
> my parser correctly extracts the first link, but not the second one.
> The difference between them is that there is a space between 'bbbb' and '<a', but no space between 'cccc' and '<a'.
> In general, if any text other than spaces directly precedes the '<a' tag, the parser treats it as content and does not recognize the tag open.
> Could anyone please give a hint on how to improve the existing grammar so that it recognizes the tag open?
> Please find attached an intentionally simplified sample grammar, meant to be built as a C program ( ngx_url_html_portion.rl ).
> There is also an input file, input-nbsp.html, which is expected to contain the input for the application.
> To try it out, generate the .c file from the grammar:
> ragel ngx_url_html_portion.rl
> then compile the resulting .c file and run the program.
> The input file should be in the same directory.
> I will be sincerely grateful for any clue.
> --
> Andrey
