[ragel] Fixing issues with ragel HTML grammar.

Andrey Kulikov amdeich at gmail.com
Wed Jan 18 18:38:17 UTC 2017


In my project I need to extract links from an HTML document.
For this purpose I've prepared a Ragel HTML grammar, primarily based on this
(mentioned here:

Almost everything works well (thanks for the great tool!), except for one
issue I haven't been able to overcome so far:

If I specify this text as input:
bbbb <a href="first_link.aspx">  cccc<a href="/second_link.aspx">
my parser correctly extracts the first link, but not the second one.
The difference between them is that there is a space between 'bbbb' and
'<a', but no space between 'cccc' and '<a'.

In general, if any text other than whitespace immediately precedes the '<a'
tag, the parser treats the tag as part of the content and does not recognize
the tag open.
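
To make the expected behaviour concrete, here is a minimal hand-written
sketch in C. It is not the Ragel grammar from the attachment and not a
proposed fix for it; the names extract_hrefs and MAX_LINK_LEN are
hypothetical and exist only for this illustration. It assumes lowercase
'a' and 'href', double-quoted attribute values, and does a naive substring
search for href=" inside the tag. The point it shows is that the content
state must stop at every '<', so text directly before a tag cannot swallow
the tag open.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_LINK_LEN 256  /* hypothetical limit, for illustration only */

/* Hypothetical helper: copies the href values of <a ...> tags found in
 * html into out (at most max_links entries) and returns how many were
 * stored. Over-long values are truncated to MAX_LINK_LEN - 1 bytes. */
static int extract_hrefs(const char *html, char out[][MAX_LINK_LEN],
                         int max_links)
{
    assert(html != NULL && out != NULL);

    int count = 0;
    const char *p = html;

    while (*p != '\0' && count < max_links) {
        /* Content state: consume everything up to the next '<'.
         * Whitespace before the tag is not required. */
        if (*p != '<') {
            p++;
            continue;
        }

        /* Tag-open state: only "<a" followed by whitespace is of interest. */
        if (p[1] != 'a' || (p[2] != ' ' && p[2] != '\t' && p[2] != '\n')) {
            p++;
            continue;
        }

        /* Find the end of this tag; give up if it is unterminated. */
        const char *end = strchr(p, '>');
        if (end == NULL)
            break;

        /* Look for href=" inside the tag and copy the quoted value. */
        const char *href = strstr(p, "href=\"");
        if (href != NULL && href < end) {
            href += 6;  /* skip past href=" */
            const char *close = strchr(href, '"');
            if (close != NULL && close < end) {
                size_t len = (size_t)(close - href);
                if (len >= MAX_LINK_LEN)
                    len = MAX_LINK_LEN - 1;
                memcpy(out[count], href, len);
                out[count][len] = '\0';
                count++;
            }
        }

        p = end + 1;  /* back to the content state after '>' */
    }

    return count;
}
```

Called on the sample input above with room for four links, this sketch
stores both first_link.aspx and /second_link.aspx, which is the result I
would like to get from the Ragel parser as well.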

Could anyone please give a hint on how to improve the existing grammar so
that it recognizes the tag open?

Please find attached an intentionally simplified sample with the grammar,
intended to be built as a C program (ngx_url_html_portion.rl).
There is also an input file, input-nbsp.html, which is expected to contain
the input for the application.

To play with it, generate the .c file from the grammar:
ragel ngx_url_html_portion.rl
then compile the resulting .c file and run the program.
The input file should be in the same directory.

I will be sincerely grateful for any clue.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: ngx_url_html_portion.rl
Type: application/octet-stream
Size: 5384 bytes
Desc: not available
URL: <http://www.colm.net/pipermail/ragel-users/attachments/20170118/a0f0792d/attachment-0002.obj>
