[ragel] Fixing issues with ragel HTML grammar.

Adrian Thurston thurston at colm.net
Mon Jan 23 07:28:18 UTC 2017


Ah sorry Michael, I didn't look at all my mail before I started
responding and so I didn't notice you already responded. 


On 2017-01-19 19:54, Michael Laing wrote: 

> Try changing the definition of 'content' to:
>
>     content = (
>         any - (space | '<')
>     )+;
>
> Cheers, 
> ml 
>> On Jan 18, 2017, at 13:38 , Andrey Kulikov <amdeich at gmail.com> wrote: 
>> Hello,
>> In my project I need to extract links from HTML documents. For this purpose I've prepared a Ragel HTML grammar, based primarily on this work:
>> https://github.com/brianpane/jitify-core/blob/master/src/core/jitify_html_lexer.rl [1] (mentioned here: http://ragel-users.complang.narkive.com/qhjr33zj/ragel-grammars-for-html-css-and-javascript [2] )
>> Almost everything works well (thanks for the great tool!), except for one issue I have not been able to overcome so far.
>> If I give it this text as input:
>>     bbbb <a href="first_link.aspx"> cccc<a href="/second_link.aspx">
>> my parser correctly extracts the first link, but not the second one. The difference between them is that there is a space between 'bbbb' and '<a', but no space between 'cccc' and '<a'.
>> In general, if any text other than spaces comes directly before the '<a' tag, the parser treats the tag as part of the content and does not recognize the tag open.
>> Could anyone please give a hint on how to improve the existing grammar so that it recognizes the tag open?
>> Please find attached an intentionally simplified sample with the grammar, meant to be built as a C program (ngx_url_html_portion.rl). There is also an input file, input-nbsp.html, which is expected to contain the input for the application.
>> To try it, generate the .c file from the grammar:
>>     ragel ngx_url_html_portion.rl
>> then compile the resulting .c file and run the program.
>> The input file should be in the same directory.
>> I will be sincerely grateful for any clue.
>> -- 
>> Andrey
>> <input-nbsp.html><ngx_url_html_portion.rl>
>> _______________________________________________
>> ragel mailing list
>> ragel at colm.net
>> http://www.colm.net/cgi-bin/mailman/listinfo/ragel
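
The effect of the suggested change can be illustrated outside Ragel. The sketch below is a Python regex analogy, not the attached grammar: the `TAG`, `CONTENT_GREEDY` and `CONTENT_FIXED` patterns and both helper functions are hypothetical stand-ins. It assumes a longest-match tokenizer and assumes the original 'content' rule excluded only whitespace. Under those assumptions, a content rule that may consume '<' swallows "cccc<a" as one token, while excluding '<' lets the tag rule start at the right place.

```python
import re

# Hypothetical stand-ins for the grammar's rules (assumptions, not the real .rl file).
TAG = r"<[^>]*>"
CONTENT_GREEDY = r"[^\s]+"   # assumed original: content excludes only whitespace
CONTENT_FIXED = r"[^\s<]+"   # analogue of: content = ( any - (space | '<') )+;

def tokenize(text, content_pattern):
    """Split text into (kind, lexeme) pairs, taking the longest match at each position."""
    rules = [
        ("tag", re.compile(TAG)),
        ("content", re.compile(content_pattern)),
        ("space", re.compile(r"\s+")),
    ]
    tokens = []
    pos = 0
    while pos < len(text):
        best_kind, best_match = None, None
        for kind, rx in rules:
            m = rx.match(text, pos)
            if m and (best_match is None or m.end() > best_match.end()):
                best_kind, best_match = kind, m
        if best_match is None:
            pos += 1  # skip a character no rule accepts (e.g. a stray '<')
            continue
        tokens.append((best_kind, best_match.group(0)))
        pos = best_match.end()
    return tokens

def extract_links(text, content_pattern):
    """Return the href values found in tokens recognized as tags."""
    links = []
    for kind, lexeme in tokenize(text, content_pattern):
        if kind == "tag":
            m = re.search(r'href="([^"]*)"', lexeme)
            if m:
                links.append(m.group(1))
    return links

SAMPLE = 'bbbb <a href="first_link.aspx"> cccc<a href="/second_link.aspx">'

if __name__ == "__main__":
    print(extract_links(SAMPLE, CONTENT_GREEDY))
    print(extract_links(SAMPLE, CONTENT_FIXED))
```

With the whitespace-only exclusion, only the first link is reported for the sample input; with '<' also excluded, both links are reported.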