[ragel] Fixing issues with ragel HTML grammar.

Adrian Thurston thurston at colm.net
Mon Jan 23 07:28:18 UTC 2017


 

Ah, sorry Michael. I didn't read all my mail before I started
responding, so I didn't notice you had already replied.

Adrian 

On 2017-01-19 19:54, Michael Laing wrote: 

> Try changing the definition of 'content' to: 
> 
> content = ( 
> any - (space | '<') 
> )+; 
> 
> Cheers, 
> ml 
> 
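The effect of the suggested change can be illustrated with a small hand-written scanner in C. This is only a sketch, not Ragel output and not the attached grammar: the function name count_tag_opens and its flag are hypothetical, and the "before" behaviour assumes the original rule was roughly content = (any - space)+, which the thread does not show. With that assumed rule a content token swallows the '<' that follows 'cccc'. With the suggested rule the token stops at '<', so the tag open is seen.

```c
#include <ctype.h>

/* Hand-written stand-in for a longest-match scanner (hypothetical helper).
 * content_excludes_lt == 0: content is a run of non-space characters
 *   (ASSUMED form of the original rule; the thread does not show it).
 * content_excludes_lt != 0: content is a run of characters that are neither
 *   whitespace nor '<', as in the suggested definition.
 * Returns how many tag opens ('<' followed by a letter) start a token. */
int count_tag_opens(const char *p, int content_excludes_lt)
{
    int tags = 0;

    while (*p != '\0') {
        unsigned char c = (unsigned char)*p;

        if (isspace(c)) {
            p++;                                  /* skip whitespace */
        } else if (c == '<' && isalpha((unsigned char)p[1])) {
            tags++;                               /* tag open recognized */
            while (*p != '\0' && *p != '>')       /* skip the rest of the tag */
                p++;
            if (*p == '>')
                p++;
        } else {
            p++;                                  /* content: at least one char */
            while (*p != '\0' && !isspace((unsigned char)*p) &&
                   !(content_excludes_lt && *p == '<'))
                p++;
        }
    }
    return tags;
}
```

For the sample input from the original report, this sketch finds one tag open with the assumed original rule and two with the suggested one.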
>> On Jan 18, 2017, at 13:38 , Andrey Kulikov <amdeich at gmail.com> wrote: 
>> 
>> Hello,
>> 
>> In my project I need to extract links from an HTML document. For this purpose I've prepared a Ragel HTML grammar, based primarily on this work:
>> https://github.com/brianpane/jitify-core/blob/master/src/core/jitify_html_lexer.rl [1] (mentioned here: http://ragel-users.complang.narkive.com/qhjr33zj/ragel-grammars-for-html-css-and-javascript [2] )
>> 
>> Almost everything works well (thanks for the great tool!), except for one issue I have not been able to overcome so far:
>> 
>> If I give this text as input:
>> bbbb <a href="first_link.aspx"> cccc<a href="/second_link.aspx">
>> my parser correctly extracts the first link, but not the second one. The difference between them is that there is a space between 'bbbb' and '<a', but no space between 'cccc' and '<a'.
>> In general, if any text other than spaces comes directly before the '<a' tag, the parser treats it as content and does not recognize the tag open.
>> 
>> Could anyone please give a hint on how to improve the existing grammar so that it recognizes the tag open?
>> 
>> Please find attached an intentionally simplified sample with the grammar, meant to work as a C program ( ngx_url_html_portion.rl ). There is also an input file, input-nbsp.html , which is expected to contain the input for the application.
>> 
>> To play with it, generate the .c file from the grammar: ragel ngx_url_html_portion.rl
>> 
>> then compile the resulting .c file and run the program.
>> The input file should be in the same directory.
>> 
>> I will be sincerely grateful for any clue.
>> 
>> -- 
>> Andrey <input-nbsp.html><ngx_url_html_portion.rl>
>> _______________________________________________
>> ragel mailing list
>> ragel at colm.net
>> http://www.colm.net/cgi-bin/mailman/listinfo/ragel
> 
> _______________________________________________
> ragel mailing list
> ragel at colm.net
> http://www.colm.net/cgi-bin/mailman/listinfo/ragel [3]
 

Links:
------
[1] https://github.com/brianpane/jitify-core/blob/master/src/core/jitify_html_lexer.rl
[2] http://ragel-users.complang.narkive.com/qhjr33zj/ragel-grammars-for-html-css-and-javascript
[3] http://www.colm.net/cgi-bin/mailman/listinfo/ragel