[ragel] Fixing issues with ragel HTML grammar.

Adrian Thurston thurston at colm.net
Mon Jan 23 07:26:21 UTC 2017


 

Hi Andrey, 

It's because your content pattern includes the '<' that opens an HTML
tag. So content is extended, rather than the machine wrapping around to
start a tag. It works when there is a space in front because at that
point only the space FSM is active, and it won't be extended by '<'. The
machine wraps around.
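
To illustrate the effect, here is a minimal hand-written sketch in C. It
is not Ragel output and not the attached grammar. It assumes three token
kinds (space, tag, content) chosen by longest match, and that content is
"any run of non-space characters". The names match_space, match_tag,
match_content, count_tags and the allow_lt flag are hypothetical and exist
only for this illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Length of a whitespace run starting at s (0 if none). */
static size_t match_space(const char *s) {
    size_t n = 0;
    while (s[n] != '\0' && isspace((unsigned char)s[n])) n++;
    return n;
}

/* Length of a "<...>" tag starting at s (0 if s does not start a tag). */
static size_t match_tag(const char *s) {
    if (s[0] != '<') return 0;
    const char *end = strchr(s, '>');
    return end ? (size_t)(end - s) + 1 : 0;
}

/* Length of a content run starting at s. Content never contains
 * whitespace. allow_lt selects whether it may also contain '<'
 * (the assumed problematic definition) or must stop before it. */
static size_t match_content(const char *s, int allow_lt) {
    size_t n = 0;
    while (s[n] != '\0' && !isspace((unsigned char)s[n]) &&
           (allow_lt || s[n] != '<')) n++;
    return n;
}

/* Longest-match scan over input; returns how many tag tokens were seen. */
static int count_tags(const char *input, int allow_lt) {
    assert(input != NULL);
    int tags = 0;
    const char *p = input;
    while (*p != '\0') {
        size_t sp = match_space(p);
        size_t tg = match_tag(p);
        size_t ct = match_content(p, allow_lt);
        size_t len;
        if (tg > 0 && tg >= sp && tg >= ct) {
            tags++;          /* the tag is the longest match here */
            len = tg;
        } else if (sp >= ct) {
            len = sp;
        } else {
            len = ct;        /* content swallows everything up to a space */
        }
        if (len == 0) len = 1; /* nothing matched: skip one character */
        p += len;
    }
    return tags;
}
```

On the input from the original message, count_tags(input, 1) finds one
tag, because "cccc<a" is consumed as a single content token, while
count_tags(input, 0) finds both tags. In a Ragel grammar the analogous
change would be to exclude '<' from the content definition, though the
exact edit depends on how the attached grammar is written.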

-Adrian 

On 2017-01-19 01:38, Andrey Kulikov wrote: 

> Hello,
> 
> In my project I need to extract links from an HTML document. For this purpose I've prepared a Ragel HTML grammar, primarily based on this work:
> https://github.com/brianpane/jitify-core/blob/master/src/core/jitify_html_lexer.rl [2] (mentioned here: http://ragel-users.complang.narkive.com/qhjr33zj/ragel-grammars-for-html-css-and-javascript [3] )
> 
> Almost everything works well (thanks for the great tool!), except for one issue I have not been able to overcome so far:
> 
> If I specify this text as input:
> bbbb <a href="first_link.aspx"> cccc<a href="/second_link.aspx">
> my parser correctly extracts the first link, but not the second one. The difference between them is that there is a space between 'bbbb' and '<a', but no space between 'cccc' and '<a'.
> In general, if any text other than spaces comes directly before an '<a' tag, the parser treats the tag as content and does not recognize the tag open.
> 
> Could anyone please give a hint on how to improve the existing grammar so that it recognizes the tag open?
> 
> Please find attached an intentionally simplified sample grammar, meant to be built as a C program (ngx_url_html_portion.rl). There is also an input file, input-nbsp.html, which is expected to contain the input for the application.
> 
> To try it, generate the .c file from the grammar: ragel ngx_url_html_portion.rl
> 
> then compile the resulting .c file and run the program.
> The input file should be in the same directory.
> 
> I would be sincerely grateful for any clue.
> 
> -- 
> Andrey 
> 
> _______________________________________________
> ragel mailing list
> ragel at colm.net
> http://www.colm.net/cgi-bin/mailman/listinfo/ragel [1]
 

Links:
------
[1] http://www.colm.net/cgi-bin/mailman/listinfo/ragel
[2] https://github.com/brianpane/jitify-core/blob/master/src/core/jitify_html_lexer.rl
[3] http://ragel-users.complang.narkive.com/qhjr33zj/ragel-grammars-for-html-css-and-javascript