<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN">

<html><body style='font-size: 14pt; font-family: Verdana,Geneva,sans-serif'>

<p>Hi Andrey,</p>

<p>It's because your content pattern includes the opening of an HTML tag. The content match is therefore extended over the tag instead of the machine wrapping around to the start, where a tag could begin. It works when there is a space in front because at that point only the space FSM is active, and it won't be extended, so the machine wraps around.</p>
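<p>As a rough illustration of the point above (a hand-written C loop, not Ragel output and not the attached grammar): if the content run is defined so that it always stops in front of '&lt;', it cannot be extended across the start of a tag, and the scan returns to the top of the loop where the tag pattern gets its chance. The names extract_links, MAX_LINKS and MAX_HREF are hypothetical, and the sketch assumes links always appear literally as &lt;a href="..."&gt;.</p>

```c
#include <string.h>

#define MAX_LINKS 8     /* hypothetical limits, chosen for the sketch only */
#define MAX_HREF  128

/* Hypothetical helper: copy the href value of every <a href="..."> in buf
 * into out[] and return how many were found. */
static int extract_links(const char *buf, char out[][MAX_HREF], int max_out)
{
    static const char open[] = "<a href=\"";
    const size_t open_len = sizeof(open) - 1;
    const char *p = buf;
    int n = 0;

    while (*p != '\0') {
        if (*p != '<') {
            /* Content run: ends right before '<', so it can never
             * swallow the opening of a tag. */
            p += strcspn(p, "<");
            continue;
        }
        /* Back at the start of the loop with '<' as the next character. */
        if (strncmp(p, open, open_len) == 0) {
            const char *val = p + open_len;
            const char *end = strchr(val, '"');
            if (end == NULL)
                break;                  /* unterminated attribute value */
            size_t len = (size_t)(end - val);
            if (n < max_out && len < MAX_HREF) {
                memcpy(out[n], val, len);
                out[n][len] = '\0';
                n++;
            }
            p = end + 1;                /* continue after the closing quote */
        } else {
            p++;                        /* some other '<': skip it */
        }
    }
    return n;
}
```

<p>Called on the sample input from the original message, with a buffer declared as char out[MAX_LINKS][MAX_HREF], this should pick up both links, because 'cccc' ends as content exactly where '&lt;a' begins. In a Ragel grammar the analogous change would presumably be to exclude '&lt;' from the content pattern, but that is an assumption about the attached grammar, which is not shown here.</p>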

<p>-Adrian</p>

<p> </p>

<p>On 2017-01-19 01:38, Andrey Kulikov wrote:</p>

<blockquote type="cite" style="padding-left:5px; border-left:#1010ff 2px solid; margin-left:5px">

<div dir="ltr">

<div>

<div>

<div>

<div>

<div>

<div>

<div>

<div>

<div>

<div>

<div>

<div>Hello,<br /><br /></div>

In my project I need to extract links from an HTML document.</div>

For this purpose I've prepared a Ragel HTML grammar, based primarily on this work:<br /><a href="https://github.com/brianpane/jitify-core/blob/master/src/core/jitify_html_lexer.rl">https://github.com/brianpane/jitify-core/blob/master/src/core/jitify_html_lexer.rl</a></div>

(mentioned here: <a href="http://ragel-users.complang.narkive.com/qhjr33zj/ragel-grammars-for-html-css-and-javascript">http://ragel-users.complang.narkive.com/qhjr33zj/ragel-grammars-for-html-css-and-javascript</a> )<br /><br /><br /></div>

Almost everything works well (thanks for the great tool!), except for one issue I haven't been able to overcome so far:<br /><br />If I specify this text as input:<br />bbbb &lt;a href="first_link.aspx"&gt;  cccc&lt;a href="/second_link.aspx"&gt;</div>

my parser correctly extracts the first link, but not the second one.</div>

The difference between them is that there is a space between 'bbbb' and '&lt;a', but no space between 'cccc' and '&lt;a'.</div>

<br />In general, if any text other than spaces precedes the '&lt;a' tag, the parser treats it as content and does not recognize the tag open.<br /><br /></div>

Could anyone please give me a hint on how to improve the existing grammar so that it recognizes the tag open?<br /><br /></div>

Please find attached an intentionally simplified sample grammar, meant to be built as a C program ( ngx_url_html_portion.rl ).</div>

There is also an input file, input-nbsp.html, which is expected to contain the input for the application.<br /><br /></div>

To play with it, generate the .c file from the grammar:</div>

ragel ngx_url_html_portion.rl

<div>

<div>

<div>then compile the resulting .c file and run the program.</div>

<div>The input file should be in the same directory.<br /><br /></div>

<div>I will be <span id="gmail-result_box" class="gmail-short_text"><span class="gmail-">sincerely </span></span>grateful for any clue.<br /><br />--</div>

<div>Andrey</div>

</div>

</div>

</div>

<br />

<pre>_______________________________________________

ragel mailing list

<a href="mailto:ragel@colm.net">ragel@colm.net</a>

<a href="http://www.colm.net/cgi-bin/mailman/listinfo/ragel">http://www.colm.net/cgi-bin/mailman/listinfo/ragel</a>

</pre>

</blockquote>

</body></html>