<div dir="ltr"><div><div><div><div><div><div><div><div><div><div><div><div>Hello,<br><br></div>In my project I need to extract links from an HTML document.<br></div>For this purpose I've written a Ragel HTML grammar, based primarily on this work:<br><a href="https://github.com/brianpane/jitify-core/blob/master/src/core/jitify_html_lexer.rl">https://github.com/brianpane/jitify-core/blob/master/src/core/jitify_html_lexer.rl</a><br></div>(mentioned here: <a href="http://ragel-users.complang.narkive.com/qhjr33zj/ragel-grammars-for-html-css-and-javascript">http://ragel-users.complang.narkive.com/qhjr33zj/ragel-grammars-for-html-css-and-javascript</a> )<br><br><br></div>Almost everything works well (thanks for the great tool!), except for one issue I have not been able to overcome so far:<br><br>If I give this text as input:<br>bbbb &lt;a href="first_link.aspx"&gt;  cccc&lt;a href="/second_link.aspx"&gt;<br></div>my parser correctly extracts the first link, but not the second one.<br></div>The difference between them is that there is a space between 'bbbb' and '&lt;a', but no space between 'cccc' and '&lt;a'.<br></div><br>In general, if any non-space text immediately precedes the '&lt;a' tag, the parser treats the tag as content and does not recognize the tag open.<br><br></div>Could anyone please give me a hint on how to improve the existing grammar so that it recognizes the tag open?<br><br></div>Please find attached an intentionally simplified sample of the grammar, meant to be built as a C program (ngx_url_html_portion.rl).<br></div>There is also an input file, input-nbsp.html, which the application reads as its input.<br><br></div>To try it out, generate the .c file from the grammar:<br></div>ragel ngx_url_html_portion.rl<div><div><div>then compile the resulting .c file and run the program.<br></div><div>The input file should be in the same directory.<br><br></div><div>I will be <span id="gmail-result_box" class="gmail-short_text" lang="en"><span class="gmail-">sincerely </span></span>grateful for any 
clue.<br><br>--<br></div><div>Andrey<br></div></div></div></div>