[ragel-users] Priority issues when doing a street name parser

Sun Sep 27 15:44:52 UTC 2009

If you want the shortest match, you'll have to program that manually: 
build a union of the patterns and use actions to record the shortest 
match, then pull that match from the head of the input stream.
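
The idea of recording the shortest match by hand can be sketched outside
Ragel. The Python below is only an illustration under assumptions: the
tiny SUFFIXES and DIRECTIONS vocabularies and the helpers tail_matches
and parse_address are hypothetical names, not part of Ragel or of the
grammar quoted further down. It tries the shortest street name first and
keeps the first split whose remainder is exactly an optional suffix
followed by an optional direction.

```python
import re

# Hypothetical, deliberately tiny vocabularies; a real parser needs far more.
SUFFIXES = {"street", "st", "avenue", "ave", "road", "rd"}
DIRECTIONS = {"n", "s", "e", "w", "ne", "nw", "se", "sw",
              "north", "south", "east", "west"}


def tail_matches(tokens):
    """Return (suffix, direction) if tokens are exactly `suffix? direction?`,
    otherwise None."""
    rest = list(tokens)
    suffix = rest.pop(0) if rest and rest[0].lower() in SUFFIXES else None
    direction = rest.pop(0) if rest and rest[0].lower() in DIRECTIONS else None
    # Any leftover token means the tail is not a pure suffix/direction tail.
    return (suffix, direction) if not rest else None


def parse_address(text):
    """Split a free-form address into number, name, suffix and direction."""
    tokens = text.split()
    number = None
    # Optional leading street number, e.g. "5553" or "5553A".
    if tokens and re.fullmatch(r"\d+[A-Za-z]?", tokens[0]):
        number = tokens.pop(0)
    # Try the shortest street name first and grow it one token at a time.
    for k in range(1, len(tokens) + 1):
        tail = tail_matches(tokens[k:])
        if tail is not None:
            suffix, direction = tail
            return {
                "number": number,
                "name": " ".join(tokens[:k]),
                "suffix": suffix,
                "direction": direction,
            }
    return None
```

With these assumed vocabularies, parse_address("Bella Vista Avenue NW")
keeps "Bella Vista" as the name, because the one-token name "Bella"
leaves "Vista", which is neither a suffix nor a direction.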

Perl regexes are quite different from those of Ragel. Perl has a much 
more sophisticated runtime engine that supports many extensions to 
regexes. The Ragel runtime engine is much simpler, which allows directly 
executable state machines (using the -G2 option).
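
To make the contrast concrete, here is what a Perl-style non-greedy
quantifier does in a backtracking engine. This is a sketch using
Python's re module rather than Perl or Boost; the SUFFIX and DIRECTION
alternations and the split_street helper are hypothetical and far
smaller than a real vocabulary. The engine first tries the shortest
name and backtracks to extend it only when the rest of the pattern
fails, which is runtime behaviour a Ragel state machine does not have.

```python
import re

# Hypothetical, deliberately short alternations for illustration only.
SUFFIX = r"(?:street|avenue|road)"
DIRECTION = r"(?:nw|ne|sw|se|n|s|e|w)"

# The `*?` makes the repeated "space + word" group non-greedy, so the
# name stays as short as the rest of the pattern allows.
PATTERN = re.compile(
    rf"(?P<name>[a-z]+(?:\s+[a-z]+)*?)"
    rf"(?:\s+(?P<suffix>{SUFFIX}))?"
    rf"(?:\s+(?P<direction>{DIRECTION}))?",
    re.IGNORECASE,
)


def split_street(text):
    """Return the named groups as a dict, or None if the text does not match."""
    m = PATTERN.fullmatch(text)
    return m.groupdict() if m else None
```

For "Bella Vista Avenue NW" the name group ends up as "Bella Vista":
the one-word attempt fails because "Vista" is neither a suffix nor a
direction, so the engine backtracks and extends the name by one word.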

-Adrian

William Lachance wrote:
> (sorry about the duplicated mail-- stupid gmail sent my message before
> it was ready) :)
> 
> Hi Adrian,
> 
> Thanks for the quick response. Trying to unpack what you're saying--
> do you mean I should try to define a scanner (as defined in section
> 6.3 of the manual) which tries the various possibilities for street
> names (in order from most preferred to least)?
> 
> So one might have
> 
> main := |*
>   streetWithSuffixAndDirection;
>   streetWithDirection;
>   streetWithSuffix;
>   street;
> *|;
> 
> ?
> 
> I was looking a little bit more at regular expressions, and it seems
> like Perl-compatible regexes have some special options which allow you
> to define how matches are supposed to occur. For example:
> 
> http://www.boost.org/doc/libs/1_40_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html
> 
> "*? Matches the previous atom zero or more times, while consuming as
> little input as possible." seems like exactly what I need (a quick
> test indicates it gives the desired behaviour). Would it not be
> possible for ragel to do this sort of thing?
> 
> Will
> 
> 2009/9/23 Adrian Thurston <thurston at complang.org>:
>> Hi William,
>>
>> I think what you need is a traditional lexer. See section 6.3 of the manual.
>>
>> -Adrian
>>
>> William Lachance wrote:
>>> Hi,
>>>
>>> I'm trying to construct a parser for street addresses using Ragel.
>>> That is to say, a machine that will take a free form address like
>>> "5553 Barrington Street NW" and parse out the individual components
>>> (street number, name, suffix, direction). Everything was going
>>> swimmingly until I started to try to add support for street names with
>>> multiple tokens in them (e.g. "Bella Vista Avenue NW")
>>>
>>> Right now my main machine looks like this:
>>>
>>> streetNumber = (digit+ >getStartStr %endNumber);
>>> streetName = (alpha+ (space+ alpha+)*) >getStartStr %endName;
>>> suffixFull = space+ suffix;
>>> dirFull = space+ direction;
>>> main := (streetNumber alpha? space+)? streetName suffixFull? dirFull?;
>>>
>>> The suffix and dir expressions are really long and boring
>>> concatenations like this:
>>>
>>> directionWest = ("w"i|"west"i) >getStartStr %endDirWest;
>>>
>>> Anyway, the problem with this simple regular expression is that it
>>> doesn't give up on parsing the streetName when it begins parsing the
>>> direction and suffix. So in the above example, it will correctly parse
>>> "Bella Vista", but then overwrite it with "Avenue", and later "NW". I
>>> thought that perhaps adding a few ":>>"'s (to stop the processing of
>>> the streetname when suffixes and directions appear) would help:
>>>
>>> main := (streetNumber alpha? space+)? streetName :>> suffixFull? :>> dirFull? 0;
>>>
>>> Unfortunately, that seems to have the side effect of terminating
>>> parsing of the street name prematurely (bringing us back to square
>>> one).
>>>
>>> It _seems_ like what I'm doing should be straightforward. Basically
>>> the rule should be: "keep on parsing the street until you find a token
>>> that unambiguously matches a suffix and/or direction; at that point,
>>> stop, only keeping the previous tokens". Surely there's a way of
>>> expressing that in Ragel?
>>>