[ragel-users] Priority issues when doing a street name parser

William Lachance wrlach at gmail.com
Thu Sep 24 01:36:12 UTC 2009


Hi,

I'm trying to construct a parser for street addresses using Ragel.
That is to say, a machine that will take a free form address like
"5553 Barrington Street NW" and parse out the individual components
(street number, name, suffix, direction). Everything was going
swimmingly until I started to try to add support for street names with
multiple tokens in them (e.g. "Bella Vista Avenue NW").

Right now my main machine looks like this:

streetNumber = (digit+ >getStartStr %endNumber);
streetName = (alpha+ (space+ alpha+)*) >getStartStr %endName;
suffixFull = space+ suffix;
dirFull = space+ direction;
main := (streetNumber alpha? space+)? streetName suffixFull? dirFull?;

The suffix and dir expressions are really long and boring
concatenations like this:

directionWest = ("w"i|"west"i) >getStartStr %endDirWest;

Anyway, the problem with this simple regular expression is that it
doesn't give up on parsing the streetName when it begins parsing the
direction and suffix. So in the above example, it will correctly parse
"Bella Vista", but then overwrite it with "Avenue", and later "NW". I
thought that perhaps adding a few ":>>" operators (to stop processing
the street name when suffixes and directions appear) would help:

main := (streetNumber alpha? space+)? streetName :>> suffixFull? :>> dirFull? 0;

Unfortunately, that seems to have the side effect of terminating
parsing of the street name prematurely (bringing us back to square
one).

It _seems_ like what I'm doing should be straightforward. Basically
the rule should be: "keep on parsing the street until you find a token
that unambiguously matches a suffix and/or direction; at that point,
stop, only keeping the previous tokens". Surely there's a way of
expressing that in Ragel?
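
To pin down the output I'm after, here is a minimal Python sketch of that
rule. It is not Ragel and not a proposed solution: the function name
parse_address and the tiny SUFFIXES and DIRECTIONS sets are made up for
illustration (the real lists are much longer), and it cheats by peeling
tokens off the end of the input, which a left-to-right state machine
cannot do directly.

```python
# Hypothetical lookup tables; the real suffix and direction lists are much longer.
SUFFIXES = {"street", "st", "avenue", "ave", "road", "rd"}
DIRECTIONS = {"n", "s", "e", "w", "ne", "nw", "se", "sw",
              "north", "south", "east", "west"}


def parse_address(text):
    """Split a free-form address into number, name, suffix and direction."""
    tokens = text.split()
    result = {"number": None, "name": None, "suffix": None, "direction": None}

    # Optional leading street number, e.g. "5553" or "5553A".
    if tokens and tokens[0][0].isdigit():
        result["number"] = tokens.pop(0)

    # Peel the optional direction, then the optional suffix, off the end.
    # Always leave at least one token behind for the street name.
    if len(tokens) > 1 and tokens[-1].lower() in DIRECTIONS:
        result["direction"] = tokens.pop()
    if len(tokens) > 1 and tokens[-1].lower() in SUFFIXES:
        result["suffix"] = tokens.pop()

    # Whatever remains is the (possibly multi-token) street name.
    result["name"] = " ".join(tokens) or None
    return result
```

With these assumed tables, "Bella Vista Avenue NW" should come out with
the name "Bella Vista", the suffix "Avenue" and the direction "NW", and
that is the result I'd like the Ragel machine to produce.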

-- 
William Lachance
wrlach at gmail.com



