Matching multibyte or wide-chars

rakrok rak... at gmail.com
Thu Apr 10 17:22:40 UTC 2008


Hello,

I'm trying to tokenize multibyte strings.  In C-land, I would read in
a multibyte character, convert it to a wide character, and then test
that wide character with iswalnum or iswalpha to tokenize it
appropriately.
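
For concreteness, the C-land approach could be sketched roughly as
below.  This is only an illustration: count_alnum_tokens is a
hypothetical helper name, it merely counts runs of alphanumeric
characters rather than emitting tokens, and it assumes the caller has
already selected a suitable locale with setlocale().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: count maximal runs of alphanumeric characters
 * in a multibyte string.  Decoding uses mbrtowc() under the current
 * locale, so the caller is assumed to have called setlocale(). */
size_t count_alnum_tokens(const char *s)
{
    assert(s != NULL);

    mbstate_t st;
    memset(&st, 0, sizeof st);      /* start in the initial shift state */

    size_t left = strlen(s);
    size_t tokens = 0;
    int in_token = 0;

    while (left > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, left, &st);

        if (n == (size_t)-1 || n == (size_t)-2) {
            /* Invalid or truncated sequence: reset the state, skip one
             * byte, and treat that byte as a separator. */
            memset(&st, 0, sizeof st);
            n = 1;
            in_token = 0;
        } else if (n == 0) {
            break;                  /* decoded a NUL character */
        } else if (iswalnum((wint_t)wc)) {
            if (!in_token) {        /* first character of a new token */
                tokens++;
                in_token = 1;
            }
        } else {
            in_token = 0;           /* separator ends the current token */
        }

        s += n;
        left -= n;
    }
    return tokens;
}
```

A call such as count_alnum_tokens("foo bar42, baz") walks the string
one decoded character at a time; which non-ASCII characters count as
alphanumeric depends entirely on the active locale.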

In Ragel there is no multibyte/wide-char counterpart to the alnum/alpha
character classes, as far as I can tell.  It would have been nice to be
able to define one like so:

    walpha = /./ when { iswalpha(towide(p)); };

Here towide() would be a wrapper around mbrtowc.  Unfortunately,
semantic conditions aren't supported for the unsigned long alphtype.
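
A wrapper of that kind might look roughly like the sketch below.  The
second parameter (the number of bytes available) and the choice to
return WEOF on failure are my own assumptions, not something Ragel
provides or requires.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical wrapper: decode the multibyte character starting at p,
 * reading at most avail bytes, and return it as a wide character.
 * Returns WEOF if the bytes are invalid or incomplete in the current
 * locale.  Each call starts from the initial conversion state. */
wint_t towide(const char *p, size_t avail)
{
    assert(p != NULL);

    mbstate_t st;
    wchar_t wc;
    memset(&st, 0, sizeof st);

    size_t n = mbrtowc(&wc, p, avail, &st);
    if (n == (size_t)-1 || n == (size_t)-2)
        return WEOF;                /* invalid or truncated sequence */

    return (wint_t)wc;
}
```

Inside a scanner this would be called as something like
iswalpha(towide(p, pe - p)), assuming p and pe are char pointers to the
current position and the end of the buffer.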

So my question is: is there an [easy] way to do this?  Ideally, it
would be nice to be able to define the acceptance criteria of a
machine by means of a code block.  That way, I could use the built-in
wide-char support in the C runtime, or use ICU, or whatnot.

I can always try to explicitly list the multibyte/wide characters I'm
interested in, but that means having to implement locale-specific
code, which sounds complex to me.

Any ideas would be greatly appreciated,


