[ragel-users] Re: 'string' ranges

Paul r.lp... at gmail.com
Sat Apr 7 08:54:09 UTC 2007


> would you mind sending a message to the list to say how it went?
Thanks Adrian, yeah I will remember to do that when I get to a good solution.

> # 0x0A07-0x0D40
> r2 =
> 	0x0A ( 0x07 .. 0xFF ) |
> 	( 0x0B | 0x0C ) any |
> 	0x0D ( 0x00 .. 0x40 );
For now, I will attempt to run with this...
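
For reference, the split quoted above can be produced mechanically. The
following is only a minimal sketch in Python (the names split_range and
to_ragel are made up for illustration, and it assumes both endpoints have
the same number of bytes and that every byte may range over 0x00..0xFF):

```python
def split_range(lo, hi):
    """Split the lexicographic range lo..hi (equal-length byte sequences)
    into alternatives; each alternative is a list of per-byte (low, high)."""
    lo, hi = tuple(lo), tuple(hi)
    assert len(lo) == len(hi) and lo <= hi
    n = len(lo)
    if n == 1:
        return [[(lo[0], hi[0])]]
    if lo[0] == hi[0]:
        # Same leading byte: fix it and split the remaining bytes.
        return [[(lo[0], lo[0])] + rest for rest in split_range(lo[1:], hi[1:])]
    tail_min, tail_max = (0x00,) * (n - 1), (0xFF,) * (n - 1)
    first_lo, first_hi = lo[0], hi[0]
    low_edge, high_edge = [], []
    if lo[1:] != tail_min:
        # Partial block under the lowest leading byte.
        low_edge = [[(lo[0], lo[0])] + rest
                    for rest in split_range(lo[1:], tail_max)]
        first_lo += 1
    if hi[1:] != tail_max:
        # Partial block under the highest leading byte.
        high_edge = [[(hi[0], hi[0])] + rest
                     for rest in split_range(tail_min, hi[1:])]
        first_hi -= 1
    middle = []
    if first_lo <= first_hi:
        # Leading bytes whose whole block is inside the range.
        middle = [[(first_lo, first_hi)] + [(0x00, 0xFF)] * (n - 1)]
    return low_edge + middle + high_edge


def to_ragel(alts):
    """Render the alternatives as a Ragel-style union of concatenations."""
    def item(lo, hi):
        if lo == hi:
            return f"0x{lo:02X}"
        if (lo, hi) == (0x00, 0xFF):
            return "any"
        return f"( 0x{lo:02X} .. 0x{hi:02X} )"
    return " |\n".join(" ".join(item(lo, hi) for lo, hi in alt) for alt in alts)


print(to_ragel(split_range((0x0A, 0x07), (0x0D, 0x40))))
```

For 0x0A07-0x0D40 this yields the same three alternatives as r2, with the
middle one written as ( 0x0B .. 0x0C ) any rather than ( 0x0B | 0x0C ) any.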

I have a script that takes my Unicode ranges, converts them to UTF-8 byte
ranges, and then runs those through a hackish set of regular expressions
to get something like

(
	(0xE4 0xB8 (0x80 .. 0xFF)) |
	(0xE4 (0xB9 .. 0xFF) any) |
	((0xE5 .. 0xE8) any{2}) |
	(0xE9 (((0x00 .. 0xBD) any) |
	(0xBE (0x00 .. 0xA5))))
)

from the Unicode range [0x4E00-0x9FA5]
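
The script itself is not shown, so the following is only a rough sketch of
one way to do such a conversion in Python. The names split_bytes,
utf8_ranges and to_ragel are hypothetical. It assumes the endpoints are not
surrogates (0xD800-0xDFFF), and it does not exclude surrogates from a range
that spans them. Unlike the expression above, it restricts continuation
bytes to 0x80..0xBF, so it only accepts well-formed sequences:

```python
CONT_LO, CONT_HI = 0x80, 0xBF  # valid UTF-8 continuation bytes


def split_bytes(lo, hi):
    """lo, hi: UTF-8 encodings of equal length. Returns alternatives,
    each a list of per-byte (low, high) intervals."""
    n = len(lo)
    if n == 1:
        return [[(lo[0], hi[0])]]
    if lo[0] == hi[0]:
        return [[(lo[0], lo[0])] + r for r in split_bytes(lo[1:], hi[1:])]
    tail_min = bytes([CONT_LO]) * (n - 1)
    tail_max = bytes([CONT_HI]) * (n - 1)
    first_lo, first_hi = lo[0], hi[0]
    low_edge, high_edge = [], []
    if lo[1:] != tail_min:
        # Partial block under the lowest leading byte.
        low_edge = [[(lo[0], lo[0])] + r for r in split_bytes(lo[1:], tail_max)]
        first_lo += 1
    if hi[1:] != tail_max:
        # Partial block under the highest leading byte.
        high_edge = [[(hi[0], hi[0])] + r for r in split_bytes(tail_min, hi[1:])]
        first_hi -= 1
    middle = []
    if first_lo <= first_hi:
        # Leading bytes followed by any valid continuation bytes.
        middle = [[(first_lo, first_hi)] + [(CONT_LO, CONT_HI)] * (n - 1)]
    return low_edge + middle + high_edge


def utf8_ranges(cp_lo, cp_hi):
    """Split a code point range into UTF-8 byte-range alternatives,
    handling each encoded length (1 to 4 bytes) separately."""
    out = []
    for top in (0x7F, 0x7FF, 0xFFFF, 0x10FFFF):
        if cp_lo > cp_hi:
            break
        if cp_lo <= top:
            hi = min(cp_hi, top)
            out += split_bytes(chr(cp_lo).encode("utf-8"),
                               chr(hi).encode("utf-8"))
            cp_lo = hi + 1
    return out


def to_ragel(alts):
    """Render the alternatives as a parenthesised Ragel-style union."""
    def item(lo, hi):
        return f"0x{lo:02X}" if lo == hi else f"(0x{lo:02X} .. 0x{hi:02X})"
    body = " |\n".join(
        "\t(" + " ".join(item(lo, hi) for lo, hi in alt) + ")" for alt in alts)
    return "(\n" + body + "\n)"


print(to_ragel(utf8_ranges(0x4E00, 0x9FA5)))
```

For [0x4E00-0x9FA5] this gives four alternatives: 0xE4 with second byte
0xB8..0xBF, leading bytes 0xE5..0xE8 with any two continuation bytes,
0xE9 with second byte 0x80..0xBD, and 0xE9 0xBE with last byte 0x80..0xA5.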

> You're probably aware of this but I'll mention it just to put it out
> there ... [snip]
Yeah, although the majority of my transitions are on ASCII characters, and
I only want to handle proper UTF-8 strings instead of (any -- ascii)* to be
ultra neurotic in a few cases. I want to see what happens to the number of
states and the average performance before moving to another UTF encoding or
abandoning the extra correctness.

Thanks again,

 - Paul


