[ragel-users] ragel and encodings

Wincent Colaiuta win at wincent.com
Thu May 21 19:04:00 UTC 2009


On 21/5/2009, at 20:51, Wil Macaulay wrote:

> Sorry - I was inaccurate in my previous reply (should have refreshed
> my memory first by looking at
> the code).  On the Mac, the native encoding is Unicode  - that's the
> conceptual basis for the NSString class. There are convenience
> functions for accessing the underlying
> character buffer as unichar - 16 bits unsigned.   So my first step is
> to convert the raw file to an NSString
> as Unicode, then access the character buffer and send that to my
> parser.  This requires my ragel file to use:
>
> #UniChar type is 16 bits unsigned
> 	
> 	alphtype unsigned short;
>
> Keywords all fall into the standard ASCII charset - anything that is
> outside the ascii character set,
> for me, is only interesting in the context of literals (quoted strings
> and the like).  This means that I can
> write my FSM in the normal fashion.

As far as I know, the native encoding for NSString on Mac OS X is
UTF-16, which means that the approach you describe will work for most
input, but fall down for any code points that require surrogates: code
points above U+FFFF don't fit in 16 bits, so each one is encoded as two
16-bit code units, a surrogate pair.
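
To make the surrogate arithmetic concrete, here is a minimal sketch in C of
how a single code point maps to one or two 16-bit units. The function name
utf16_encode is hypothetical (it is not part of Ragel or Cocoa); the constants
come from the UTF-16 encoding scheme itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-16.
 * Writes 1 or 2 code units to out and returns how many were written,
 * or 0 if cp is not a valid scalar value. */
static size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);

    /* Reject values beyond U+10FFFF and the surrogate range itself. */
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    /* Code points up to U+FFFF fit in a single 16-bit unit. */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }

    /* Everything else is split into a high and a low surrogate,
     * each carrying 10 bits of (cp - 0x10000). */
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 + (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* low surrogate  */
    return 2;
}
```

A scanner whose alphtype is unsigned short sees those two units as two
separate input symbols, which is why a machine written purely in terms of
"one 16-bit unit = one character" goes wrong on such input.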

The approach would work fine if the input were in UCS-2 (where every
character fits in 16 bits, but which can't represent all Unicode code points).

So I guess it all depends on the kind of input the original poster is  
expecting. If it's user-supplied (untrusted input) and he wants to  
work with UTF-16 then he should probably gracefully handle surrogates,  
even if he isn't expecting them.
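
One way to handle surrogates gracefully is to check the buffer before handing
it to the parser. The following is only a sketch under that assumption: the
helper names are hypothetical, and whether to reject, replace, or pass through
a malformed unit is a decision for the application.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static int is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Hypothetical helper: scan a UTF-16 buffer of len code units.
 * Returns the index of the first unpaired surrogate, or len if every
 * high surrogate is immediately followed by a low surrogate and no
 * low surrogate appears on its own. */
static size_t utf16_first_unpaired(const uint16_t *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    size_t i = 0;
    while (i < len) {
        if (is_high_surrogate(buf[i])) {
            /* A high surrogate needs a low surrogate right after it. */
            if (i + 1 >= len || !is_low_surrogate(buf[i + 1]))
                return i;
            i += 2; /* skip the whole pair */
        } else if (is_low_surrogate(buf[i])) {
            /* A low surrogate without a preceding high one is malformed. */
            return i;
        } else {
            i += 1;
        }
    }
    return len;
}
```

If the result equals len, the buffer is well-formed UTF-16; otherwise the
caller knows exactly where the bad unit is and can report it or substitute
U+FFFD before running the state machine.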

This Wikipedia article explains all this in a lot more detail:

http://en.wikipedia.org/wiki/UTF-16

Wincent




