[ragel-users] ragel and encodings

Wil Macaulay wil.macaulay at gmail.com
Thu May 21 18:51:31 UTC 2009


Sorry - I was inaccurate in my previous reply (I should have refreshed
my memory first by looking at the code). On the Mac, the native string
encoding is Unicode - that's the conceptual basis for the NSString
class. There are convenience functions for accessing the underlying
character buffer as unichar (16 bits, unsigned). So my first step is
to convert the raw file to an NSString as Unicode, then access the
character buffer and send that to my parser. This requires my ragel
file to use:

#UniChar type is 16 bits unsigned
	
	alphtype unsigned short;

Keywords all fall into the standard ASCII character set. For me,
anything outside ASCII is only interesting in the context of literals
(quoted strings and the like). This means that I can write my FSM in
the normal fashion.
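
To make that concrete, here is a rough sketch in plain C. It is not
generated Ragel code and not my actual parser, just a hand-written
stand-in for what a machine over 16-bit code units does. The names
unit_t, is_keyword and scan_units are made up for the illustration,
and the literal syntax (double quotes, no escapes) is an assumption.

```c
#include <stddef.h>

/* Assumed name: one 16-bit code unit, matching "alphtype unsigned short". */
typedef unsigned short unit_t;

/* Returns 1 if the len code units at p spell exactly the ASCII keyword kw.
 * Each ASCII char widens to a single code unit, so a plain
 * element-by-element comparison is enough. */
int is_keyword(const unit_t *p, size_t len, const char *kw)
{
    size_t i = 0;
    for (; kw[i] != '\0'; i++) {
        if (i >= len || p[i] != (unit_t)(unsigned char)kw[i])
            return 0;
    }
    return i == len;
}

/* Scans a buffer of code units. Outside double-quoted literals only
 * ASCII (<= 0x7F) is accepted; inside a literal any code unit is
 * accepted as content. Returns the number of complete literals, or -1
 * if a non-ASCII unit appears outside a literal or a literal is left
 * unterminated. Escape sequences are deliberately not handled. */
int scan_units(const unit_t *p, size_t len)
{
    int in_literal = 0;
    int literals = 0;
    for (size_t i = 0; i < len; i++) {
        unit_t c = p[i];
        if (in_literal) {
            if (c == '"') {        /* closing quote ends the literal */
                in_literal = 0;
                literals++;
            }
        } else if (c == '"') {     /* opening quote starts a literal */
            in_literal = 1;
        } else if (c > 0x7F) {     /* non-ASCII outside a literal */
            return -1;
        }
    }
    return in_literal ? -1 : literals;
}
```

The point is only that the grammar itself stays ASCII; the wider
alphabet matters solely for what is allowed between the quotes.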

Hope this helps

wil



On Thu, May 21, 2009 at 1:48 PM, Robert Lemmen <robertle at semistable.com> wrote:
> On Thu, May 21, 2009 at 11:34:35AM -0400, Wil Macaulay wrote:
>> Depends on your platform, but my approach to this problem (on the Mac)
>> was to detect
>> the encoding, and convert to UTF-8 before parsing. I also converted
>> line-endings (\r\n -> \n)
>> and ensured a newline at the end of the data at the same time.
>
> how do you handle utf-8 in your ragel code? do you use a single-byte
> alphtype and then handle the utf-8 sequences manually?
>
> cu  robert
>
> --
> Robert Lemmen                               http://www.semistable.com



