[ragel-users] ragel hex codes on x86_64

_why why at whytheluckystiff.net
Fri Jan 16 19:03:22 UTC 2009


Hello, Adrian and fellow accomplices.

I've been doing some UTF-8 scanning with Ragel, using hex codes (as
previously recommended on this list). This works fine if I generate
the state machine with Ragel on a 32-bit architecture. I can then
build the generated code on either 32-bit or 64-bit without trouble.

However, when I run Ragel itself on x86_64 (Ubuntu 8.10), multi-byte
UTF-8 characters aren't accepted by my expression. Again, this isn't
too alarming, since I'm able to produce a working machine by generating
everything with the 32-bit executable, but I figured you'd want to hear
about this.

I'm attaching a simple test case. If you want to see the full thing
in context: <http://github.com/why/potion>.
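
For comparison, here is a rough sketch in plain C of what the utf8
expression in the attached test case is meant to accept. It only
restates the byte ranges from that expression by hand. The names
utf8_len and utf8_count are made up for this illustration and are not
part of Ragel or of the test case. The bytes are read through an
unsigned char pointer so that values above 0x7f compare as 0x80..0xff
whether or not plain char is signed on the platform.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of Ragel): returns the length in bytes
 * of the character starting at s if it matches one alternative of the
 * utf8 expression, or 0 if none matches. n is the number of bytes left. */
static size_t utf8_len(const unsigned char *s, size_t n)
{
  size_t len, i;

  assert(s != NULL);
  if (n == 0)
    return 0;

  /* Lead byte picks the alternative; the ranges are copied from the
   * expression in the test case. */
  if (s[0] == 0x09 || s[0] == 0x0a || s[0] == 0x0d ||
      (s[0] >= 0x20 && s[0] <= 0x7e))
    len = 1;
  else if (s[0] >= 0xc2 && s[0] <= 0xdf)
    len = 2;
  else if (s[0] >= 0xe0 && s[0] <= 0xef)
    len = 3;
  else if (s[0] >= 0xf0 && s[0] <= 0xf4)
    len = 4;
  else
    return 0;

  if (len > n)
    return 0;

  /* Every following byte must fall in 0x80..0xbf. */
  for (i = 1; i < len; i++)
    if (s[i] < 0x80 || s[i] > 0xbf)
      return 0;

  return len;
}

/* Hypothetical helper: counts the characters in the first n bytes of
 * str, or returns -1 at the first byte sequence that does not match. */
static int utf8_count(const char *str, size_t n)
{
  const unsigned char *p = (const unsigned char *)str;
  int count = 0;

  while (n > 0) {
    size_t len = utf8_len(p, n);
    if (len == 0)
      return -1;
    p += len;
    n -= len;
    count++;
  }
  return count;
}
```

With the test string encoded as UTF-8 (bytes 6e 61 c3 af 76 65),
utf8_count(str, 6) gives 5, one per character, which is the number of
TOKEN lines a correctly generated scanner should print.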

With fond feelings in the extreme,

_why
-------------- next part --------------
#include <stdio.h>
#include <string.h>

%%{
  machine utf8;
  utf8        = 0x09 | 0x0a | 0x0d | (0x20..0x7e) |
                (0xc2..0xdf) (0x80..0xbf) |
                (0xe0..0xef 0x80..0xbf 0x80..0xbf) |
                (0xf0..0xf4 0x80..0xbf 0x80..0xbf 0x80..0xbf);

  main := |*
    utf8 => { printf("TOKEN: %.*s\n", (int)(te - ts), ts); };
  *|;

  write data nofinal;
}%%

int main()
{
  int cs, act;
  char str[] = "naïve";
  char *p = str, *pe = str + strlen(str);
  char *ts, *te, *eof = 0;

  %% write init;
  %% write exec;

  return 0;
}

