tuning/optimizing scanners

Chuck Remes cremes.devl... at mac.com
Fri Oct 5 15:47:25 UTC 2007


I've written a log parsing tool using Ragel and Ruby. I'm using the
scanner construct to do the parsing, but it appears to run very
slowly. I fear I may have chosen the wrong approach for parsing the
log. (And yes, I know Ruby isn't the quickest language out
there...) :-)
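
To put a number on "slowly", here is a rough sketch of how throughput could be measured with Ruby's standard Benchmark module. The parse_line method is a hypothetical stand-in for the generated parser's entry point, and lines_per_second and the sample data are invented for illustration.

```ruby
require 'benchmark'

# Hypothetical stand-in for the real parser; replace the body with a
# call into the Ragel-generated code.
def parse_line(line)
  line.length
end

# Parse every line `repeat` times and return the throughput in lines
# per second.
def lines_per_second(lines, repeat = 100)
  elapsed = Benchmark.realtime do
    repeat.times { lines.each { |line| parse_line(line) } }
  end
  # Guard against a zero reading from a coarse clock.
  elapsed = Float::EPSILON if elapsed.zero?
  (lines.size * repeat) / elapsed
end

sample = ['{market = ICE | type = order | volume = 1}'] * 1000
rate = lines_per_second(sample)
puts format('%.0f lines/sec', rate)
```

Running the same harness against a plain String#split version of the parser would give a baseline to compare the scanner against.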

The log in question is a set of key/value pairs that look like this  
(this is one line):

Oct  1 09:50:33.37204 [29193]: {market = ICE | type = order | order_id = 4 | buy = 1 | price = 80.83 | volume = 1 | date = 2007-10-01 | time = 09:50:33.37201 | metadata = {l={f=Quote|g=4|j=1|sid=8290182729}|ac=289182|cf=2881|ca= 289182}}
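
To make the target output concrete, here is a minimal plain-Ruby sketch (no Ragel) of what I want out of one such line: a hash of the top-level key/value pairs, with the nested metadata value kept as a single string. The method names split_top_level and parse_record are made up for this example, and it assumes braces are always balanced.

```ruby
# Split the body of one log line into top-level "key = value" fields.
# A '|' inside nested braces (the metadata value) is not a separator.
def split_top_level(body)
  fields = []
  depth = 0
  current = String.new
  body.each_char do |ch|
    case ch
    when '{' then depth += 1
    when '}' then depth -= 1
    end
    if ch == '|' && depth.zero?
      fields << current
      current = String.new
    else
      current << ch
    end
  end
  fields << current unless current.empty?
  fields
end

def parse_record(line)
  # Drop the timestamp and pid prefix; keep what sits inside the
  # outermost pair of braces.
  start = line.index('{')
  stop = line.rindex('}')
  return {} if start.nil? || stop.nil? || stop <= start
  record = {}
  split_top_level(line[(start + 1)...stop]).each do |field|
    key, value = field.split('=', 2)
    next if value.nil?
    record[key.strip] = value.strip
  end
  record
end

line = 'Oct  1 09:50:33.37204 [29193]: {market = ICE | type = order | ' \
       'order_id = 4 | buy = 1 | price = 80.83 | volume = 1 | ' \
       'date = 2007-10-01 | time = 09:50:33.37201 | ' \
       'metadata = {l={f=Quote|g=4|j=1|sid=8290182729}|ac=289182|cf=2881|ca= 289182}}'
record = parse_record(line)
```

All values stay strings here; converting them is left out to keep the sketch short.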

I'm uninterested in the date and other data at the line start, so I  
throw it away. I primarily search for the key (e.g. 'market = ') and  
then fgoto another machine to parse the value. Upon hitting a pipe  
character, I fgoto main again and look for another key. I pasted in a  
section of the machine below to illustrate.

Is this the correct approach? Is there a superior method for rapidly  
parsing long text strings? Be gentle with me... I'm new to this stuff.

Unfortunately, each log record has a slightly different format (about
15 different formats in total). I also can't count on the key/value
pairs showing up in the same order every time.
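
One way the varying formats and ordering might be handled, sketched in plain Ruby: a lookup table from log key to record field and conversion, so that the order of the pairs does not matter and keys a format does not use are simply skipped. The CONVERTERS table and build_record are hypothetical names, and the conversions are guesses based on the sample line above.

```ruby
# Hypothetical converter table: log key => [record field, conversion].
# Keys that are absent from the table are ignored.
CONVERTERS = {
  'market'   => [:market,   lambda { |v| v }],
  'order_id' => [:order_id, lambda { |v| v.to_i }],
  'buy'      => [:buy,      lambda { |v| v == '1' }],
  'price'    => [:price,    lambda { |v| v.to_f }],
  'volume'   => [:quantity, lambda { |v| v.to_i }]
}

# pairs is a list of [key, value] strings in whatever order they
# appeared on the line.
def build_record(pairs)
  temp = {}
  pairs.each do |key, value|
    field, convert = CONVERTERS[key]
    next if field.nil?
    temp[field] = convert.call(value)
  end
  temp
end

in_order = build_record([['market', 'ICE'], ['price', '80.83'], ['volume', '1']])
shuffled = build_record([['volume', '1'], ['unknown', 'x'],
                         ['price', '80.83'], ['market', 'ICE']])
```

Both calls produce the same hash, since Ruby compares hashes by content rather than insertion order.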

Any suggestions?

----------- snip here ---------------
	feedcode_name = [0-9a-zA-Z\-]+;
	numbers = [0-9]+;

#####
	feedcode := |*
		spaces;

		'|' => { fgoto main; };

		feedcode_name => { temp[:feedcode] = data[tokstart..tokend-1]; };
		any => {puts "ERR: feedcode #{data[tokstart..tokend-1]}"};
	*|;
#####
	volume := |*
		spaces;

		'|' => { fgoto main; };

		numbers => { temp[:quantity] = data[tokstart..tokend-1].to_i; };
		any => {puts "ERR: volume #{data[tokstart..tokend-1]}"};
	*|;
#####
	main := |*
		'module = ' => { fgoto module; };

		'market = ' => { fgoto market; };

		'feedcode = ' => { fgoto feedcode; };

		'type = ' => { fgoto type; };

		'order_id = ' => { fgoto order_id; };

		'buy = ' => { fgoto activity; };

		'price = ' => { fgoto price; };

		'volume = ' => { fgoto volume; };

		'date = ' => { fgoto date; };

		'time = ' => { fgoto time; };

		( numbers | letters | spaces | '\n' | '{' | '}' | other | any );

	*|;
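
A note on the token slices in the actions: as far as I understand the scanner, tokend holds the offset one past the last matched character, so the slice has to stop before it. The plain-Ruby sketch below uses made-up offsets for the token "125" to show the difference between the inclusive and exclusive range forms.

```ruby
data = 'volume = 125 | date = 2007-10-01'

# Hypothetical offsets for the token "125": tokstart is the index of
# the first character, tokend is the index one past the last one.
tokstart = 9
tokend = 12

exclusive = data[tokstart...tokend]     # three dots: stops before tokend
minus_one = data[tokstart..tokend - 1]  # the same slice, written inclusively
inclusive = data[tokstart..tokend]      # two dots: one character too many

puts exclusive.inspect
puts inclusive.inspect
```

Note that to_i hides the extra character on a numeric token, since it ignores trailing non-digits, but a string token would carry it along.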


