Ask a Question related to Ruby, Design and Development.
-
John Carter #1
Lexing and Parsing in Ruby.
Ooh this is so,so sweet....
If you know anything about lexical analyzers and parsers, read this
carefully and realize I have done something very very very Good
here...
# Here is a lexical analyzer that matches that breaks a
# Graphviz "Dot" file into tokens...
# digraph a_name {
# a -> b;
# c -> d -> e;
# f -> {a;b;c}
# d;
# };"
#
# Yields "digraph","a_name","{","a","->","b",";",....]
def tokenize(string)
regexp = %r{ ^ (\s+ | [a-zA-Z_][a-zA-Z0-9_]* | ([-+][0-9\.]+|[0-9\.]+) | -> | \{ | \} | \; ) }x
match_data = regexp.match( string)
while match_data
token = match_data[0]
yield token unless token =~ %r{ ^\s }x
match_data = regexp.match( match_data.post_match)
end
end
# Now here is the trick...
# If we map tokens to characters
# then parsing becomes a matter of ....
# Matching a regular Expression again!!
def parse(string)
# Build up a list of tokens in an array
token_list = []
# Make a one to one map between token type and the array
token_string = ''
tokenize(string) do |token|
token_list << token
if token == 'digraph'
token_string += 'd'
elsif token =~ %r{ [\{\};] }x
token_string += token
elsif token == '->'
token_string += '>'
elsif
token_string += 'i'
end
end
# Now our example is a string "di{i>i;i>i>i;i>{i;i;i}i;}"
regexp = %r{ ^ d i \{ ((i ; | i (> i)+ ; | i > \{ i (; i)* \})*) \} ;? $ }x
match_data = regexp.match( token_string)
raise "Parse error '#{token_string}' doesn't match '#{regexp.source}'" unless match_data
return [token_list[1], match_data[1], token_list[match_data.begin(1)...match_data.end(1)]]
end
name, edge_string, edge_list = parse(io)
# Now edge_string is i>i;i>i>i;i>{i;i;i}i;
# name is "a_name"
# edge_list is ["a","->","b",";","c",...]
John Carter Phone : (64)(3) 358 6639
Tait Electronics Fax : (64)(3) 359 4632
PO Box 1645 Christchurch Email : [email]john.carter@tait.co.nz[/email]
New Zealand
A Million Monkeys can inflict worse things than just Shakespeare on
your system.
John Carter Guest
-
[ba-rb] BA-rb ( Bay Area Ruby Users Group ) - 'Generating Code in Ruby' by Jack Herrington.
Count me in again, too. -- Jos Backus _/ _/_/_/ Sunnyvale, CA _/ _/ _/ _/ _/_/_/ _/ _/ _/ _/ jos at... -
BA-rb ( Bay Area Ruby Users Group ) - 'Generating Code in Ruby' by Jack Herrington.
BA-rb (Bay Area Ruby language Users Group) is pleased to announce that it will begin meeting again. The topic for the first meeting will be a... -
ruby-talk: 80813 (Rite/Ruby2.0 & Ruby vs OCaml)
Hope nobody finds this annoying. Somehow I missed this message when it was originally sent, but I thought my reply might be useful to sombody. ... -
RubyConf, Ruby Central, Ruby Garden temporary outage
Hi -- Just to let everyone know that there's a network outage at the host for the RubyConf registration, which also affects Ruby Garden and Ruby... -
[ANN] ruby-freedb, ruby-serialport, ruby-mp3info moved to Rubyforge
http://ruby-freedb.rubyforge.org/ http://ruby-serialport.rubyforge.org/ http://ruby-mp3info.rubyforge.org/ bye! --... -
Simon Strandgaard #2
Re: Lexing and Parsing in Ruby.
John Carter <john.carter@tait.co.nz> skrev i en
nyhedsmeddelelse:Pine.LNX.4.58.0311191821040.6241@ parore.tait.co.nz...Have you seen the LexerInRuby, its a nice example which shows how it> # Here is a lexical analyzer that matches that breaks a
> # Graphviz "Dot" file into tokens...
> # digraph a_name {
> # a -> b;
> # c -> d -> e;
> # f -> {a;b;c}
> # d;
> # };"
> #
> # Yields "digraph","a_name","{","a","->","b",";",....]
> def tokenize(string)
can be done:
[url]http://www.rubygarden.org/ruby?LexerInRuby[/url]
--
Simon Strandgaard
Simon Strandgaard Guest
-
Robert Klemme #3
Re: Lexing and Parsing in Ruby.
"John Carter" <john.carter@tait.co.nz> schrieb im Newsbeitrag
news:Pine.LNX.4.58.0311191821040.6241@parore.tait. co.nz...Sorry, I don't see what's so very very very good about it. Will you> Ooh this is so,so sweet....
>
> If you know anything about lexical analyzers and parsers, read this
> carefully and realize I have done something very very very Good
> here...
enlighten me?
| -> | \{ | \} | \; ) }x> # Here is a lexical analyzer that matches that breaks a
> # Graphviz "Dot" file into tokens...
> # digraph a_name {
> # a -> b;
> # c -> d -> e;
> # f -> {a;b;c}
> # d;
> # };"
> #
> # Yields "digraph","a_name","{","a","->","b",";",....]
> def tokenize(string)
> regexp = %r{ ^ (\s+ | [a-zA-Z_][a-zA-Z0-9_]* | ([-+][0-9\.]+|[0-9\.]+)def tokenize(string)> match_data = regexp.match( string)
> while match_data
> token = match_data[0]
> yield token unless token =~ %r{ ^\s }x
> match_data = regexp.match( match_data.post_match)
> end
> end
string.scan( /[a-z_][a-z_\d]* | [+-]?[.\d]+ | -> | [{};]/xi ) do |m|
yield m
end
end
I don't know the Graphviz format but this does only work as a real parser> # Now here is the trick...
> # If we map tokens to characters
> # then parsing becomes a matter of ....
> # Matching a regular Expression again!!
(i.e. one that verifies the input against a grammar) if that file format
does not use nesting constructs with arbitrary depth, i.e., if it has a
regular grammar (vs. context free grammar). This property of the language
does not change if you replace tokens by shorter tokens.
Btw, note that string << 'i' is more efficient than string += 'i'.> def parse(string)
> # Build up a list of tokens in an array
> token_list = []
> # Make a one to one map between token type and the array
> token_string = ''
>
> tokenize(string) do |token|
> token_list << token
> if token == 'digraph'
> token_string += 'd'
> elsif token =~ %r{ [\{\};] }x
> token_string += token
> elsif token == '->'
> token_string += '>'
> elsif
> token_string += 'i'
> end
> end
And you can combine the two steps with a slight modified regexp, since you
know what you get during scanning already:
def tokenize(string)
tk=[]
tl=""
string.scan( /(digraph) | ([a-z_][a-z_\d]*) | ([+-]?[.\d]+) | (->) |
([{};])/xi ) do |di,name,num,arr,noise|
if di
tk << di
tl << 'd'
elsif name
tk << name
tl << 'i'
elsif num
tk << num
tl << 'i'
elsif arr
tk << arr
tl << '>'
elsif noise
tk << noise
tl << noise
else
# never here
end
end
[tk, tl]
end
Now you can do
irb(main):144:0> tokens, pattern = tokenize( "digraph a_name {
irb(main):145:1" a -> b;
irb(main):146:1" c -> d -> e;
irb(main):147:1" f -> {a;b;c}
irb(main):148:1" d;
irb(main):149:1" };" )
=> [["digraph", "a_name", "{", "a", "->", "b", "c", "->", "d", "->", "e",
"f", "
->", "{", "a", "b", "c", "}", "d", "}"], "di{i>ii>i>ii>{iii}i}"]
irb(main):150:0> tokens
=> ["digraph", "a_name", "{", "a", "->", "b", "c", "->", "d", "->", "e",
"f", "-irb(main):151:0> pattern>", "{", "a", "b", "c", "}", "d", "}"]
=> "di{i>ii>i>ii>{iii}i}"
irb(main):152:0>
Rest as before.
Cheers
robert
$ }x> # Now our example is a string "di{i>i;i>i>i;i>{i;i;i}i;}"
> regexp = %r{ ^ d i \{ ((i ; | i (> i)+ ; | i > \{ i (; i)* \})*) \} ;?unless match_data> match_data = regexp.match( token_string)
> raise "Parse error '#{token_string}' doesn't match '#{regexp.source}'"token_list[match_data.begin(1)...match_data.end(1)]]> return [token_list[1], match_data[1],> end
>
> name, edge_string, edge_list = parse(io)
>
> # Now edge_string is i>i;i>i>i;i>{i;i;i}i;
> # name is "a_name"
> # edge_list is ["a","->","b",";","c",...]
>
>
> John Carter Phone : (64)(3) 358 6639
> Tait Electronics Fax : (64)(3) 359 4632
> PO Box 1645 Christchurch Email : [email]john.carter@tait.co.nz[/email]
> New Zealand
>
> A Million Monkeys can inflict worse things than just Shakespeare on
> your system.
>
>Robert Klemme Guest



Reply With Quote

