Lexing and Parsing in Ruby.

Ask a Question related to Ruby, Design and Development.

  1. #1

    Default Lexing and Parsing in Ruby.

    Ooh this is so,so sweet....

    If you know anything about lexical analyzers and parsers, read this
    carefully and realize I have done something very very very Good
    here...

    # Here is a lexical analyzer that matches that breaks a
    # Graphviz "Dot" file into tokens...
    # digraph a_name {
    # a -> b;
    # c -> d -> e;
    # f -> {a;b;c}
    # d;
    # };"
    #
    # Yields "digraph","a_name","{","a","->","b",";",....]
    def tokenize(string)
    regexp = %r{ ^ (\s+ | [a-zA-Z_][a-zA-Z0-9_]* | ([-+][0-9\.]+|[0-9\.]+) | -> | \{ | \} | \; ) }x
    match_data = regexp.match( string)
    while match_data
    token = match_data[0]
    yield token unless token =~ %r{ ^\s }x
    match_data = regexp.match( match_data.post_match)
    end
    end


    # Now here is the trick...
    # If we map tokens to characters
    # then parsing becomes a matter of ....
    # Matching a regular Expression again!!
    def parse(string)
    # Build up a list of tokens in an array
    token_list = []
    # Make a one to one map between token type and the array
    token_string = ''

    tokenize(string) do |token|
    token_list << token
    if token == 'digraph'
    token_string += 'd'
    elsif token =~ %r{ [\{\};] }x
    token_string += token
    elsif token == '->'
    token_string += '>'
    elsif
    token_string += 'i'
    end
    end

    # Now our example is a string "di{i>i;i>i>i;i>{i;i;i}i;}"
    regexp = %r{ ^ d i \{ ((i ; | i (> i)+ ; | i > \{ i (; i)* \})*) \} ;? $ }x
    match_data = regexp.match( token_string)
    raise "Parse error '#{token_string}' doesn't match '#{regexp.source}'" unless match_data
    return [token_list[1], match_data[1], token_list[match_data.begin(1)...match_data.end(1)]]
    end

    name, edge_string, edge_list = parse(io)

    # Now edge_string is i>i;i>i>i;i>{i;i;i}i;
    # name is "a_name"
    # edge_list is ["a","->","b",";","c",...]


    John Carter Phone : (64)(3) 358 6639
    Tait Electronics Fax : (64)(3) 359 4632
    PO Box 1645 Christchurch Email : [email]john.carter@tait.co.nz[/email]
    New Zealand

    A Million Monkeys can inflict worse things than just Shakespeare on
    your system.


    John Carter Guest

  2. Similar Questions and Discussions

    1. [ba-rb] BA-rb ( Bay Area Ruby Users Group ) - 'Generating Code in Ruby' by Jack Herrington.
      Count me in again, too. -- Jos Backus _/ _/_/_/ Sunnyvale, CA _/ _/ _/ _/ _/_/_/ _/ _/ _/ _/ jos at...
    2. BA-rb ( Bay Area Ruby Users Group ) - 'Generating Code in Ruby' by Jack Herrington.
      BA-rb (Bay Area Ruby language Users Group) is pleased to announce that it will begin meeting again. The topic for the first meeting will be a...
    3. ruby-talk: 80813 (Rite/Ruby2.0 & Ruby vs OCaml)
      Hope nobody finds this annoying. Somehow I missed this message when it was originally sent, but I thought my reply might be useful to sombody. ...
    4. RubyConf, Ruby Central, Ruby Garden temporary outage
      Hi -- Just to let everyone know that there's a network outage at the host for the RubyConf registration, which also affects Ruby Garden and Ruby...
    5. [ANN] ruby-freedb, ruby-serialport, ruby-mp3info moved to Rubyforge
      http://ruby-freedb.rubyforge.org/ http://ruby-serialport.rubyforge.org/ http://ruby-mp3info.rubyforge.org/ bye! --...
  3. #2

    Default Re: Lexing and Parsing in Ruby.

    John Carter <john.carter@tait.co.nz> skrev i en
    nyhedsmeddelelse:Pine.LNX.4.58.0311191821040.6241@ parore.tait.co.nz...
    > # Here is a lexical analyzer that matches that breaks a
    > # Graphviz "Dot" file into tokens...
    > # digraph a_name {
    > # a -> b;
    > # c -> d -> e;
    > # f -> {a;b;c}
    > # d;
    > # };"
    > #
    > # Yields "digraph","a_name","{","a","->","b",";",....]
    > def tokenize(string)
    Have you seen the LexerInRuby, its a nice example which shows how it
    can be done:
    [url]http://www.rubygarden.org/ruby?LexerInRuby[/url]

    --
    Simon Strandgaard


    Simon Strandgaard Guest

  4. #3

    Default Re: Lexing and Parsing in Ruby.


    "John Carter" <john.carter@tait.co.nz> schrieb im Newsbeitrag
    news:Pine.LNX.4.58.0311191821040.6241@parore.tait. co.nz...
    > Ooh this is so,so sweet....
    >
    > If you know anything about lexical analyzers and parsers, read this
    > carefully and realize I have done something very very very Good
    > here...
    Sorry, I don't see what's so very very very good about it. Will you
    enlighten me?
    > # Here is a lexical analyzer that matches that breaks a
    > # Graphviz "Dot" file into tokens...
    > # digraph a_name {
    > # a -> b;
    > # c -> d -> e;
    > # f -> {a;b;c}
    > # d;
    > # };"
    > #
    > # Yields "digraph","a_name","{","a","->","b",";",....]
    > def tokenize(string)
    > regexp = %r{ ^ (\s+ | [a-zA-Z_][a-zA-Z0-9_]* | ([-+][0-9\.]+|[0-9\.]+)
    | -> | \{ | \} | \; ) }x
    > match_data = regexp.match( string)
    > while match_data
    > token = match_data[0]
    > yield token unless token =~ %r{ ^\s }x
    > match_data = regexp.match( match_data.post_match)
    > end
    > end
    def tokenize(string)
    string.scan( /[a-z_][a-z_\d]* | [+-]?[.\d]+ | -> | [{};]/xi ) do |m|
    yield m
    end
    end
    > # Now here is the trick...
    > # If we map tokens to characters
    > # then parsing becomes a matter of ....
    > # Matching a regular Expression again!!
    I don't know the Graphviz format but this does only work as a real parser
    (i.e. one that verifies the input against a grammar) if that file format
    does not use nesting constructs with arbitrary depth, i.e., if it has a
    regular grammar (vs. context free grammar). This property of the language
    does not change if you replace tokens by shorter tokens.
    > def parse(string)
    > # Build up a list of tokens in an array
    > token_list = []
    > # Make a one to one map between token type and the array
    > token_string = ''
    >
    > tokenize(string) do |token|
    > token_list << token
    > if token == 'digraph'
    > token_string += 'd'
    > elsif token =~ %r{ [\{\};] }x
    > token_string += token
    > elsif token == '->'
    > token_string += '>'
    > elsif
    > token_string += 'i'
    > end
    > end
    Btw, note that string << 'i' is more efficient than string += 'i'.

    And you can combine the two steps with a slight modified regexp, since you
    know what you get during scanning already:

    def tokenize(string)
    tk=[]
    tl=""

    string.scan( /(digraph) | ([a-z_][a-z_\d]*) | ([+-]?[.\d]+) | (->) |
    ([{};])/xi ) do |di,name,num,arr,noise|
    if di
    tk << di
    tl << 'd'
    elsif name
    tk << name
    tl << 'i'
    elsif num
    tk << num
    tl << 'i'
    elsif arr
    tk << arr
    tl << '>'
    elsif noise
    tk << noise
    tl << noise
    else
    # never here
    end
    end

    [tk, tl]
    end

    Now you can do

    irb(main):144:0> tokens, pattern = tokenize( "digraph a_name {
    irb(main):145:1" a -> b;
    irb(main):146:1" c -> d -> e;
    irb(main):147:1" f -> {a;b;c}
    irb(main):148:1" d;
    irb(main):149:1" };" )
    => [["digraph", "a_name", "{", "a", "->", "b", "c", "->", "d", "->", "e",
    "f", "
    ->", "{", "a", "b", "c", "}", "d", "}"], "di{i>ii>i>ii>{iii}i}"]
    irb(main):150:0> tokens
    => ["digraph", "a_name", "{", "a", "->", "b", "c", "->", "d", "->", "e",
    "f", "-
    >", "{", "a", "b", "c", "}", "d", "}"]
    irb(main):151:0> pattern
    => "di{i>ii>i>ii>{iii}i}"
    irb(main):152:0>

    Rest as before.

    Cheers

    robert

    > # Now our example is a string "di{i>i;i>i>i;i>{i;i;i}i;}"
    > regexp = %r{ ^ d i \{ ((i ; | i (> i)+ ; | i > \{ i (; i)* \})*) \} ;?
    $ }x
    > match_data = regexp.match( token_string)
    > raise "Parse error '#{token_string}' doesn't match '#{regexp.source}'"
    unless match_data
    > return [token_list[1], match_data[1],
    token_list[match_data.begin(1)...match_data.end(1)]]
    > end
    >
    > name, edge_string, edge_list = parse(io)
    >
    > # Now edge_string is i>i;i>i>i;i>{i;i;i}i;
    > # name is "a_name"
    > # edge_list is ["a","->","b",";","c",...]
    >
    >
    > John Carter Phone : (64)(3) 358 6639
    > Tait Electronics Fax : (64)(3) 359 4632
    > PO Box 1645 Christchurch Email : [email]john.carter@tait.co.nz[/email]
    > New Zealand
    >
    > A Million Monkeys can inflict worse things than just Shakespeare on
    > your system.
    >
    >
    Robert Klemme Guest

Posting Permissions

  • You may not post new threads
  • You may post replies
  • You may not post attachments
  • You may not edit your posts

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139