Professional Web Applications Themes

Simple parsing of sloppy HTML - LittleLexer - Ruby

There have been a couple of threads on parsing HTML. If you looking for something excruciatingly simple (and simplistic) that will p the sloppiest of HTML pages, you could do worse than look at LittleLexer. http://littlelexer.rubyforge.org/ At a mere 44 lines of non-blank non-comment lines of Ruby, it has the virtue of a certain elegant simplicity. It includes a rudimentary HTML pr as an example. John Carter Phone : (64)(3) 358 6639 Tait Electronics Fax : (64)(3) 359 4632 PO Box 1645 Christchurch Email : co.nz New Zealand Note to all marketers. If you want to sell things to me, ...

  1. #1

    Default Simple parsing of sloppy HTML - LittleLexer

    There have been a couple of threads on parsing HTML.

    If you looking for something excruciatingly simple (and simplistic) that
    will p the sloppiest of HTML pages, you could do worse than look at
    LittleLexer.

    http://littlelexer.rubyforge.org/

    At a mere 44 lines of non-blank non-comment lines of Ruby, it has the
    virtue of a certain elegant simplicity.

    It includes a rudimentary HTML pr as an example.


    John Carter Phone : (64)(3) 358 6639
    Tait Electronics Fax : (64)(3) 359 4632
    PO Box 1645 Christchurch Email : co.nz
    New Zealand

    Note to all marketers. If you want to sell things to me, buy Google words.

    I refuse on principle to buy anything sold by spam or popup and I
    never follow any links found in a spam. I do however use Google and
    will often follow the neat non-irritating Google Word ads that are of
    interest to me.


    John Guest

  2. #2

    Default Re: Simple parsing of sloppy HTML - LittleLexer

    il Wed, 11 Feb 2004 10:59:41 +0900, John Carter
    <co.nz> ha scritto::

     

    I wonder if you're going to add some more doentation. It seem that
    littlelexer is someway interesting, but I could not understand it last
    time I looked at it (note that I'm stupid)
    gabriele Guest

  3. #3

    Default Re: Simple parsing of sloppy HTML - LittleLexer

    John Carter wrote: 

    Cute, but it's not a pr, it's a composite lexer.
    A regular expression engine cannot *p* context-free grammars (like
    most programming languages), even if you can layer them like this.
    Clifford Guest

  4. #4

    Default Re: Simple parsing of sloppy HTML - LittleLexersame

    On Thu, 12 Feb 2004, Clifford Heath wrote:
     
    >
    > Cute, but it's not a pr, it's a composite lexer.
    > A regular expression engine cannot *p* context-free grammars (like
    > most programming languages), even if you can layer them like this.[/ref]


    Yes and no.

    It's a while since I did the theory courses so forgive me if my
    details as to whether something is LR or LL is wrong....

    Note 1: If I remember correctly even a full "pr generator" like yacc/Bison
    cannot recognize an arbitrary CFG. They only recognize a limited subset
    like LL(1)

    Thus LittleLexer is a pr in the same sense that yacc/Bison are,
    but the subset of CFG's it can recognize is more limited.

    Note 2: The fundamental limitation on regex's as a pr is they
    don't cope with nested structures.

    So as a mental exercise I was trying to visualize how I would get
    LittleLexer to chew on that simplest and best poster child example of
    a "nested" language, s-expressions. (Lisp like expressions)

    The answer is quite simple in that case, have a lexer pass, no surprise.
    And then have a pass that recognizes unbracketed regions.
    %r{\([^\(\)]*\)} maps to 'f'

    eg.

    (+ (- 30 20) 10)

    Feed it to the lexer and it outputs...
    '(+(-dd)d)', ['(','+','(','-','30','20',')','10',')']

    Feed it to the pr and it produces....

    '(+fd)', ['(','+','(-dd)','d',')']

    Feed it back into exactly the same little lexer instance and you have...
    'f', '['(+fd)']

    Now there is an option on the LittleLexer scan method that allows you
    to pass in the resulting token list from the previous pass so what you
    get is a AST (Abstract Syntax Tree) of the s-expression. Voila!
    LittleLexer pd a language that is not a regular expression.

    The LispPr as a proof of concept is easy to do. What I want to do
    for version 2 of LitteLexer is generalize the notion of handling
    nested structures like that. But since the whole thing is a "late
    night, can't sleep" project, I'm stilling mulling over what I mean by
    "nested structure".

    For example consider Ruby if/else expressions...

    if a
    dostuff
    if b
    do_more_stuff
    else
    dothing
    end
    end

    Would map to something like....
    if => 'i'
    statement => 's'
    else => 'l'
    end => 'e'
    expression => 'x'

    So that would be...
    'ixsixslsee'

    So we could have a regex that recognized "atomic" if/else statements...
    [/ixs+(ls)?e/, 's']

    For example, a LL(1) pr must be able to decide what path it is
    going down by one token of look ahead. In a sense this iterative
    approach I'm using is a bit better off. I can use the entire string to
    decide whether I have an atomic item or a nested item.

    So why doesn't Bison and the like do this? Because it is slow, the
    whole art and science of compiler compiler's evolved in days of very
    slow computers trying to do far more than they had power for.

    On the other hand it is not too slow as Ruby's regex engine is really
    very good and I'm only manipulating strings and pointers which are
    typically an order of magnitude smaller than the original file.





    John Carter Phone : (64)(3) 358 6639
    Tait Electronics Fax : (64)(3) 359 4632
    PO Box 1645 Christchurch Email : co.nz
    New Zealand

    Note to all marketers. If you want to sell things to me, buy Google words.

    I refuse on principle to buy anything sold by spam or popup and I
    never follow any links found in a spam. I do however use Google and
    will often follow the neat non-irritating Google Word ads that are of
    interest to me.


    John Guest

  5. #5

    Default Re: Simple parsing of sloppy HTML - LittleLexersame

    Regular expressions can't p reentrant definitions, for example
    stmt ::= while expr stmt | if expr stmt | ...

    Sure you can wrap them up in loops and other things, and that's
    what prs do when they wrap themselves around a lexer, but
    you must have a layer that is not a regular expression, or it
    doesn't work. This has been mathematically proven from the
    definition of RE, so if you find a way around it, you don't
    have an RE.
     

    Yes, but if the if/else statement contains another, and that contains
    another, to arbitrary depth, then no RE can p it, with any amount
    of look-ahead.
    Clifford Guest

  6. #6

    Default Re: Simple parsing of sloppy HTML - LittleLexersame

    On Fri, 13 Feb 2004, Clifford Heath wrote:
     

    Correct, and the fact that I can and do p lisp via this trick does
    demonstrate that I can recognize a larger set than a RE.


    John Carter Phone : (64)(3) 358 6639
    Tait Electronics Fax : (64)(3) 359 4632
    PO Box 1645 Christchurch Email : co.nz
    New Zealand

    Note to all marketers. If you want to sell things to me, buy Google words.

    I refuse on principle to buy anything sold by spam or popup and I
    never follow any links found in a spam. I do however use Google and
    will often follow the neat non-irritating Google Word ads that are of
    interest to me.


    John Guest

Similar Threads

  1. [XML::Simple-2.12] problems parsing non ASCII strings
    By Michel Rodriguez in forum PERL Modules
    Replies: 4
    Last Post: July 17th, 12:55 AM
  2. Html parsing
    By j in forum Macromedia Director Lingo
    Replies: 6
    Last Post: April 18th, 01:29 PM
  3. parsing HTML
    By Andrew in forum PERL Beginners
    Replies: 6
    Last Post: July 22nd, 02:15 PM
  4. simple lexing/parsing task
    By Martin in forum Ruby
    Replies: 4
    Last Post: February 10th, 09:05 PM

Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139