HTML parsing - Ruby

  1. #1

    HTML parsing

    Hi folks,

    I need to parse some HTML. I've dug around the archives and so on and
    found the best solution to be Ned Konz's 'ruby-htmltools', which
    relies on 'html-parser'. Neither of these projects is really
    maintained, so I'm wondering what other people currently use.

    Cheers,
    Gavin



    Gavin Guest

  2. #2

    Re: HTML parsing

    Gavin Sinclair wrote:
     
    I was using a home-made solution, but I just decided (this weekend) to
    convert it to REXML: I would run HTML Tidy (which is already needed for
    ~60% of the pages I'm parsing now) and ask it to emit XHTML. I think
    that's the best approach. With my home-made solution, besides
    duplicating the work of parsing HTML, I needed a list of tags that
    don't have to be closed, etc.; with XHTML all of that is done for me,
    and then I get the familiar API of REXML [even though I've never used
    REXML yet :O) ].

    i think it's the best.

    emmanuel



    Emmanuel Guest

  3. #3

    Re: HTML parsing

    Emmanuel Touzery wrote:
     

    (Tidy was needed for many pages because of sloppy/invalid HTML, which
    it corrects.)

    emmanuel



    Emmanuel Guest
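
The Tidy-to-REXML approach from posts #2 and #3 could look roughly like the sketch below. It assumes Tidy has already converted the page to well-formed XHTML; the sample markup and the `extract_links` helper are invented for illustration and do not come from the thread. The sample also leaves out the `xmlns` declaration that real Tidy output carries, which keeps the XPath expression simple.

```ruby
require 'rexml/document'

# Stand-in for XHTML that Tidy might produce (hypothetical sample,
# without the xmlns declaration real Tidy output would include).
SAMPLE_XHTML = <<~XHTML
  <html>
    <head><title>Example</title></head>
    <body>
      <p>First <a href="/one">link</a></p>
      <p>Second <a href="/two">link</a><br /></p>
    </body>
  </html>
XHTML

# Hypothetical helper: returns [href, text] pairs for every <a> element.
def extract_links(xhtml)
  doc = REXML::Document.new(xhtml)
  REXML::XPath.match(doc, '//a').map do |anchor|
    [anchor.attributes['href'], anchor.text]
  end
end

extract_links(SAMPLE_XHTML).each do |href, text|
  puts "#{href} -> #{text}"
end
```

Because the input is XML by the time REXML sees it, the list of tags that may stay unclosed in HTML never has to be maintained by hand.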

  4. #4

    Re: HTML parsing


    "Gavin Sinclair" <com.au> wrote in message
    news:com.au...

    The last time I needed that, I used some kind of home-cooked regexp
    scanning. But I didn't need a real parser; I just wanted to extract a
    portion of the HTML file.

    robert

    Robert Guest
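
Regexp scanning of the kind mentioned in post #4 might look like the following sketch. The sample page and the two helper names are hypothetical. The patterns only handle simple cases (quoted `href` values, no `>` inside attribute values) and are not a substitute for a real parser.

```ruby
# Hypothetical sample page; any HTML string would do.
SAMPLE_HTML = <<~PAGE
  <html><head><title>Sample
     page</title></head>
  <body><a href="/a">A</a> <A HREF='/b'>B</A></body></html>
PAGE

# Returns the text inside <title>...</title> with whitespace collapsed,
# or nil if the page has no title element.
def page_title(html)
  title = html[%r{<title[^>]*>(.*?)</title>}im, 1]
  title && title.strip.gsub(/\s+/, ' ')
end

# Returns the href value of every <a> tag that has a quoted href.
def link_targets(html)
  html.scan(/<a\s[^>]*?href\s*=\s*["']([^"']*)["']/i).flatten
end

puts page_title(SAMPLE_HTML)
puts link_targets(SAMPLE_HTML).inspect
```

This works when only one or two known fragments are needed; once nesting or malformed markup matters, the Tidy-plus-REXML route is likely to be more robust.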

  5. #5

    Re: HTML parsing

    On Monday, February 2, 2004, 11:48:00 PM, Emmanuel wrote:
     
     
    > I was using a home-made solution, but I just decided (this weekend) to
    > convert it to REXML: I would run HTML Tidy (which is already needed for
    > ~60% of the pages I'm parsing now) and ask it to emit XHTML. I think
    > that's the best approach. With my home-made solution, besides
    > duplicating the work of parsing HTML, I needed a list of tags that
    > don't have to be closed, etc.; with XHTML all of that is done for me,
    > and then I get the familiar API of REXML [even though I've never used
    > REXML yet :O) ].

    The library I mentioned gives you a REXML::Document as well, so I'm
    using REXML for the first time. It's very good, but I'm struggling to
    really get a grip on it.

    The single most useful improvement to REXML for a beginner, IMO, is
    this: (more) reasonable implementations of #to_s and/or #inspect on
    Element and Attribute objects.

    As it is, I believe every element contains a link to its document,
    which in my case is large, and #inspect spits out thousands of lines
    of rubbish when all I want to see is the element I'm looking at. This
    makes it hard to use in 'irb'.

    (I know, I should start with a small document, but I'm trying to get
    my task done :)

    Cheers,
    Gavin



    Gavin Guest
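
One way to get a compact view of an element in irb, in the spirit of the request in post #5, is a small helper that prints only the tag, its attributes, and a child count. The `brief` method below is a hypothetical helper written for this sketch; it is not part of REXML and does not change REXML's own #inspect or #to_s.

```ruby
require 'rexml/document'

# Hypothetical helper (not part of REXML): builds a one-line summary of
# an element from its name, attributes, and number of child elements.
def brief(element)
  attrs = []
  element.attributes.each_attribute do |attr|
    attrs << %( #{attr.expanded_name}="#{attr.value}")
  end
  "<#{element.expanded_name}#{attrs.join}> (#{element.elements.size} child elements)"
end

doc = REXML::Document.new('<div id="main"><p>one</p><p>two</p></div>')
puts brief(doc.root)
```

Calling `brief(some_element)` in irb keeps the output to a single line regardless of how large the surrounding document is.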
