
HTML Parsing? - Ruby


  1. #1

    HTML Parsing?


    Hi all,

    I need to access an HTTP server and interpret some data from the page I get
    back (basically for some minimal tests of a website). I know that I can use
    the Net::HTTP class to connect and retrieve the page, but then I am left with
    a string full of stuff.

    What do people use to parse this into something useful? Is REXML an option
    (although the HTML is not likely to be valid XML)? I have looked at
    html-parser on RAA but do not seem to be able to individually access the
    components of the returned page (for example, I need to see what the contents
    of a text control are, or what the caption of the <h2> tag is).

    I suppose using regexps is an option as well, but just wondering if I am
    missing some cool library that already does all this stuff?

    Thanks for any advice

    Martin

    --
    Martin Hart
    Arnclan Limited
    53 Union Street
    Dunstable, Beds
    LU6 1EX
    http://www.arnclanit.com




    Martin Guest
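
For the regexp route mentioned in the post above, here is a minimal sketch using only the standard library. The helper names (`fetch_page`, `h2_caption`, `text_input_value`) and the sample markup are hypothetical, invented for illustration. Regexps like these only cope with simple, predictable pages, so treat this as a stopgap for quick checks, not a general HTML parser.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: fetch a page body as a String (needs network access).
def fetch_page(url)
  Net::HTTP.get(URI.parse(url))
end

# Return the text of the first <h2> element with nested tags stripped,
# or nil when the page has no <h2>.
def h2_caption(html)
  match = html.match(%r{<h2[^>]*>(.*?)</h2>}mi)
  match && match[1].gsub(/<[^>]+>/, '').strip
end

# Return the value attribute of the <input> with the given name,
# or nil when no such input (or no value attribute) is found.
def text_input_value(html, name)
  html.scan(/<input\b[^>]*>/mi).each do |tag|
    next unless tag =~ /\bname\s*=\s*["']#{Regexp.escape(name)}["']/i
    return tag[/\bvalue\s*=\s*["']([^"']*)["']/i, 1]
  end
  nil
end

# Sample markup standing in for a fetched page (an assumption).
sample = <<~PAGE
  <html><body>
    <h2 class="title">Order <em>status</em></h2>
    <form><input type="text" name="customer" value="Martin"></form>
  </body></html>
PAGE

puts h2_caption(sample)                    # prints "Order status"
puts text_input_value(sample, 'customer')  # prints "Martin"
```

In a real test the string would come from something like `fetch_page('http://example.com/')`; the URL is a placeholder.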

  2. #2

    Re: HTML Parsing?

    On Thursday 05 of February 2004 18:24, Martin Hart wrote: 

    see the thread at
    http://www.ruby-talk.org/cgi-bin/vframe.rb/ruby/ruby-talk/91265?91157-91621+split-mode-vertical

    emmanuel


    Emmanuel Guest

  3. #3

    Re: HTML Parsing?

    On Friday, February 6, 2004, 5:39:15 AM, Dave wrote:


    For the OP: you can use the above library to convert HTML into a
    REXML::Document, then pull it apart as you please.

    Gavin



    Gavin Guest
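
To illustrate the "pull it apart" half of that advice, here is a rough sketch of querying a REXML::Document with XPath. It assumes the markup is already well-formed (for example, after conversion by a library such as htmltools). The conversion step is not shown because its API is not given in this thread. The method name `page_fields` and the sample page are hypothetical.

```ruby
require 'rexml/document'

# Hypothetical helper: pull the first <h2> caption and the value of the
# first text input out of well-formed (X)HTML.
def page_fields(xhtml)
  doc = REXML::Document.new(xhtml)
  heading = REXML::XPath.first(doc, '//h2')
  input = REXML::XPath.first(doc, "//input[@type='text']")
  {
    heading: heading && heading.text.to_s.strip,
    value: input && input.attributes['value']
  }
end

# Well-formed sample markup (an assumption; real-world HTML often is not).
sample = <<~PAGE
  <html>
    <body>
      <h2>Order status</h2>
      <form>
        <input type="text" name="customer" value="Martin" />
      </form>
    </body>
  </html>
PAGE

fields = page_fields(sample)
puts fields[:heading]  # prints "Order status"
puts fields[:value]    # prints "Martin"
```

REXML raises `REXML::ParseException` on markup that is not well-formed, which is why a tolerant HTML-to-XML step is needed first for pages from the wild.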

  4. #4

    Re: HTML Parsing?

    On Thursday 05 February 2004 21:02, Gavin Sinclair wrote: 
    >
    > For the OP: you can use the above library to convert HTML into a
    > REXML::Document, then pull it apart as you please.
    >
    > Gavin

    Thanks for all the advice - I can't believe that I missed the similar thread
    started by Gavin only 4 days ago :-(

    Sorry for the noise.

    Cheers,
    Martin

    --
    Martin Hart
    Arnclan Limited
    53 Union Street
    Dunstable, Beds
    LU6 1EX
    http://www.arnclanit.com




    Martin Guest

  5. #5

    Re: HTML Parsing (round 2)

    On Friday 06 February 2004 12:40, Martin Hart wrote: 
    > >
    > > For the OP: you can use the above library to convert HTML into a
    > > REXML::Document, then pull it apart as you please.
    > >
    > > Gavin

    OK, feel free to call me an idiot here, but what versions of html-parser and
    htmltools are you running?

    I downloaded both html-parser and patched-html-parser from RAA, which
    installed themselves into site_ruby/ (not where I'd expect them -
    site_ruby/1.8/...). I did this because htmltools appears to depend on one of
    them, although this is not mentioned in the README (version 1.06).

    Then I downloaded htmltools from RubyForge, which first fails to install
    because the sgml-parser.rb file is not in "html/sgml-parser", which is where
    it is supposed(?) to be.

    Anyway, after moving files to where I presume they should be installed, the
    htmltools library fails to install because the tests do not run (all 15 unit
    tests fail with "NameError: uninitialized constant
    HTML::TestStackingParser").

    My environment is Ruby 1.8.1 on Linux.

    My next step is to just install the files by hand and then try again - but I
    would be interested to hear if anybody else has experienced similar
    installation problems - or if I am just missing something obvious?

    Cheers,
    Martin




    Martin Guest
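
When files land in an unexpected directory like this, one way to see what Ruby will actually find is to inspect the site directories and search the load path by hand. The sketch below uses only `rbconfig` from the standard library. The helper `find_in_load_path` is hypothetical, and `html/sgml-parser` is simply the feature name quoted in the post above.

```ruby
require 'rbconfig'

# Hypothetical helper: return the full path of "<feature>.rb" in the first
# load-path directory that contains it, or nil if none does.
def find_in_load_path(feature, load_path = $LOAD_PATH)
  load_path.each do |dir|
    candidate = File.join(dir.to_s, "#{feature}.rb")
    return candidate if File.file?(candidate)
  end
  nil
end

# Where this Ruby expects site-wide libraries to live.
puts "site_ruby root:     #{RbConfig::CONFIG['sitedir']}"
puts "versioned site dir: #{RbConfig::CONFIG['sitelibdir']}"

# Check whether the file a library requires is reachable.
location = find_in_load_path('html/sgml-parser')
puts location || 'html/sgml-parser.rb is not on the load path'
```

If the file turns up in a directory other than the one the dependent library expects, that mismatch would explain a failed `require` during installation.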

  6. #6

    Re: HTML Parsing (round 2)

    On Saturday, February 7, 2004, 1:06:19 AM, Martin wrote:

    I got my stuff from http://bike-nomad.com/ruby/ and its linked
    resources.

    Cheers,
    Gavin



    Gavin Guest

  7. #7

    Re: HTML Parsing (round 2)


    cc: Johannes Brodwall (email)
    ------------------------------

    In comp.lang.ruby NG / ruby-talk ML, "Martin Hart" wrote:
     

    In ruby-htmltools/test/tc_stacking-parser.rb, replace
    line 35:
    parser = HTML::TestStackingParser.new(true, self)
    with:
    parser = TestStackingParser.new(true, self)

    ----
    Occurrences of "set_up" have been changed to "setup" for Test::Unit.
    For consistency, all "tear_down" should be changed to "teardown".
    ----

    It seems that Johannes' idea is to include sgml-parser with the
    updated htmltools library. (It's in his CVS tarball.)
    IMHO, this would make a good home for the whole of html-parser (patched)
    (only 31Kb including docs). As long as the original author and packager
    are credited in the README, I don't know that anyone would object on grounds
    other than duplication of dormant library code. Development could be
    continued here.

    There should be no need (?) to distribute the RDoc output now that it's
    built into Ruby.


    daz



    daz Guest
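
The one-line fix in the post above is consistent with the test helper class being defined at the top level and not inside the `HTML` module. That is an assumption, since the test file itself is not shown in this thread. Under that assumption, the following standalone sketch reproduces the NameError with stand-in definitions. None of this is the real htmltools code.

```ruby
# Stand-in for the library's namespace (assumption: HTML is a module).
module HTML
end

# Stand-in for the test helper, defined at the top level (assumption).
class TestStackingParser
  def initialize(verbose, test_case)
    @verbose = verbose
    @test_case = test_case
  end
end

# Try the qualified lookup used on the original line 35 and return the
# error message, or nil if the lookup unexpectedly succeeds.
def qualified_lookup_error
  HTML::TestStackingParser.new(true, nil)
  nil
rescue NameError => e
  e.message
end

# The message starts with "uninitialized constant HTML::TestStackingParser".
puts qualified_lookup_error

# The unqualified name resolves, which is what the fix relies on.
parser = TestStackingParser.new(true, nil)
puts parser.class
```

A scoped reference such as `HTML::Name` only searches the `HTML` module, so a class defined outside it is not found even though the bare name works.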

  8. #8

    Re: HTML Parsing (round 2)

    On Sunday 08 February 2004 10:40, daz wrote: 

    Thanks - I got there in the end anyway by manually installing all the files I
    had downloaded and tweaking them as necessary.

    Just to append a note to the mini thread that started on packaging as a result
    of this... while a packaging system with all the works would be great, it
    seems to me what is really needed soonest is a definitive place where we can
    take downloads from. I got the versions of code that I am using from RAA...

    Where I came unstuck is that there appear to be two different(?) versions of
    ruby-htmltools. One by Ned Konz that is linked to from RAA, and one by
    Johannes Brodwall that is on rubyforge. I don't know the history of these -
    it may well be that they are the same product that has changed ownership etc,
    but it does cause confusion (at least in my case :-) when two people download
    the same thing from two different places. There is no common frame of
    reference, we think that we are talking about the same code but we may not
    be.


    Cheers,
    Martin

    --
    Martin Hart
    Arnclan Limited
    53 Union Street
    Dunstable, Beds
    LU6 1EX
    http://www.arnclanit.com




    Martin Guest

  9. #9

    Re: HTML Parsing (round 2)

    On Sunday, February 8, 2004, 11:38:24 PM, Martin wrote:

    True, but this is an isolated case. I've never seen so much
    fragmentation with a Ruby library as I've seen with htmltools :)

    Since there is an htmltools project on RubyForge, that should become
    the definitive one, once it's ensured that it's fully up to date.
    I'll be doing more HTML parsing fairly soon, so I'll try to do my bit
    in this area.

    Cheers,
    Gavin



    Gavin Guest

  10. #10

    Re: HTML Parsing (round 2)


    "Martin Hart" wrote: 

    Until I read this thread, I was unaware of 1.06 on RubyForge which *is*
    an "updated for Ruby 1.8" version of 1.04 from RAA. ((garbage sentence))

    This problem isn't too common atm, but you're right - this example is in a bit
    of a mess. The issue is understood by those who matter.

    Sorry we had to share the same inconvenience :)


    Cheers,


    daz



    daz Guest

  11. #11

    Re: HTML Parsing (round 2)

    "daz" <karoo.co.uk> wrote in message 

    Thank you all for the feedback, and especially to daz for alerting me
    directly (I haven't paid attention to ruby-talk lately).

    I have updated the tarball to include sgml-parser. Sorry about the
    slip-up.

    I will not have time to work much on the project for long. If anyone
    wants to lend a hand, please speak up.


    ~Johannes
    Johannes Guest

  12. #12

    Re: HTML Parsing (round 2)


    "Johannes Brodwall" wrote:
     

    That's greatly appreciated, Johannes, thank you for
    this and for your previous updates to this library.

    http://rubyforge.org/projects/ruby-htmltools/
    (Version 1.07)


    daz



    daz Guest
