HTML::TokeParser::Simple


  1. #1

    HTML::TokeParser::Simple

    Someone want to show me how this module can help parse out html?

    I want to grab the text between <td>text</td> and be able to apply a regexp to
    get what I want.

    The problem is that my text is among 10,000 td tags, with the only difference
    being what the <th> tag above has in it.

    So if th tag = then store text between <td> into an array.

    Paul

    Paul Kraus Guest

  3. #2

    Re: HTML::TokeParser::Simple

    Paul Kraus wrote:
    > Someone want to show me how this module can help parse out html?
    >
    > I want to grab the text between <td>text</td> and be able to apply a regexp to
    > get what I want.
    >
    > The problem is that my text is among 10,000 td tags, with the only difference
    > being what the <th> tag above has in it.
    >
    > So if th tag = then store text between <td> into an array.
    >
    > Paul
    Have you looked into HTML::TokeParser? Might be a good place to start.

    Joseph

    R. Joseph Newton Guest

  4. #3

    Re: HTML::TokeParser::Simple


    On Wednesday, Nov 26, 2003, at 12:30 US/Pacific, Paul Kraus wrote:
    > Someone want to show me how this module can help parse out html?
    >
    > I want to grab the text between <td>text</td> and be able to apply a regexp to
    > get what I want.
    >
    > The problem is that my text is among 10,000 td tags, with the only difference
    > being what the <th> tag above has in it.
    >
    > So if th tag = then store text between <td> into an array.

    My first concern here: did you mean <th> or <tr>?

    A simple table would look like:
    <table>
    <tr>
    <th>header1</th>
    <th>header2</th>
    <th>header3</th>
    </tr>
    <tr>
    <td>_Row_1_Cell_1_</td>
    <td>_Row_1_Cell_2_</td>
    <td>_Row_1_Cell_3_</td>
    </tr>
    <tr>
    <td>_Row_2_Cell_1_</td>
    <td>_Row_2_Cell_2_</td>
    <td>_Row_2_Cell_3_</td>
    </tr>
    <tr>
    <td>_Row_3_Cell_1_</td>
    <td>_Row_3_Cell_2_</td>
    <td>_Row_3_Cell_3_</td>
    </tr>
    </table>

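    You have almost written your algorithm:

    use strict;
    use warnings;
    use HTML::TokeParser::Simple;

    # 'table.html' and $pattern are placeholders; substitute your own
    my $p       = HTML::TokeParser::Simple->new('table.html');
    my $pattern = qr/header2/;

    # skip ahead to the opening table tag
    while ( my $token = $p->get_token ) {
        last if $token->is_start_tag('table');
    }

    # there is a Table opening Tag, our hope now is that
    # we can get our Keys from the headers

    my $count  = 0;
    my $header = {};

    while ( my $token = $p->get_token ) {
        next if $token->is_start_tag(qr/^t[rh]$/);    # don't care
        last if $token->is_end_tag('tr');             # finished with headers
        if ( $token->is_end_tag('th') ) {             # one header cell done
            $count++;
            next;
        }
        if ( $token->is_text ) {
            my $text = $token->as_is;
            # remember the column index of every header that matches
            $header->{$count} = $text if $text =~ $pattern;
        }
    }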

    #
    # read the first row of headers, now to meander forward
    #

    At this point we know that, for a cell at column index $count
    (reset to 0 at the start of each data row), IF

    defined($header->{$count})

    then this is a column we have to grot data from
    into the storage set up,

    and that would be basically like the way that we
    grotted out the header sections, which is left as
    an exercise for the reader.
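
One possible way to finish that exercise is sketched below. It is only an illustration, not drieux's code: the inline sample table, the `$pattern` value, and the variable names `%wanted`, `@cells`, `$col`, `$in`, and `$buf` are all assumptions made for the example. It reads headers and data cells in a single pass and keeps the text of every <td> whose column header matched.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Sample data and pattern are assumptions for illustration only.
my $html = <<'HTML';
<table>
<tr><th>header1</th><th>header2</th><th>header3</th></tr>
<tr><td>_Row_1_Cell_1_</td><td>_Row_1_Cell_2_</td><td>_Row_1_Cell_3_</td></tr>
<tr><td>_Row_2_Cell_1_</td><td>_Row_2_Cell_2_</td><td>_Row_2_Cell_3_</td></tr>
</table>
HTML

my $pattern = qr/header2/;
my $p       = HTML::TokeParser::Simple->new( \$html );

my %wanted;      # column index => header text, for matching headers
my @cells;       # text of every <td> that sits in a wanted column
my $col = 0;     # index of the current cell within its row
my $in  = 0;     # true while inside a <th> or <td>
my $buf = '';    # text collected for the current cell

while ( my $token = $p->get_token ) {
    if ( $token->is_start_tag('tr') ) {
        $col = 0;                      # new row: restart the column count
    }
    elsif ( $token->is_start_tag('th') or $token->is_start_tag('td') ) {
        $in  = 1;
        $buf = '';
    }
    elsif ( $token->is_text and $in ) {
        $buf .= $token->as_is;         # accumulate text inside the cell
    }
    elsif ( $token->is_end_tag('th') ) {
        $wanted{$col} = $buf if $buf =~ $pattern;
        $col++;
        $in = 0;
    }
    elsif ( $token->is_end_tag('td') ) {
        push @cells, $buf if exists $wanted{$col};
        $col++;
        $in = 0;
    }
}

print "$_\n" for @cells;
```

Tracking the column index rather than the header text keeps the per-cell test to a single hash lookup, which matters little for three rows but may help with 10,000 cells.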

    CAVEAT: simply because it looks like Perl,
    does not mean that I have written Perl, or that
    the code will actually work. It is merely a demonstration
    in algorithm creation.

    ciao
    drieux

    ---

    Drieux Guest

  5. #4

    Re: HTML::TokeParser::Simple

    Paul Kraus wrote:
    > Someone want to show me how this module can help parse out html?
    >
    > I want to grab the text between <td>text</td> and be able to apply a regexp to
    > get what I want.
    >
    > The problem is that my text is among 10,000 td tags, with the only difference
    > being what the <th> tag above has in it.
    >
    > So if th tag = then store text between <td> into an array.
    >
    > Paul
    Hi Paul,

    Sorry that earlier response was so dumb. I didn't connect the content with
    the subject line, I'm afraid. That may be because it included neither data
    illustration nor anything you had tried on your own. Anyway, I hope this
    makes up for my negligence a bit.

    I'm not sure that HTML::TokeParser::Simple adds anything to the
    functionality of HTML::TokeParser for your purposes [at least what you have
    described here]. The Simple part mostly has to do with making the tag types
    and attributes more transparent. I didn't see much in the docs about the
    data itself. Neither module seems all that user friendly, but I got
    something along that line working.

    With a simple table using headers:
    table_test.html:
    <html>
    <head>
    <title> HTML::TokeParser Test </title>
    </head>

    <body>
    <table rows=4 cols=3>
    <tr> <th> Key </th> <th> name </th> <th> Address </th> </tr>
    <tr>
    <td> 1 </td> <td> George </td> <td> farewell </td>
    </tr>
    <tr>
    <td> 2 </td> <td> Abe </td> <td> Gettysburg </td>
    </tr>
    <tr>
    <td> 3 </td> <td> Joseph </td> <td> E-Mail </td>
    </tr>
    </table>
    </body>
    </html>

    This [after many hours of near-misses] seemed to work:

    Greetings! E:\d_drive\perlStuff>perl -w -MHTML::TokeParser
    my $tp = HTML::TokeParser->new('table_test.html');
    my @fields;

    # Priming read: find the first header cell
    my $open_tag = $tp->get_tag('th');
    while ($open_tag and $open_tag->[0] ne '/tr') {
        if (my $test = $tp->get_text('/th')) {
            push @fields, $test;
        }
        # Advance even when a header cell is empty, so the loop cannot spin
        $open_tag = $tp->get_tag('th', '/tr');
    }

    # Each remaining row becomes a hash keyed by header text
    my @data;
    my $data_start = $tp->get_tag('tr');
    while ($data_start) {
        my $values = {};
        foreach (@fields) {
            $tp->get_tag('td');
            $values->{$_} = $tp->get_text('/td');
        }
        push @data, $values;
        $data_start = $tp->get_tag('tr');
    }

    foreach my $row (@data) {
        print "$_: $row->{$_}; " foreach keys %$row;
        print "\n";
    }
    ^Z
    Address : farewell ; name : George ; Key : 1 ;
    Address : Gettysburg ; name : Abe ; Key : 2 ;
    Address : E-Mail ; name : Joseph ; Key : 3 ;


    It simply would not come together until I dealt with holiday cooking and
    celebrations, though. The main problem I was having was because I had been
    trying to do too much in the control blocks of the while loops. These
    "shortcuts" kept creating situations where the loop would pass beyond the
    desired data and consume the whole file. Doing a priming round, and then
    doing a spare test of value in the loop condition helped a lot.

    Of course, you still have to have a way to pick the particular row that you
    want, a complication that you didn't mention.
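
A minimal sketch of one way to pick rows, assuming the rows are hash references shaped like the entries of @data above. The sample rows, `$field`, and `$pattern` are hypothetical values chosen for illustration. Note that get_text does not trim whitespace, so with the code above the keys and values would carry surrounding spaces unless get_trimmed_text is used instead; the sketch assumes trimmed text.

```perl
use strict;
use warnings;

# Hypothetical rows in the same shape as @data (array of hash refs),
# with keys and values already trimmed.
my @data = (
    { Key => 1, name => 'George', Address => 'farewell'   },
    { Key => 2, name => 'Abe',    Address => 'Gettysburg' },
    { Key => 3, name => 'Joseph', Address => 'E-Mail'     },
);

my $field   = 'name';                  # column to test (assumed)
my $pattern = qr/^(?:Abe|Joseph)$/;    # rows to keep (assumed)

# Keep only the rows whose chosen field matches the pattern
my @picked = grep { defined $_->{$field} and $_->{$field} =~ $pattern } @data;

# Pull one column out of the surviving rows
my @addresses = map { $_->{Address} } @picked;

print join( ', ', @addresses ), "\n";    # prints "Gettysburg, E-Mail"
```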

    Joseph



    R. Joseph Newton Guest

  6. #5

    Re: HTML::TokeParser::Simple

    I am using HTML::TokeParser for my project, where it extracts the text between <td> tags. The problem is that there are a number of tags inside the <td> tag.

    For example:

    HTML Code:
    <td  valign="top" width="364">
    
    	   <span class="detailsheader"><p>ABC</p></span>
    	   <div class=spacer></div>
    					<a href="../redirect/redirect.asp?firm_id=123708&url_clicks=1&url=www.abc.com" target="new">
    				<img border="0" src="http://www.coroflot.com/user_files/company_logos/1V3mEr0Da.jpg" alt=" &nbsp;></a><div class="spacer"><br></div>
    
    	  <!--Description-->
    	  <div class=spacer></div>
    			ABCDEFGHIJKLMNOPQRSTUVWXYZ
    
    	  <div class=spacer><br></div>
    	  <!--Awards-->
    		
    
    
    	  <!--Professional Affiliations-->
    		
    			<span class="detailsheader"><p>AFFILIATIONS</p></span>
    			
    		</td>
    When I use this, it gives me an answer with [IMG], because the first tag is an image tag. I checked the documentation, which says to use $p->{textify}, but I don't know how to use it. Please guide me.
    Unregistered Guest
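
A possible approach to the textify question, offered as a sketch rather than a definitive answer. As far as the HTML::TokeParser documentation describes it, $p->{textify} is a hash mapping tag names to the attribute whose value should stand in for the tag as text, with img and applet mapped to "alt" by default; when that attribute is missing, get_text substitutes the bracketed tag name, which is where [IMG] comes from. Assigning an empty hash should switch the substitution off. The inline HTML below is a shortened, hypothetical stand-in for the cell in the question.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Shortened, hypothetical version of the table cell from the question
my $html = <<'HTML';
<table><tr><td valign="top" width="364">
<span class="detailsheader"><p>ABC</p></span>
<a href="http://www.example.com/" target="new"><img border="0" src="logo.jpg"></a>
ABCDEFGHIJKLMNOPQRSTUVWXYZ
</td></tr></table>
HTML

my $p = HTML::TokeParser->new( \$html ) or die "Cannot parse document";

# With an empty hash no tag is turned into placeholder text such as [IMG]
$p->{textify} = {};

$p->get_tag('td');

# Collect all text up to the closing </td>, skipping over nested tags;
# get_trimmed_text also collapses runs of whitespace
my $text = $p->get_trimmed_text('/td');

print "$text\n";
```

If the alt text of images is wanted when present, but nothing otherwise, the documentation also allows a code reference as the hash value; its return value is then used as the text.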

