Paul Kraus #1
Html::tokeparser::simple
Someone want to show me how this module can help parse out html?
I want to grab text between <td>text</td> being able to apply a regexp to
get what I want.
The problem is my text is among 10,000 td tags. With the only difference
being what the above <th> tag has in it.
So if th tag = then store text between <td> into an array.
Paul
R. Joseph Newton #2
Re: Html::tokeparser::simple
Paul Kraus wrote:
> Someone want to show me how this module can help parse out html?
>
> I want to grab text between <td>text</td> being able to apply a regexp to
> get what I want.
>
> The problem is my text is among 10,000 td tags. With the only difference
> being what the above <th> tag has in it.
>
> So if th tag = then store text between <td> into an array.
>
> Paul
Have you looked into HTML::TokeParser? Might be a good place to start.
Joseph
Drieux #3
Re: Html::tokeparser::simple
On Wednesday, Nov 26, 2003, at 12:30 US/Pacific, Paul Kraus wrote:
> Someone want to show me how this module can help parse out html?
>
> I want to grab text between <td>text</td> being able to apply a regexp to
> get what I want.
>
> The problem is my text is among 10,000 td tags. With the only difference
> being what the above <th> tag has in it.
>
> So if th tag = then store text between <td> into an array.
my first concern here is did you mean <th> or <tr>?
a simple table would look like:
<table>
<tr>
<th>header1</th>
<th>header2</th>
<th>header3</th>
</tr>
<tr>
<td>_Row_1_Cell_1_</td>
<td>_Row_1_Cell_2_</td>
<td>_Row_1_Cell_3_</td>
</tr>
<tr>
<td>_Row_2_Cell_1_</td>
<td>_Row_2_Cell_2_</td>
<td>_Row_2_Cell_3_</td>
</tr>
<tr>
<td>_Row_3_Cell_1_</td>
<td>_Row_3_Cell_2_</td>
<td>_Row_3_Cell_3_</td>
</tr>
</table>
You have almost written your algorithm:

    use HTML::TokeParser::Simple;

    # placeholder file name and pattern, substitute your own
    my $p       = HTML::TokeParser::Simple->new( file => 'table.html' );
    my $pattern = qr/header2/;

    # skip ahead to the opening table tag
    while ( my $token = $p->get_token ) {
        last if $token->is_start_tag('table');
    }

    # there is a Table opening Tag, our hope now is that
    # we can get our Keys from the headers
    my $count  = 0;
    my $header = {};
    while ( my $token = $p->get_token ) {
        next if $token->is_start_tag( qr/^t[rh]$/ );   # don't care
        last if $token->is_end_tag('/tr');             # finished with headers
        if ( $token->is_end_tag('/th') ) {             # one header cell done
            $count++;
            next;
        }
        if ( $token->is_text ) {
            my $text = $token->as_is;
            $header->{$count} = $text
                if $text =~ $pattern;
        }
    }
    #
    # read the first row of headers, now to meander forward
    #

At this point we know that if

    defined( $header->{$count} )

then this is a column we have to grot data from into the storage set up,
and that would be basically like the way that we grotted out the header
sections, which is left as an exercise for the reader.
CAVEAT: simply because it looks like Perl,
does not mean that I have written Perl, or that
the code will actually work. It is merely a demonstration
in algorithm creation.
ciao
drieux
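A minimal sketch of the exercise left above, assuming HTML::TokeParser::Simple is installed. The inline table, the `$wanted` pattern, and the variable names are all made up for illustration; the idea is simply to remember which column numbers had a matching <th> and then keep the <td> text that falls in those columns.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical sample data, shaped like the table in the post above.
my $html = <<'HTML';
<table>
<tr><th>header1</th><th>header2</th><th>header3</th></tr>
<tr><td>_Row_1_Cell_1_</td><td>_Row_1_Cell_2_</td><td>_Row_1_Cell_3_</td></tr>
<tr><td>_Row_2_Cell_1_</td><td>_Row_2_Cell_2_</td><td>_Row_2_Cell_3_</td></tr>
</table>
HTML

my $wanted = qr/header2/;    # assumption: identifies the <th> we care about
my $p      = HTML::TokeParser::Simple->new( string => $html );

my %is_wanted;       # column index => 1 when its header matched $wanted
my @cells;           # text of every <td> sitting under a wanted column
my $col     = -1;    # index of the current cell within its row
my $in_cell = 0;     # true while between <th>/<td> and its end tag
my $text    = '';    # text gathered for the current cell

while ( my $token = $p->get_token ) {
    if ( $token->is_start_tag('tr') ) {
        $col = -1;                          # new row: restart the column count
    }
    elsif ( $token->is_start_tag('th') || $token->is_start_tag('td') ) {
        $col++;
        $text    = '';
        $in_cell = 1;
    }
    elsif ( $token->is_text ) {
        $text .= $token->as_is if $in_cell; # ignore whitespace between tags
    }
    elsif ( $token->is_end_tag('th') ) {
        $is_wanted{$col} = 1 if $text =~ $wanted;
        $in_cell = 0;
    }
    elsif ( $token->is_end_tag('td') ) {
        push @cells, $text if $is_wanted{$col};
        $in_cell = 0;
    }
}

print "$_\n" for @cells;
```

With the sample table this collects the two cells from the second column. Once the cells are in `@cells`, any further regexp can be applied to them with an ordinary `grep`.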
R. Joseph Newton #4
Re: Html::tokeparser::simple
Paul Kraus wrote:
> Someone want to show me how this module can help parse out html?
>
> I want to grab text between <td>text</td> being able to apply a regexp to
> get what I want.
>
> The problem is my text is among 10,000 td tags. With the only difference
> being what the above <th> tag has in it.
>
> So if th tag = then store text between <td> into an array.
>
> Paul
Hi Paul,
Sorry that earlier response was so dumb. I didn't connect the content with
the subject line, I'm afraid. That may be because it included neither data
illustration nor anything you had tried on your own. Anyway, I hope this
makes up for my negligence a bit.
I'm not sure that HTML::TokeParser::Simple adds anything to the
functionality of HTML::TokeParser for your purposes [at least what you have
described here]. The Simple part mostly has to do with making the tag types
and attributes more transparent. I didn't see much in the docs about the
data itself. Neither module seems all that user friendly, but I got
something along that line working.
With a simple table using headers:
table_test.html:
<html>
<head>
<title> HTML::TokeParser Test </title>
</head>
<body>
<table rows=4 cols=3>
<tr> <th> Key </th> <th> name </th> <th> Address </th> </tr>
<tr>
<td> 1 </td> <td> George </td> <td> farewell </td>
</tr>
<tr>
<td> 2 </td> <td> Abe </td> <td> Gettysburg </td>
</tr>
<tr>
<td> 3 </td> <td> Joseph </td> <td> E-Mail </td>
</tr>
</table>
This [after many hours of near-misses] seemed to work:
Greetings! E:\d_drive\perlStuff>perl -w -MHTML::TokeParser
my $tp = HTML::TokeParser->new('table_test.html');
my @fields;
# priming read: find the first header cell
my $open_tag = $tp->get_tag('th');
# collect header text until the </tr> that closes the header row
while ($open_tag and $open_tag->[0] ne '/tr') {
    my $test = $tp->get_text('/th');
    push @fields, $test if $test;
    # always advance, so an empty header cannot stall the loop
    $open_tag = $tp->get_tag('th', '/tr');
}
my @data;
# priming read again: find the first data row
my $data_start = $tp->get_tag('tr');
while ($data_start) {
    # one hash per row, keyed by the header text
    my $values = {};
    foreach (@fields) {
        $tp->get_tag('td');
        $values->{$_} = $tp->get_text('/td');
    }
    push @data, $values;
    $data_start = $tp->get_tag('tr');
}
foreach my $row (@data) {
    print "$_: $row->{$_}; " foreach keys %$row;
    print "\n";
}
^Z
Address : farewell ; name : George ; Key : 1 ;
Address : Gettysburg ; name : Abe ; Key : 2 ;
Address : E-Mail ; name : Joseph ; Key : 3 ;
It simply would not come together until I dealt with holiday cooking and
celebrations, though. The main problem I was having was because I had been
trying to do too much in the control blocks of the while loops. These
"shortcuts" kept creating situations where the loop would pass beyond the
desired data and consume the whole file. Doing a priming round, and then
doing a spare test of value in the loop condition helped a lot.
Of course, you still have to have a way to pick the particular row that you
want, a complication that you didn't mention.
Joseph
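One possible way to pick a particular row, sketched under some assumptions: the table is the same test data inlined as a string, `$name_pattern` is a made-up selection rule, and get_trimmed_text is swapped in for get_text so the hash keys and values carry no stray spaces. Rows are still read into hashes keyed by header, and a `grep` then does the choosing.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# The test table from the post above, inlined so the sketch is self-contained.
my $html = <<'HTML';
<table>
<tr> <th> Key </th> <th> name </th> <th> Address </th> </tr>
<tr> <td> 1 </td> <td> George </td> <td> farewell </td> </tr>
<tr> <td> 2 </td> <td> Abe </td> <td> Gettysburg </td> </tr>
<tr> <td> 3 </td> <td> Joseph </td> <td> E-Mail </td> </tr>
</table>
HTML

my $tp = HTML::TokeParser->new( \$html ) or die "cannot create parser";

# Header row: take each <th> until the </tr> that ends the row.
my @fields;
while ( my $tag = $tp->get_tag( 'th', '/tr' ) ) {
    last if $tag->[0] eq '/tr';
    push @fields, $tp->get_trimmed_text('/th');
}

# Data rows: one hash per <tr>, keyed by the trimmed header text.
my @rows;
while ( $tp->get_tag('tr') ) {
    my %row;
    for my $field (@fields) {
        $tp->get_tag('td') or last;
        $row{$field} = $tp->get_trimmed_text('/td');
    }
    push @rows, \%row if %row;
}

# Assumption for illustration: we want the row whose name is exactly "Abe".
my $name_pattern = qr/^Abe$/;
my @picked = grep { $_->{name} =~ $name_pattern } @rows;

print "$_->{Key}: $_->{name}, $_->{Address}\n" for @picked;
```

Because the selection happens after parsing, the same `@rows` can be filtered on any column without touching the loops that read the table.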
Unregistered #5
Re: Html::tokeparser::simple
I am using HTML::TokeParser for my project, where it finds the text between <td> tags. But the problem is that there are a number of tags inside the <td> tag.
For example:
HTML Code:
<td valign="top" width="364">
<span class="detailsheader"><p>ABC</p></span>
<div class=spacer></div>
<a href="../redirect/redirect.asp?firm_id=123708&url_clicks=1&url=www.abc.com" target="new">
<img border="0" src="http://www.coroflot.com/user_files/company_logos/1V3mEr0Da.jpg" alt=" ></a><div class="spacer"><br></div>
<!--Description-->
<div class=spacer></div>
ABCDEFGHIJKLMNOPQRSTUVWXYZ
<div class=spacer><br></div>
<!--Awards-->
<!--Professional Affiliations-->
<span class="detailsheader"><p>AFFILIATIONS</p></span>
</td>
When I use this it gives me an answer with [IMG], because the first tag is an image tag. I checked the documentation, which says to use $p->{textify}, but I don't know how to use it. Please guide me.
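A rough sketch of how $p->{textify} behaves, using a cut-down and tidied version of the cell above (the markup, the example.com link, and the first_cell_text helper are made up for illustration). By default HTML::TokeParser maps img to its alt attribute, and when there is no alt it substitutes the tag name in brackets, which is where [IMG] comes from. Assigning an empty hash to $p->{textify} turns that substitution off, and passing '/td' as the end tag makes get_trimmed_text read past the nested tags.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Simplified, hypothetical stand-in for the cell in the question.
my $html = <<'HTML';
<table><tr>
<td valign="top">
  <span class="detailsheader"><p>ABC</p></span>
  <a href="http://www.example.com/"><img border="0" src="logo.jpg"></a>
  <!--Description-->
  ABCDEFGHIJKLMNOPQRSTUVWXYZ
</td>
</tr></table>
HTML

# Return the trimmed text of the first <td>, optionally overriding textify.
sub first_cell_text {
    my ( $doc, $textify ) = @_;
    my $p = HTML::TokeParser->new( \$doc );
    $p->{textify} = $textify if $textify;
    $p->get_tag('td') or return;
    # '/td' as the end tag: nested tags are skipped, only text is kept
    return $p->get_trimmed_text('/td');
}

my $default = first_cell_text($html);        # default map: img => 'alt'
my $plain   = first_cell_text( $html, {} );  # empty map: nothing textified

print "default: $default\n";    # contains [IMG], the image has no alt text
print "plain:   $plain\n";      # no [IMG] placeholder
```

The map can also be narrowed rather than emptied, for example keeping only the tags whose alt text is actually wanted in the output.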
gao6530 #6
Re: Html::tokeparser::simple
Of course, you still have to have a way to pick the particular row that you
want, a complication that you didn't mention.