Greg Bacon #1
Re: Find string in web page
In article <4628ab88.0307091019.17e73755@posting.google.com>,
Kirk Larsen <spamme@kirklarsen.com> wrote:
: Sounds simple enough. I need to retrieve the source from a web page
: and then find a link in that web page that ends with a string which I
: have stored in a variable. Can someone please post or direct me to a
: sample of how to do this? Thanks!
Try this on for size:
% cat try
#! /usr/local/bin/perl

use strict;
use warnings;

use HTML::Parser;
use LWP::UserAgent;
use URI::URL;
use Data::Dumper;

# Returns a parser plus a closure that yields the collected [text, href] pairs.
sub make_parser {
    my $inside;
    my %attr;
    my $text;
    my @links;

    # Sanity-check the current link, store it, and reset the state.
    my $record = sub {
        my $state = Dumper({
            inside => $inside,
            attr   => \%attr,
            text   => $text,
        });

        my @cond = (
            [ sub { $inside },     "not inside" ],
            [ sub { %attr },       "no attr"    ],
            [ sub { $attr{href} }, "no href"    ],
        );

        my $ok = 1;
        for (@cond) {
            my($check,$msg) = @$_;
            unless ($check->()) {
                warn "$0: $msg:\n$state";
                $ok = 0;
            }
        }

        push @links => [ $text || '<empty>', $attr{href} ] if $ok;

        $inside = 0;
        %attr   = ();
        $text   = '';
    };

    # <a href=...>: start collecting text for this link
    my $start_h = sub {
        my $tag = shift;
        return unless $tag eq 'a';

        if ($inside) {
            warn "$0: already inside";
            $record->();
        }

        my $attr = shift;
        return unless $attr->{href};

        %attr = %$attr;
        $inside = 1;
    };

    my $text_h = sub {
        return unless $inside;
        $text .= shift;
    };

    # </a>: the link is complete, so record it
    my $end_h = sub {
        my $tag = shift;
        return unless $tag eq 'a';
        return unless $inside;
        $record->();
    };

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ $start_h, "tagname, attr" ],
        text_h  => [ $text_h,  "dtext" ],
        end_h   => [ $end_h,   "tagname" ],
    );

    ($p, sub { @links });
}

sub usage () { "Usage: $0 search-pattern\n" }

## main
die usage unless @ARGV;

my $pat = shift;
my $lookfor = eval { qr/$pat/ };
die "$0: bad pattern: $pat" unless $lookfor;

my $url = "http://www.cpan.org/";
my $ua = LWP::UserAgent->new;
my($p,$links) = make_parser;

# Request document and parse it as it arrives
my $res = $ua->request(
    HTTP::Request->new(GET => $url),
    sub { $p->parse($_[0]) }
);
$p->eof;    # signal end of document so the parser flushes any buffered text

my $base = $res->base;
for ($links->()) {
    my($text,$href) = @$_;
    next unless $text =~ /$lookfor$/;    # match at the end of the link text

    my $url = url($href, $base)->abs;    # resolve relative links
    $text =~ s/\s+/ /g;
    print "$text:\n $url\n";
}
% ./try 's$'
Perl modules:
 http://www.cpan.org/modules/index.html
Perl scripts:
 http://www.cpan.org/scripts/index.html
Perl recent arrivals:
 http://www.cpan.org/RECENT.html
CPAN sites:
 http://www.cpan.org/SITES.html
CPAN sites:
 http://mirrors.cpan.org/
CPAN modules, distributions, and authors:
 http://search.cpan.org/
CPAN Frequently Asked Questions:
 http://www.cpan.org/misc/cpan-faq.html
Perl Mailing Lists:
 http://lists.cpan.org/
Perl Bookmarks:
 http://bookmarks.cpan.org/
% ./try '('
./try: bad pattern: ( at ./try line 95.
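The second run above fails because '(' is not a valid regular expression. As a minimal sketch of that check in isolation, the following wraps the eval/qr idiom in a helper and also shows how a plain string (rather than a pattern) could be matched at the end of a link. The helper names compile_pattern and ends_with are hypothetical and are not part of the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: compile a user-supplied pattern, returning undef
# (instead of dying) when the pattern is not a valid regex.
sub compile_pattern {
    my ($pat) = @_;
    my $re = eval { qr/$pat/ };
    return $re;
}

# Hypothetical helper: build a regex that matches a literal string at the
# end of the target. quotemeta escapes characters such as "." and "(".
sub ends_with {
    my ($literal) = @_;
    my $quoted = quotemeta $literal;
    return qr/$quoted$/;
}

for my $pat ('s$', '(') {
    my $re = compile_pattern($pat);
    print defined $re ? "ok: $pat\n" : "bad pattern: $pat\n";
}

my $re = ends_with('man.cgi');
for my $href ('http://x/man.cgi', 'http://x/manXcgi') {
    print "$href ", ($href =~ $re ? "matches" : "does not match"), "\n";
}
```

If the search string comes from a variable and is meant literally, the quotemeta variant is probably the safer choice, since it cannot fail to compile.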
Hope this helps,
Greg
--
In a system of full capitalism, there should be (but, historically, has not
yet been) a complete separation of state and economics, in the same way and
for the same reasons as the separation of state and church.
-- Ayn Rand
Greg Bacon #2
Re: Find string in web page
In article <4628ab88.0307100504.3c4f6f9e@posting.google.com>,
Kirk Larsen <spamme@kirklarsen.com> wrote:
: Can't seem to get it to work. It just outputs nothing. Am I doing
: something wrong, or is there another way? I did print out my search
: string var and verified that it is in the source I'm searching, so
: that's not the problem. Thanks again!
Out of the box, does the code produce the same output as shown in
my followup?
What are you looking for? It looks like I was forcing the match to
be at the end:

    next unless $text =~ /$lookfor$/;

If you don't want to look at the end, change that to

    next unless $text =~ /$lookfor/;
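
To see the difference between the two forms, here is a small self-contained sketch. The filter_texts helper and the sample link texts are made up for illustration and do not come from the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: return the strings that match $pat, optionally
# requiring the match to sit at the end of the string.
sub filter_texts {
    my ($pat, $at_end, @texts) = @_;
    my $re = $at_end ? qr/$pat$/ : qr/$pat/;
    return grep { $_ =~ $re } @texts;
}

# Made-up link texts, for illustration only.
my @texts = ("Perl modules", "CPAN sites", "Perl recent arrivals", "About CPAN");

print "anchored:   $_\n" for filter_texts('CPAN', 1, @texts);
print "unanchored: $_\n" for filter_texts('CPAN', 0, @texts);
```

With the trailing $ only "About CPAN" is kept; without it, "CPAN sites" is kept as well. Note that both forms test the link text, not the href, so a search string that only appears in the URL would match neither.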
It would also help if you showed your code, but, as always with
Usenet, cutting-and-pasting megabytes of source code isn't useful.
Greg
--
The greatest dangers to liberty lurk in insidious encroachment by men
of zeal, well-meaning but without understanding.
-- Justice Louis D. Brandeis
Mina Naguib #3
Re: Find string in web page
Kirk Larsen wrote:
> Sounds simple enough. I need to retrieve the source from a web page
use LWP::Simple;
> and then find a link in that web page that ends with a string which I
> have stored in a variable.
There are a few ways to do this. I prefer HTML::TokeParser;
> Can someone please post or direct me to a
> sample of how to do this? Thanks!
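use strict;
use warnings;
use LWP::Simple;
use HTML::TokeParser;

my $url   = 'http://www.freebsd.org';
my $match = 'man.cgi';

my $document = get($url) or die "Failed to retrieve document\n";
my $parser   = HTML::TokeParser->new(\$document);

# get_tag("a") returns [$tag, \%attr, ...] for each <a> start tag
while (my $token = $parser->get_tag("a")) {
    my $href = $token->[1]{href};
    next unless defined $href;    # skip anchors without an href
    # \Q...\E makes $match a literal string, so "." is not a wildcard
    if ($href =~ /\Q$match\E$/) {
        print "I matched $href\n";
    }
}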
For more information, see http://search.cpan.org/dist/HTML-Parser/lib/HTML/TokeParser.pm and
http://search.cpan.org/dist/libwww-perl/lib/LWP/Simple.pm.
Note that links are often relative, which means you'll often get a link to "something.html" instead
of "http://domain.com/dir/something.html". To determine the full URL to request next, you need to
resolve the link against the URL of the page it came from, taking into account any "../" or "./"
segments. The URI module's new_abs method can do this for you.
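
As a minimal sketch of that resolution step, the following uses URI->new_abs from the URI distribution (which LWP already depends on). The absolutize helper, the base URL, and the sample hrefs are assumptions chosen for illustration.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: resolve $href against the URL of the page it was
# found on and return the absolute URL as a string.
sub absolutize {
    my ($href, $base_url) = @_;
    return URI->new_abs($href, $base_url)->as_string;
}

# Made-up base URL and links, for illustration only.
my $base = 'http://domain.com/dir/page.html';

for my $href ('something.html', '../other/index.html', '/top.html', 'http://example.org/') {
    print "$href => ", absolutize($href, $base), "\n";
}
```

Relative links are resolved against the directory of the base URL, "../" segments are collapsed, and links that are already absolute come back unchanged. When the page was fetched with LWP::UserAgent, the response's base method is a reasonable source for the base URL.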