Find string in web page

  1. #1

    Re: Find string in web page

    In article <4628ab88.0307091019.17e73755@posting.google.com>,
    Kirk Larsen <spamme@kirklarsen.com> wrote:

    : Sounds simple enough. I need to retrieve the source from a web page
    : and then find a link in that web page that ends with a string which I
    : have stored in a variable. Can someone please post or direct me to a
    : sample of how to do this? Thanks!

    Try this on for size:

    % cat try
    #! /usr/local/bin/perl

    use strict;
    use warnings;

    use HTML::Parser;
    use LWP::UserAgent;
    use URI::URL;
    use Data::Dumper;

    sub make_parser {
        my $inside;
        my %attr;
        my $text;
        my @links;

        my $record = sub {
            my $state = Dumper {
                inside => $inside,
                attr => \%attr,
                text => $text,
            };

            my @cond = (
                [ sub { $inside }, "not inside" ],
                [ sub { %attr }, "no attr" ],
                [ sub { $attr{href} }, "no href" ],
            );

            my $ok = 1;
            for (@cond) {
                my($check,$msg) = @$_;

                unless ($check->()) {
                    warn "$0: $msg:\n$state ";
                    $ok = 0;
                }
            }

            push @links => [ $text || '<empty>', $attr{href} ] if $ok;

            $inside = 0;
            %attr = ();
            $text = '';
        };

        my $start_h = sub {
            my $tag = shift;
            return unless $tag eq 'a';

            if ($inside) {
                warn "$0: already inside";
                $record->();
            }

            my $attr = shift;
            return unless $attr->{href};

            %attr = %$attr;
            $inside = 1;
        };

        my $text_h = sub {
            return unless $inside;

            $text .= shift;
        };

        my $end_h = sub {
            my $tag = shift;
            return unless $tag eq 'a';

            return unless $inside;

            $record->();
        };

        my $p = HTML::Parser->new(
            api_version => 3,
            start_h => [ $start_h, "tagname, attr" ],
            text_h => [ $text_h, "dtext" ],
            end_h => [ $end_h, "tagname" ],
        );

        ($p, sub { @links });
    }

    sub usage () { "Usage: $0 search-pattern\n" }

    ## main
    die usage unless @ARGV;

    my $pat = shift;
    my $lookfor = eval { qr/$pat/ };
    die "$0: bad pattern: $pat" unless $lookfor;

    my $url = "http://www.cpan.org/";
    my $ua = LWP::UserAgent->new;

    my($p,$links) = make_parser;

    # Request document and parse it as it arrives
    my $res = $ua->request(
        HTTP::Request->new(GET => $url),
        sub { $p->parse($_[0]) }
    );
    $p->eof;    # signal end of document so any buffered text is flushed

    my $base = $res->base;
    for ($links->()) {
        my($text,$href) = @$_;

        next unless $text =~ /$lookfor$/;

        my $url = url($href, $base)->abs;

        $text =~ s/\s+/ /g;
        print "$text:\n $url\n";
    }
    % ./try 's$'
    Perl modules:
     http://www.cpan.org/modules/index.html
    Perl scripts:
     http://www.cpan.org/scripts/index.html
    Perl recent arrivals:
     http://www.cpan.org/RECENT.html
    CPAN sites:
     http://www.cpan.org/SITES.html
    CPAN sites:
     http://mirrors.cpan.org/
    CPAN modules, distributions, and authors:
     http://search.cpan.org/
    CPAN Frequently Asked Questions:
     http://www.cpan.org/misc/cpan-faq.html
    Perl Mailing Lists:
     http://lists.cpan.org/
    Perl Bookmarks:
     http://bookmarks.cpan.org/
    % ./try '('
    ./try: bad pattern: ( at ./try line 95.
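
    The last run dies because a lone ( is not a valid regular expression.
    If the search string is meant to be taken literally rather than as a
    pattern, one option is to escape it with quotemeta before compiling.
    A minimal sketch; the helper name literal_suffix_re and the sample
    strings are made up for illustration and are not part of the script
    above:

```perl
use strict;
use warnings;

# Hypothetical helper: compile a regex that matches a literal string
# at the end of its target, with regex metacharacters escaped.
sub literal_suffix_re {
    my ($string) = @_;
    my $escaped = quotemeta $string;
    return qr/$escaped$/;
}

# A lone "(" now compiles, because it is escaped to "\(".
my $re = literal_suffix_re('(');

for my $candidate ('index.html', 'list (old)', 'faq.html(') {
    my $verdict = $candidate =~ $re ? 'match' : 'no match';
    print "$candidate: $verdict\n";
}
```

    Writing /\Q$pat\E$/ inline has the same effect as calling quotemeta
    first.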

    Hope this helps,
    Greg
    --
    In a system of full capitalism, there should be (but, historically, has not
    yet been) a complete separation of state and economics, in the same way and
    for the same reasons as the separation of state and church.
    -- Ayn Rand
    Greg Bacon Guest

  3. #2

    Re: Find string in web page

    In article <4628ab88.0307100504.3c4f6f9e@posting.google.com>,
    Kirk Larsen <spamme@kirklarsen.com> wrote:

    : Can't seem to get it to work. It just outputs nothing. Am I doing
    : something wrong, or is there another way? I did print out my search
    : string var and verified that it is in the source I'm searching, so
    : that's not the problem. Thanks again!

    Out of the box, does the code produce the same output as shown in
    my followup?

    What are you looking for? It looks like I was forcing the match to
    be at the end:

    next unless $text =~ /$lookfor$/;

    If you don't want to look at the end, change that to

    next unless $text =~ /$lookfor/;
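
    To see what the trailing $ changes, here is a small sketch. The two
    helper subs and the sample link texts are hypothetical, chosen only
    to show one string that matches both ways and one that matches only
    without the anchor:

```perl
use strict;
use warnings;

# Hypothetical helpers, named here only for illustration.
sub matches_at_end {
    my ($text, $re) = @_;
    return $text =~ /$re$/ ? 1 : 0;    # pattern must end the string
}

sub matches_anywhere {
    my ($text, $re) = @_;
    return $text =~ /$re/ ? 1 : 0;     # pattern may appear anywhere
}

my $lookfor = qr/modules/;

for my $text ('Perl modules', 'CPAN modules, distributions, and authors') {
    printf "%s: at end=%d, anywhere=%d\n",
        $text, matches_at_end($text, $lookfor), matches_anywhere($text, $lookfor);
}
```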

    It would also help if you showed your code, but, as always with
    Usenet, cutting-and-pasting megabytes of source code isn't useful.

    Greg
    --
    The greatest dangers to liberty lurk in insidious encroachment by men
    of zeal, well-meaning but without understanding.
    -- Justice Louis D. Brandeis
    Greg Bacon Guest

  4. #3

    Re: Find string in web page

    -----BEGIN PGP SIGNED MESSAGE-----
    Hash: SHA1

    Kirk Larsen wrote:
    > Sounds simple enough. I need to retrieve the source from a web page
    use LWP::Simple;
    > and then find a link in that web page that ends with a string which I
    > have stored in a variable.
    There are a few ways to do this. I prefer HTML::TokeParser.
    > Can someone please post or direct me to a
    > sample of how to do this? Thanks!

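    use strict;
    use warnings;
    use LWP::Simple;
    use HTML::TokeParser;

    my $url = 'http://www.freebsd.org';
    my $match = 'man.cgi';

    # get() returns undef on failure
    my $document = get($url);
    defined $document or die "Failed to retrieve document\n";

    my $parser = HTML::TokeParser->new(\$document);

    # get_tag returns [$tag, $attr, $attrseq, $text] for each <a> start tag
    while (my $token = $parser->get_tag("a")) {
        my $href = $token->[1]{href};
        next unless defined $href;    # skip anchors without an href

        # \Q...\E makes the "." in man.cgi match a literal dot
        if ($href =~ /\Q$match\E$/) {
            print "I matched $href\n";
        }
    }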

    For more information, see http://search.cpan.org/dist/HTML-Parser/lib/HTML/TokeParser.pm and
    http://search.cpan.org/dist/libwww-perl/lib/LWP/Simple.pm.

    Note that links are often relative, which means you'll often get a link to "something.html" instead
    of "http://domain.com/dir/something.html". It'll be up to you to extrapolate the domain and
    directory structure of the original URL (and append to it the link data, as well as possibly take
    into account any ../.././ calls) to determine the full URL to call next.
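
    One way to avoid doing that resolution by hand is the URI module's new_abs
    constructor, which resolves a link against a base URL. A minimal sketch, assuming
    the URI module is installed; the base URL and hrefs below are hypothetical, and with
    LWP::Simple the base would be the URL you originally fetched (redirects are not
    accounted for here):

```perl
use strict;
use warnings;
use URI;

# Hypothetical page URL and links, for illustration only.
my $base = 'http://www.example.com/dir/index.html';

my @hrefs = (
    'something.html',             # relative to the current directory
    '../other/page.html',         # climbs one directory
    '/top.html',                  # relative to the site root
    'http://www.example.org/',    # already absolute, left unchanged
);

for my $href (@hrefs) {
    my $abs = URI->new_abs($href, $base);
    print "$href -> $abs\n";
}
```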

    -----BEGIN PGP SIGNATURE-----
    Version: GnuPG v1.2.1 (GNU/Linux)
    Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

    iD8DBQE/DkfieS99pGMif6wRApEdAJwIJrCRTLNOgtsxCSUYCY7NyO6/AgCZATFH
    cc0PEq+mFhTbBDrQ/79fah4=
    =/K0i
    -----END PGP SIGNATURE-----

    Mina Naguib Guest
