
  1. #1

    extract from html

    hi,
    how can i extract the number between text1 and text2 in input.html,
    only the first time it occurs, ignoring the rest?
    preferably input.html would be a URL that stops downloading once a
    match has occurred, which would save a lot of bandwidth.
    i guess HTML::Parser would provide an option to work with a file while
    it's downloading (?)

    example
    ----

    input.html:

    bla..
    text1 555 text2
    bla
    bla
    text1 6000 text2
    bla
    EOF


    output.txt
    555


    thanks for your help,
    peter
    Lydia Shawn Guest

  3. #2

    Re: extract from html

    Hello Peter,
    > how can i extract the number between text1 and text2 in input.html
    > only the first time it occurs ignoring the rest?
    > preferably input.html would be a URL that stops downloading once a
    > match has occurred, that would save a lot of bandwidth..
    > i guess HTML::Parser would provide an option to work with a file while
    > it's downloading (?)
    >
    > example
    > ----
    >
    > input.html:
    >
    > bla..
    > text1 555 text2
    > bla
    > bla
    > text1 6000 text2
    > bla
    > EOF
    >
    >
    > output.txt
    > 555
    Assuming you mean 'text1' and 'text2' are html tags, then the following
    example, (which is straight out of the HTML::Parser documentation), will
    do it for you. This example prints out the title text of a html page if
    you supply the page as a filename on the command line, so just change
    the word "title" to the tag name you require:


    #!/usr/bin/perl

    use strict;
    use warnings;
    use HTML::Parser ();

    # Called for every start tag; ignores everything except <title>.
    sub start_handler
    {
        return if shift ne "title";
        my $self = shift;
        # Inside <title>: print the decoded text, then stop at </title>.
        $self->handler(text => sub { print shift }, "dtext");
        $self->handler(end  => sub { shift->eof if shift eq "title"; },
                       "tagname,self");
    }

    my $p = HTML::Parser->new(api_version => 3);
    $p->handler( start => \&start_handler, "tagname,self");
    $p->parse_file(shift || die) || die $!;
    print "\n";

    Simon Taylor Guest

  4. #3

    Re: extract from html


    [ comp.lang.perl is not a Newsgroup. Removed ]


    Lydia Shawn <apfeloma@hotmail.com> wrote:

    > Subject: extract from html

    Your post is not about extracting from HTML at all, so that
    seems a strange choice of Subject...

    > how can i extract the number between text1 and text2 in input.html
    > only the first time it occurs ignoring the rest?
    > input.html:
    >
    > bla..
    > text1 555 text2
    > bla
    > bla
    > text1 6000 text2
    > bla
    > EOF

    No HTML there!

    If you read it all into a scalar, then you can just do this pattern
    match on the scalar:

    /text1 (.*?) text2/
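
    A minimal sketch of that approach, using the sample data from the
    question. The helper name first_between and the \d+ capture (in place
    of .*?) are assumptions made for illustration. Without /g the match
    stops at the first occurrence, so the later 6000 is never looked at.

```perl
use strict;
use warnings;

# Return the first number found between "text1" and "text2",
# or undef when the text contains no such pair.
sub first_between {
    my ($content) = @_;
    # No /g modifier: the regex engine stops at the first match.
    return $content =~ /text1\s+(\d+)\s+text2/ ? $1 : undef;
}

# Sample input from the question, held in a single scalar.
my $content = join '',
    "bla..\n", "text1 555 text2\n", "bla\n",
    "bla\n", "text1 6000 text2\n", "bla\n";

print first_between($content), "\n";   # prints 555
```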


    --
    Tad McClellan SGML consulting
    tadmc@augustmail.com Perl programming
    Fort Worth, Texas
    Tad McClellan Guest

  5. #4

    Re: extract from html


    "Lydia Shawn" <apfeloma@hotmail.com> wrote in message
    news:1240b4dc.0308051647.685dde59@posting.google.com...
    > hi,
    > how can i extract the number between text1 and text2 in input.html
    > only the first time it occurs ignoring the rest?
    I would solve this with a hash. Use each term as a unique key: a
    repeated term would simply overwrite the earlier entry, or you can
    check whether the term already exists before storing it.

    # $term is taken from your text, in between text1 and text2
    if (exists $myHash{$term}) {
        # already seen: ignore
    }
    else {
        $myHash{$term} = $value;
    }

    As for the rest of your question: I don't know ... sorry
    > thanks for your help,
    > peter
    no prob...but what is your real name ?
    "Lydia Shawn" or Peter :-)

    HTH
    greets Michael


    Michael Korte Guest

  6. #5

    Re: extract from html

    > Your post is not about extracting from HTML at all, so that
    > seems a strange choice of Subject...
    >
    >>
    > No HTML there!
    >
    > If you read it all into a scalar, then you can just do this pattern
    > match on the scalar:
    >
    > /text1 (.*?) text2/
    yes there is no html in my example,
    my question is more about how HTML::Parser can work with a
    file and match things as the file is coming in, stopping after
    the first match has occurred, to prevent needless downloading. how can
    i do that?
    thanks a lot,
    peter
    Lydia Shawn Guest

  7. #6

    Re: extract from html

    >
    > Assuming you mean 'text1' and 'text2' are html tags, then the following
    > example, (which is straight out of the HTML::Parser documentation), will
    > do it for you. This example prints out the title text of a html page if
    > you supply the page as a filename on the command line, so just change
    > the word "title" to the tag name you require:
    >
    >
    thanks simon,
    but text1/text2 are actual text that occurs within <TD> tags, which
    are all over the document...
    the reason i would like to use HTML::Parser for the job is its
    feature, at least the way i understand it, of starting to match things
    before the whole file is read in. then, after the match, it should
    stop in order to save bandwidth,
    thanks again,
    peter
    Lydia Shawn Guest
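
    One way this could look, as a sketch only: HTML::Parser can be fed
    data piece by piece with parse(), and calling eof() from inside a
    handler ends parsing at that point. The function name
    first_number_in_chunks, the pattern, and the idea of passing the
    chunks in as a list are assumptions for illustration; in practice the
    chunks would arrive from the network.

```perl
use strict;
use warnings;

# Feed HTML to the parser chunk by chunk and return the first capture
# of $re found in any text segment (for example inside a <TD> cell).
sub first_number_in_chunks {
    my ($re, @chunks) = @_;
    require HTML::Parser;   # loaded on demand

    my $found;
    my $p = HTML::Parser->new(
        api_version => 3,
        text_h => [ sub {
            my ($self, $text) = @_;
            if ($text =~ $re) {
                $found = $1;
                $self->eof;   # abort parsing at the first match
            }
        }, "self,dtext" ],
    );
    # Deliver each run of text in one piece even when it spans chunks.
    $p->unbroken_text(1);

    for my $chunk (@chunks) {
        # parse() returns false once a handler has called eof().
        $p->parse($chunk) or last;
    }
    $p->eof unless defined $found;   # flush any pending text
    return $found;
}
```

    Calling first_number_in_chunks(qr/text1\s+(\d+)\s+text2/, @chunks)
    returns the number from the first matching cell, or undef when no
    cell matches.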

  8. #7

    Re: extract from html


    "Lydia Shawn" <apfeloma@hotmail.com> wrote in message
    news:1240b4dc.0308051647.685dde59@posting.google.com...
    > hi,
    > how can i extract the number between text1 and text2 in input.html
    > only the first time it occurs ignoring the rest?
    > preferably input.html would be a URL that stops downloading once a
    > match has occurred, that would save a lot of bandwidth..
    > i guess HTML::Parser would provide an option to work with a file while
    > it's downloading (?)
    Take a look at the lwp-download script (in your perl bin directory)
    as an example of a program that incrementally downloads a URL.
    You can then search the contents for your text1 and text2 and stop if found.

    The script uses LWP::UserAgent to do the download.
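
    A rough sketch of that idea, not taken from lwp-download itself:
    LWP::UserAgent's get() accepts a ':content_cb' callback that receives
    the body chunk by chunk, and dying inside the callback aborts the
    transfer. The helper names make_scanner and fetch_first_match, the
    4096-byte read size hint, and the pattern are assumptions for
    illustration.

```perl
use strict;
use warnings;

# Build a closure that collects chunks and returns the first capture
# of $re as soon as it is present in the data seen so far.
sub make_scanner {
    my ($re) = @_;
    my $buffer = '';
    return sub {
        my ($chunk) = @_;
        $buffer .= $chunk;   # keep earlier data so a match may span chunks
        return $buffer =~ $re ? $1 : undef;
    };
}

# Download $url incrementally and stop as soon as $re matches.
sub fetch_first_match {
    my ($url, $re) = @_;
    require LWP::UserAgent;   # loaded on demand
    my $ua   = LWP::UserAgent->new;
    my $scan = make_scanner($re);
    my $found;
    $ua->get(
        $url,
        ':read_size_hint' => 4096,
        ':content_cb'     => sub {
            my ($chunk) = @_;
            $found = $scan->($chunk);
            # die() here makes LWP abort the rest of the download.
            die "match found\n" if defined $found;
        },
    );
    return $found;
}

# Only touch the network when a URL is given on the command line.
if (@ARGV) {
    my $n = fetch_first_match($ARGV[0], qr/text1\s+(\d+)\s+text2/);
    print defined $n ? "$n\n" : "no match\n";
}
```

    The buffer holds everything received so far, which keeps the sketch
    short; for a very long page the part already scanned could be trimmed.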

    --
    brian


    Brian Helterline Guest
