
  1. #1

    regex help!

    Hello,

    I am trying to extract email addresses from about 1000 htm files.

    So far am trying

    if ($line =~ /Mailto:(.*)"/ {
    print OUT ("$1 \n");

    where the line is

    <a href="mailto:fred@aol.com"

    problem is with the " after the email address and the "greedy" regex
    characteristic which finds other " further along the line ...

    can I stop at the first " mark?

    Cheers

    Geoff
    Geoff Cox Guest

  3. #2

    Re: regex help!

    In article <lle5mvgf13133a618g0s3h560bpi1c535e@4ax.com>, Geoff Cox wrote:
    > Hello,
    >
    > I am trying to extract email addresses from about 1000 htm files.
    E-mail address harvesting in your spare time, are you?
    > if ($line =~ /Mailto:(.*)"/ {
    > print OUT ("$1 \n");
    [cut]
    > problem is with the " after the email address and the "greedy" regex
    > characteristic which finds other " further along the line ...
    Read the perlre manual about changing the "greediness" of a
    quantifier with "?".


    --
    Andreas Kähäri
    Andreas Kahari Guest

  4. #3

    Re: regex help!

    In article <lle5mvgf13133a618g0s3h560bpi1c535e@4ax.com>,
    Geoff Cox <geoff.cox@blueyonder.co.uk> wrote:
    > Hello,
    >
    > I am trying to extract email addresses from about 1000 htm files.
    >
    > So far am trying
    >
    > if ($line =~ /Mailto:(.*)"/ {
    > print OUT ("$1 \n");
    >
    > where the line is
    >
    > <a href="mailto:fred@aol.com"
    >
    > problem is with the " after the email address and the "greedy" regex
    > characteristic which finds other " further along the line ...
    >
    > can I stop at the first " mark?
    /Mailto:(.*?)"/

    You know that won't match your example, don't you? Not unless you add
    the 'i' flag (for 'i'gnore case):


    /Mailto:(.*?)"/i

    hth-

    --
    Michael Budash
    Michael Budash Guest
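
    The difference between the greedy and non-greedy forms can be seen in a
    small sketch. The sample line below is an assumption (the original post
    shows only the start of the line); it adds a second quoted attribute so
    that the greedy match has a later " to run on to.

    ```perl
    use strict;
    use warnings;

    # Hypothetical sample line: a second quoted attribute follows the address.
    my $line = '<a href="mailto:fred@aol.com" class="contact">Fred</a>';

    # Greedy: (.*) grabs as much as it can, so the match ends at the LAST
    # double quote on the line.
    my ($greedy) = $line =~ /Mailto:(.*)"/i;

    # Non-greedy: (.*?) grabs as little as it can, so the match ends at the
    # FIRST double quote after "mailto:".
    my ($lazy) = $line =~ /Mailto:(.*?)"/i;

    print "greedy: $greedy\n";
    print "lazy:   $lazy\n";
    ```

    Both patterns use the i flag, since the pattern says "Mailto:" while the
    HTML says "mailto:".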

  5. #4

    Re: regex help!

    On Sat, 13 Sep 2003 07:33:31 GMT, Michael Budash <mbudash@sonic.net>
    wrote:
    >/Mailto:(.*?)"/
    >
    >You know that won't match your example, don't you? Not unless you add
    >the 'i' flag (for 'i'gnore case):
    Michael,

    Thanks for the help - the following code works now, but I get the
    warning "uninitialized value in string ne" at the line marked with **
    below - do you know why?

    Cheers

    Geoff

    use warnings;
    use strict;

    use File::Find;

    open (OUT, ">>out");

    my $dir = 'c:/atemp1/directory';

    find ( sub {
        open (IN, "$_");
        my $line = <IN>;
    **  while ($line ne "") {
            if ($line =~ /Mailto:(.*?)"/i) {
                print OUT ("$1 \n");
            }
            $line = <IN>;
        }
    }, $dir);

    close (OUT);

    >
    >/Mailto:(.*?)"/i
    >
    >hth-
    Geoff Cox Guest

  6. #5

    Re: regex help!

    In article <sfk5mvo202ccnl1b8tjv634fut1qvdo1nf@4ax.com>, Geoff Cox wrote:
    [cut]
    > Thanks for the help - the following code works now, but I get the
    > warning "uninitialized value in string ne" at the line marked with **
    > below - do you know why?
    [cut]
    > open (IN, "$_");
    > my $line = <IN>;
    > ** while ($line ne "") {
    > if ($line =~ /Mailto:(.*?)"/i) {
    > print OUT ("$1 \n");
    [cut]


    What happens at the end of a file? Well, <IN> will give you an
    undefined value. This will also happen if the open() call failed.


    --
    Andreas Kähäri
    Andreas Kahari Guest
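
    A minimal sketch of the end-of-file behaviour described above. It reads
    from an in-memory filehandle (an assumption, standing in for one of the
    .htm files) so that it runs without any files on disk.

    ```perl
    use strict;
    use warnings;

    # In-memory "file" containing exactly one line (stand-in for a real file).
    my $data = "only line\n";
    open(my $in, '<', \$data) or die "Failed in open(): $!";

    my $first  = <$in>;   # the one line in the data
    my $second = <$in>;   # nothing left to read: undef, not ""

    print defined $first  ? "first read:  defined\n" : "first read:  undef\n";
    print defined $second ? "second read: defined\n" : "second read: undef\n";

    close($in);
    ```

    Comparing that second value with ne "" is what triggers the
    "uninitialized value" warning under use warnings.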

  7. #6

    Re: regex help!

    On Sat, 13 Sep 2003 08:21:39 +0000 (UTC), Andreas Kahari
    <ak+usenet@freeshell.org> wrote:
    >In article <sfk5mvo202ccnl1b8tjv634fut1qvdo1nf@4ax.com>, Geoff Cox wrote:
    >[cut]
    >> Thanks for the help - the following code works now, but I get the
    >> warning "uninitialized value in string ne" at the line marked with **
    >> below - do you know why?
    >[cut]
    >> open (IN, "$_");
    >> my $line = <IN>;
    >> ** while ($line ne "") {
    >> if ($line =~ /Mailto:(.*?)"/i) {
    >> print OUT ("$1 \n");
    >[cut]
    >
    >
    >What happens at the end of a file? Well, <IN> will give you an
    >undefined value. This will also happen if the open() call failed.
    Andreas,

    ah! well the open call works so must be the end of file part - is
    there a better way than using while ($line ne "" ) ? eof?

    Geoff

    Geoff Cox Guest

  8. #7

    Re: regex help!

    In article <28l5mvovmurpbjk33goktp83ee2iv9e06f@4ax.com>, Geoff Cox wrote:
    > On Sat, 13 Sep 2003 08:21:39 +0000 (UTC), Andreas Kahari
    ><ak+usenet@freeshell.org> wrote:
    >>In article <sfk5mvo202ccnl1b8tjv634fut1qvdo1nf@4ax.com>, Geoff Cox wrote:
    [cut]
    >>> open (IN, "$_");
    >>> my $line = <IN>;
    >>> ** while ($line ne "") {
    >>> if ($line =~ /Mailto:(.*?)"/i) {
    >>> print OUT ("$1 \n");
    >>[cut]
    >>
    >>
    >>What happens at the end of a file? Well, <IN> will give you an
    >>undefined value. This will also happen if the open() call failed.
    >
    > Andreas,
    >
    > ah! well the open call works so must be the end of file part - is
    > there a better way than using while ($line ne "" ) ? eof?
    Yes, a much much better way:

    while(defined($line = <IN>)) {
    ... code ...
    }

    And personally I would say

    open(IN, $_) or die "Failed in open(): $!";


    Cheers,
    Andreas

    --
    Andreas Kähäri
    Andreas Kahari Guest
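
    The two suggestions above (loop on defined, and check open) might be
    combined as in the following sketch. The sample HTML and the
    example.com address are assumptions for illustration, and a lexical
    filehandle on an in-memory string is used in place of the bareword IN so
    that the example runs on its own.

    ```perl
    use strict;
    use warnings;

    # Hypothetical file contents, held in memory for the sake of the example.
    my $html = join "\n",
        '<p>No address here</p>',
        '<a href="mailto:fred@aol.com">Fred</a>',
        '<a href="MAILTO:wilma@example.com">Wilma</a>',
        '';

    # Check the result of open() instead of assuming it worked.
    open(my $in, '<', \$html) or die "Failed in open(): $!";

    my @found;
    # The loop ends when <$in> returns undef at end of file, so $line is
    # never compared against anything while undefined.
    while (defined(my $line = <$in>)) {
        if ($line =~ /Mailto:(.*?)"/i) {
            push @found, $1;
        }
    }
    close($in);

    print "$_\n" for @found;
    ```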

  9. #8

    Re: regex help!

    On Sat, 13 Sep 2003 08:39:03 +0000 (UTC), Andreas Kahari
    <ak+usenet@freeshell.org> wrote:
    >Yes, a much much better way:
    >
    > while(defined($line = <IN>)) {
    > ... code ...
    > }
    >
    >And personally I would say
    >
    > open(IN, $_) or die "Failed in open(): $!";
    will use both - thanks!

    Geoff
    >
    >
    >Cheers,
    >Andreas
    Geoff Cox Guest
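
    Putting the pieces of the thread together, the File::Find script might
    end up looking something like the sketch below. Several things here are
    assumptions rather than part of the original script: the sample files
    are created in a temporary directory (instead of c:/atemp1/directory),
    only names ending in .htm or .html are read, and the addresses are
    collected in an array rather than appended to the file "out".

    ```perl
    use strict;
    use warnings;
    use File::Find;
    use File::Temp qw(tempdir);

    # Throwaway directory with two hypothetical sample files.
    my $dir = tempdir(CLEANUP => 1);
    my %samples = (
        'a.htm' => '<a href="mailto:fred@aol.com" class="x">Fred</a>' . "\n",
        'b.htm' => '<p>nothing here</p>' . "\n"
                 . '<a href="MAILTO:wilma@example.com">Wilma</a>' . "\n",
    );
    for my $name (sort keys %samples) {
        open(my $fh, '>', "$dir/$name") or die "Cannot write $dir/$name: $!";
        print {$fh} $samples{$name};
        close($fh) or die "Cannot close $dir/$name: $!";
    }

    my @addresses;
    find(sub {
        # find() changes into each directory, so $_ is the bare file name.
        return unless -f $_ && /\.html?$/i;

        open(my $in, '<', $_)
            or die "Failed in open() on " . $File::Find::name . ": $!";

        # Stop cleanly at end of file instead of comparing undef with "".
        while (defined(my $line = <$in>)) {
            push @addresses, $1 if $line =~ /Mailto:(.*?)"/i;
        }
        close($in);
    }, $dir);

    print "$_\n" for sort @addresses;
    ```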

  10. #9

    Re: regex help!

    Geoff Cox <geoff.cox@blueyonder.co.uk> wrote:
    > On Sat, 13 Sep 2003 08:39:03 +0000 (UTC), Andreas Kahari
    ><ak+usenet@freeshell.org> wrote:
    >> while(defined($line = <IN>)) {

    I like this better:

    while ( my $line = <IN> ) {

    >> ... code ...
    >> }
    >>
    >>And personally I would say
    >>
    >> open(IN, $_) or die "Failed in open(): $!";
    >
    > will use both - thanks!

    If you read the docs for the function that you used, then you
    would have already known to check open()'s return value.

    (there is a general-purpose lesson there...)


    perldoc -f open

    Open returns nonzero upon success, the undefined value otherwise.
    ...
    When opening a file, it's usually a bad idea to continue normal execution
    if the request failed, so C<open> is frequently used in connection with
    C<die>. Even if C<die> won't do what you want (say, in a CGI script,
    where you want to make a nicely formatted error message (but there are
    modules that can help with that problem)) you should always check
    the return value from opening a file.

    --
    Tad McClellan SGML consulting
    tadmc@augustmail.com Perl programming
    Fort Worth, Texas
    Tad McClellan Guest
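
    As a small illustration of the documentation quoted above, the sketch
    below tries to open a file that is assumed not to exist (the path is
    made up for the example) and inspects the return value rather than
    calling die, which is one option when die is not the behaviour you want.

    ```perl
    use strict;
    use warnings;

    # Hypothetical path, assumed not to exist on the machine running this.
    my $missing = 'no-such-directory/no-such-file.htm';

    # open() returns a false value on failure and sets $! to the reason.
    my $ok = open(my $in, '<', $missing);

    if ($ok) {
        print "opened $missing\n";
        close($in);
    } else {
        print "open failed for $missing: $!\n";
    }
    ```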

  11. #10

    Re: regex help!

    -----BEGIN PGP SIGNED MESSAGE-----
    Hash: SHA1

    Geoff Cox <geoff.cox@blueyonder.co.uk> wrote in
    news:lle5mvgf13133a618g0s3h560bpi1c535e@4ax.com:
    > I am trying to extract email addresses from about 1000 htm files.
    >
    > So far am trying
    >
    > if ($line =~ /Mailto:(.*)"/ {
    > print OUT ("$1 \n");
    >
    > where the line is
    >
    > <a href="mailto:fred@aol.com"
    >
    > problem is with the " after the email address and the "greedy" regex
    > characteristic which finds other " further along the line ...
    >
    > can I stop at the first " mark?
    Change your thinking a bit. Instead of matching "Mailto:" followed by as
    many characters as possible followed by a quote, match "Mailto:" followed
    by as many non-quote characters as possible followed by a quote:

    if ($line =~ /Mailto:([^"]*)"/)

    Also consider making it case-insensitive with the i modifier.

    - --
    Eric
    $_ = reverse sort $ /. r , qw p ekca lre uJ reh
    ts p , map $ _. $ " , qw e p h tona e and print

    -----BEGIN PGP SIGNATURE-----
    Version: PGPfreeware 7.0.3 for non-commercial use <http://www.pgp.com>

    iQA/AwUBP2MoO2PeouIeTNHoEQIdtACgxV2WliWoH07gZaS39JHGdb 1q+wAAn1f6
    oXom0J4O85KppYwOysICYuZs
    =yU+G
    -----END PGP SIGNATURE-----
    Eric J. Roode Guest
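
    A short sketch of the negated character class approach. The sample line
    and the example.com address are assumptions; the line carries two
    mailto links so that the g flag (not mentioned in the thread, added
    here for illustration) can pick up every address on it.

    ```perl
    use strict;
    use warnings;

    # Hypothetical line with two links and extra quoted attributes.
    my $line = '<a href="mailto:fred@aol.com" title="Mail Fred">Fred</a> '
             . '<a href="MAILTO:wilma@example.com">Wilma</a>';

    # [^"]* can never run past a double quote, so each capture stops at the
    # first " after "mailto:". In list context, /g returns every capture.
    my @found = $line =~ /Mailto:([^"]*)"/gi;

    print "$_\n" for @found;
    ```

    Because [^"] cannot match a quote at all, this pattern does not depend
    on greedy versus non-greedy matching to stop in the right place.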

  12. #11

    Re: regex help!

    On Sat, 13 Sep 2003 09:22:06 -0500, "Eric J. Roode"
    <REMOVEsdnCAPS@comcast.net> wrote:

    >Change your thinking a bit. Instead of matching "Mailto:" followed by as
    >many characters as possible followed by a quote, match "Mailto:" followed
    >by as many non-quote characters as possible followed by a quote:
    >
    > if ($line =~ /Mailto:([^"]*)"/)
    Thanks Eric - will give it a try...

    Cheers

    Geoff
    >
    >Also consider making it case-insensitive with the i modifier.
    Geoff Cox Guest
