Need help with a regex


  1. #1

    Need help with a regex

    This newbie needs help with a regex. Here's what the data from a text
    file looks like. There's no delimiter and the fields aren't evenly spaced
    apart.

    apples San Antonio Fruit
    oranges Sacramento Fruit
    pineapples Honolulu Fruit
    lemons Corona del Rey Fruit

    Basically, I want to put the city names into an array. The first field,
    the fruit name, is always one word with no spaces.

    So, I would guess that the regex needs to grab everything after the first
    word and before the beginning of Fruit. Then strip out all the spaces.

    Or grab the beginning of the second word until the beginning of Fruit.
    Then strip out the spaces after the city name.

    Anyone know how to do that ?

    I did recently buy the Mastering Regular Expressions, 2nd Edition book.
    I've only read a little, but I've found the book to be very readable. If
    I only had the time to really spend with it ! So much to learn, so little
    time.

    Thanks in advance for any help.
    Stuart Clemons Guest

  3. #2

    Re: Need help with a regex

    On Fri, Jan 23, 2004 at 12:01:13AM -0500, stuart_clemons@us.ibm.com wrote:
    > This newbie needs help with a regex. Here's what the data from a text
    > file looks like. There's no delimiter and the fields aren't evenly spaced
    > apart.
    >
    > apples San Antonio Fruit
    > oranges Sacramento Fruit
    > pineapples Honolulu Fruit
    > lemons Corona del Rey Fruit
    >
    > Basically, I want to put the city names into an array. The first field,
    > the fruit name, is always one word with no spaces.
    >
    > So, I would guess that the regex needs to grab everything after the first
    > word and before the beginning of Fruit. Then strip out all the spaces.
    >
    > Or grab the beginning of the second word until the beginning of Fruit.
    > Then strip out the spaces after the city name.
    >
    > Anyone know how to do that ?
    I'm not that experienced with Perl but here is my stab at it.

    #!/usr/bin/perl
    use warnings;
    use strict;

    my @array;    # declared once, outside the loop, so matches accumulate

    while (<DATA>) {
        if ($_ =~ /^\w+\s+(.+)\s+F\w+$/) {
            push @array, $1;
        }
    }
    print "$_\n" for @array;    # one city per line

    __DATA__
    apples San Antonio Fruit
    oranges Sacramento Fruit
    pineapples Honolulu Fruit
    lemons Corona del Rey Fruit

    Not the greatest regex but it works. I'm sure you will get better
    solutions.

    /^ = Beginning of line
    \w+ = one or more word characters
    \s+ = one or more white spaces
    (.+) = any character one or more times grouped by (), contains "city"
    \s+F = white space up to "F"
    \w+$/ = one or more word characters up to end of line.

    push loads @array with "$1" which snags what is in (.+) from the regex.
    hth,
    Kent

    Kenton Brede Guest

  4. #3

    RE: Need help with a regex

    Thanks very much Tim. I just did a quick test on my real file and it
    worked perfectly.

    I definitely still have a lot to learn with both Perl and regexes, so I
    really appreciate the explanation as well. Though your script is very
    compact, I learned a lot from it. Such as how you initialized the array.
    I have a couple of scripts where I get warnings about either improper or
    uninitialized arrays, or something to that effect. I tried to fix those,
    but was unsuccessful. Those scripts produced the output I wanted, but the
    warnings are bothersome. I'll take another look at those scripts to see if
    initializing using "my @arrayname = ( );" will help.

    Also, the "push" structure for adding elements to the array was very
    helpful. I have a way to do it, and while my way works and is somewhat
    creative, my way is actually really embarrassingly bad and inefficient
    coding. So, I learned from that too.

    It's funny how all this stuff is in the Perl books that I've been reading,
    but once I need to solve a problem, the exact right way to do it doesn't
    come to me. I can spend hours trying to do some pretty simple stuff. I
    can usually come up with a solution, but I know that it's not usually
    efficient nor is it really close to the right way to do it. But, the
    good news is, if I think about where my Perl skills are today compared to
    a month ago, I'm making progress !

    Anyway, sorry for being so looong winded. The bottom line is that I
    really appreciate your help.




    "Tim Johnson" <tjohnson@sandisk.com>
    01/23/2004 01:32 AM

    To
    "Tim Johnson" <tjohnson@sandisk.com>, <stuart_clemons@us.ibm.com>,
    <beginners@perl.org>
    cc

    Subject
    RE: Need help with a regex






    Ooh. That's embarrassing. I didn't pay close enough attention to the OP.
    Some of the inside matches contain spaces. My regex should have been:

    /^\S+\s+(.+)\s+/

    which would match:

    * the beginning of the line (^)
    * followed by one or more non-whitespace characters (\S+)
    * followed by one or more whitespace characters (\s+)
    * followed by one or more of any characters including
    whitespace (.+)
    * followed by one or more whitespace characters (\s+)

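    Because .+ is greedy, it grabs as many characters as it can and then
    backs off only far enough for the final \s+ to match, so it captures
    everything up to the last run of whitespace. Note that on a line read
    straight from a file the trailing newline counts as whitespace, so chomp
    the line first (or end the pattern with \s+\S+$); otherwise the last
    field ends up in the capture too.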
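
To see the greedy match in action, here is a minimal sketch run against the sample lines from the original question. The helper name city_from and the in-memory @lines list are illustrative assumptions, not part of the thread. Each line is chomped before matching, and the last statement shows what happens when it is not.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the text between the
# first word and the last run of whitespace, or undef when nothing matches.
sub city_from {
    my ($line) = @_;
    chomp $line;    # drop the newline so the final \s+ cannot match it
    return $line =~ /^\S+\s+(.+)\s+/ ? $1 : undef;
}

# Sample lines as they would arrive from a file, newline included.
my @lines = (
    "apples San Antonio Fruit\n",
    "oranges Sacramento Fruit\n",
    "pineapples Honolulu Fruit\n",
    "lemons Corona del Rey Fruit\n",
);

my @cities = grep { defined } map { city_from($_) } @lines;
print "$_\n" for @cities;

# For contrast: without chomp the trailing newline satisfies the final \s+,
# so the greedy .+ never has to give back the last field.
my ($unchomped) = "apples San Antonio Fruit\n" =~ /^\S+\s+(.+)\s+/;
print "without chomp: $unchomped\n";
```

The grep { defined } step simply skips lines that do not match, so a blank or malformed line never pushes a stale value onto the array.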

    -----Original Message-----
    From: Tim Johnson
    Sent: Thu 1/22/2004 9:31 PM
    To: stuart_clemons@us.ibm.com; beginners@perl.org
    Cc:
    Subject: RE: Need help with a regex



    Try this on for size:

    #####################
    use strict;
    use warnings;
    my @cities = ();
    open(INFILE, "myfile.txt") || die "Couldn't open myfile.txt for reading!\n";
    while (<INFILE>) {
        # push only on a successful match, so a stale $1 is never stored
        push @cities, $1 if $_ =~ /^\S+\s+(\S+)/;
    }
    close(INFILE);
    #do something to @cities

    #####################

    which basically means to match:

    * the start of the line (^)
    * followed by one or more non-whitespace characters
    (\S+)
    * followed by one or more whitespace characters
    (\s+)
    * followed by one or more non-whitespace characters
    (\S+)

    the parentheses around the last non-whitespace match
    assign it to $1

    Note: Check out "perldoc perlre" for the man pages. It
    might be worth looking over real quick before you dig into the book.

    Or, for the quick and easy way without a regex, how bout:

    #############################

    use strict;
    use warnings;
    my @cities;
    open(INFILE, "myfile.txt") || die "Could not open myfile.txt for reading!\n";
    while (<INFILE>) {
        # element [1] is the second whitespace-separated field
        push @cities, (split /\s+/, $_)[1];
    }
    close(INFILE);

    #############################

    which splits the line on whitespace, takes the second element of the
    resulting list, and pushes it onto @cities. Like the regex above, this
    keeps only the first word of a multi-word city.

    -----Original Message-----
    From: stuart_clemons@us.ibm.com
    Sent: Thu 1/22/2004 9:01 PM
    To: [email]beginners@perl.org[/email]
    Cc:
    Subject: Need help with a regex



    This newbie needs help with a regex. Here's what
    the data from a text
    file looks like. There's no delimiter and the
    fields aren't evenly spaced
    apart.

    apples San Antonio Fruit
    oranges Sacramento Fruit
    pineapples Honolulu Fruit
    lemons Corona del Rey Fruit

    Basically, I want to put the city names into an
    array. The first field,
    the fruit name, is always one word with no
    spaces.







    Stuart Clemons Guest

  5. #4

    Re: Need help with a regex

    On Jan 23, stuart_clemons@us.ibm.com said:
    >This newbie needs help with a regex. Here's what the data from a text
    >file looks like. There's no delimiter and the fields aren't evenly spaced
    >apart.
    >
    >apples San Antonio Fruit
    >oranges Sacramento Fruit
    >pineapples Honolulu Fruit
    >lemons Corona del Rey Fruit
    >
    >Basically, I want to put the city names into an array. The first field,
    >the fruit name, is always one word with no spaces.
    >
    >Anyone know how to do that ?
    Well, there are many ways. You could split the string on whitespace,
    remove the first and last elements, and join the others with spaces:

    for (@data) {
        my @fields = split;
        shift @fields;
        pop @fields;
        push @cities, "@fields";  # "@array" = join(" ", @array)
    }
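
The loop above assumes that @data and @cities already exist. A self-contained sketch follows, with the sample lines inlined as a hypothetical @data array. No chomp is needed here, because split with no arguments breaks on any whitespace and drops the empty trailing field left by the newline.

```perl
use strict;
use warnings;

# Assumed input: the sample lines from the question, newline included.
my @data = (
    "apples San Antonio Fruit\n",
    "oranges Sacramento Fruit\n",
    "pineapples Honolulu Fruit\n",
    "lemons Corona del Rey Fruit\n",
);

my @cities;
for (@data) {
    my @fields = split;         # split $_ on runs of whitespace
    shift @fields;              # discard the fruit name (first token)
    pop @fields;                # discard the trailing "Fruit" (last token)
    push @cities, "@fields";    # rejoin what is left with single spaces
}

print "$_\n" for @cities;
```

One side effect of this approach is that any run of several spaces inside a city name is collapsed to a single space when the fields are rejoined.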

    Or, you could use a regex that gets SPECIFICALLY what you want:

    for (@data) {
        push @cities, $1 if /^\S+\s+(\S+(?:\s+\S+)*)\s+\S+$/;
    }

    That regex might need a bit of explanation:

    m{
      ^                  # the beginning of the string
      \S+                # one or more non-spaces
      \s+                # one or more spaces
      (                  # start of capture group 1:
        \S+              #   first word of the city name
        (?: \s+ \S+ )*   #   *ALL* remaining words
      )
      \s+                # one or more spaces
      \S+                # one or more non-spaces
      $                  # the end of the string
    }x;

    What this does on a string like "peach Georgia fruit" is this: the first
    \S+\s+ matches "peach ". Then we capture "Georgia fruit" to $1. However,
    the REST of the regex still has to match, but it can't, so the (?:\s+\S+)*
    backtracks -- it gives up one of the chunks it matched, so $1 is only
    "Georgia". Then the last \s+\S+ can match " fruit".
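
The backtracking described above can be checked with a short sketch. The names $city_re and extract_city are illustrative assumptions, and the test strings are the "peach Georgia fruit" example plus one line from the question. The comments inside the pattern deliberately avoid writing a dollar-digit variable, since a /x pattern is interpolated before its comments are stripped.

```perl
use strict;
use warnings;

# The pattern from the post, compiled once with /x so it can carry comments.
my $city_re = qr{
    ^ \S+ \s+                 # first field and the gap after it
    ( \S+ (?: \s+ \S+ )* )    # capture group 1: the city, one or more words
    \s+ \S+ $                 # gap and the final field
}x;

# Hypothetical helper: returns the captured city, or undef on no match.
sub extract_city {
    my ($line) = @_;
    return $line =~ $city_re ? $1 : undef;
}

for my $line ("peach Georgia fruit", "lemons Corona del Rey Fruit", "lonelyword") {
    my $city = extract_city($line);
    print defined $city ? "'$line' => '$city'\n" : "'$line' => no match\n";
}
```

Because the pattern is anchored at both ends, a line with fewer than three fields fails outright instead of producing a partial capture.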

    --
    Jeff "japhy" Pinyan japhy@pobox.com http://www.pobox.com/~japhy/
    RPI Acacia brother #734 http://www.perlmonks.org/ http://www.cpan.org/
    <stu> what does y/// stand for? <tenderpuss> why, yansliterate of course.
    [ I'm looking for programming work. If you like my work, let me know. ]

    Jeff 'Japhy' Pinyan Guest

  6. #5

    Re: Need help with a regex

    Thanks Jeff. I hope to try this out later today. I thought I had the
    solution earlier this morning, but I ran into a problem. I hope this will
    solve it ! Thanks again.




    "Jeff 'japhy' Pinyan" <japhy@perlmonk.org>
    01/23/2004 10:34 AM
    Please respond to
    japhy@pobox.com


    To
    stuart_clemons@us.ibm.com
    cc
    beginners@perl.org
    Subject
    Re: Need help with a regex









    Stuart Clemons Guest

  7. #6

    Re: Need help with a regex

    stuart_clemons@us.ibm.com wrote:
    > Thanks Jeff. I hope to try this out later today. I thought I had the
    > solution earlier this morning, but I ran into a problem. I hope this will
    > solve it ! Thanks again.
    >
    > >apples San Antonio Fruit
    > >oranges Sacramento Fruit
    > >pineapples Honolulu Fruit
    > >lemons Corona del Rey Fruit
    > >
    > >Basically, I want to put the city names into an array. The first field,
    > >the fruit name, is always one word with no spaces.
    > >
    > >Anyone know how to do that ?
    >
    > Well, there are many ways. You could split the string on whitespace,
    > remove the first and last elements, and join the others with spaces:
    >
    > for (@data) {
    > my @fields = split;
    > shift @fields;
    > pop @fields;
    > push @cities, "@fields"; # "@array" = join(" ", @array)
    > }
    I'd vote for this one--almost. It does the right thing with positions,
    presuming that Stuart can count on the fruit type and the class each being
    contained in a single token. The one thing I would do is to give the parts
    meaningful names. Unless he totally wants to discard the significant fruit
    name as well as the non-informative class designation "Fruit", he might as
    well preserve the information that he has available:

    foreach (@data) {
        my @fields = split;
        pop @fields;  # only use void statements to get rid of garbage
        my $growing_location = {
            'fruit type'       => shift @fields,
            'growing location' => join(' ', @fields),  # join needs a separator
        };
        push @cities, $growing_location;
    }

    Okay, I don't know whether these really indicate growing locations, but I am
    assuming sanity here--that there is some articulable meaning to the
    juxtaposition. The identifiers in the code should communicate that meaning.
    Otherwise you are throwing information away, the antithesis of the
    programmer's purpose. Besides, clearly named variables are much easier to
    debug.
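
As a rough, self-contained sketch of the record-per-line idea, the following keeps both the fruit name and the location. The @data contents are the sample lines from the question, and @records is a hypothetical name chosen here because each element holds more than a city.

```perl
use strict;
use warnings;

# Assumed input: the sample lines from the original question.
my @data = (
    "apples San Antonio Fruit\n",
    "oranges Sacramento Fruit\n",
    "pineapples Honolulu Fruit\n",
    "lemons Corona del Rey Fruit\n",
);

my @records;    # one hash reference per input line
foreach (@data) {
    my @fields = split;
    pop @fields;                  # discard the constant "Fruit" class
    my $fruit = shift @fields;    # first token is the fruit name
    push @records, {
        'fruit type'       => $fruit,
        'growing location' => join(' ', @fields),    # remaining tokens
    };
}

# Print a two-column listing: fruit name, then location.
printf "%-12s %s\n", $_->{'fruit type'}, $_->{'growing location'} for @records;
```

Pulling the fruit name into its own variable before building the hash avoids relying on the order in which the list elements are evaluated.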

    Joseph



    R. Joseph Newton Guest
