Regex Head Scratcher

  1. #1

    Regex Head Scratcher

    All:

    I've run into a dark alley with a regex and could use some assistance.

    I have a list of keywords as keys in a hash (eg. %keywords) and I am looking
    at each key to see if it can be matched in a string.

    For example, let's say the current $keyword is "bush" and the $phrase is:

    "George Bush arrived in San Francisco today from a trip to Russia."

    Quite simply, the expression $phrase =~ /$keyword/is; would be a positive
    match on "Bush". If $keyword becomes "U.S." the result, not too
    surprisingly, becomes positive as well ("us" in Russia). Well, this is
    undesirable so I change the expression to $phrase =~ /\Q$keyword/is; and
    now there is no longer a match on "U.S." which is the desired result.

    This is still insufficient, however, because if the $keyword becomes "Cisco"
    there will be a positive match on "Francisco". Therefore, I'll make some
    accommodation for leading and trailing characters in the expression in
    order to isolate the keywords, thus:

    $phrase =~ /(?:\s|'|"|\()$keyword(?:,|'|"|!|\.|\s|\))/is;

    OK, so this works just fine but now I'm back to the "U.S." problem again and
    I'd like to stick the \Q back in but clearly this will render the new parts
    of the expression useless.

    The question, therefore is how can I treat $keyword as quoted in this
    context?

    One thing I did try was to assign qq($keyword) to a new variable and search
    on that variable, but this didn't seem to have an impact.

    Thank you for your time and thoughts.

    Moose

    mooseshoes Guest

  3. #2

    Re: Regex Head Scratcher

    Please use this as the expression in question in the previous post:

    $phrase =~ /(?:\s|'|"|\()?$keyword(?:,|'|"|!|\.|\s|\))?/is;


    mooseshoes Guest

  4. #3

    Re: Regex Head Scratcher

    mooseshoes wrote:
    >
    > I've run into a dark alley with a regex and could use some assistance.
    >
    > I have a list of keywords as keys in a hash (eg. %keywords) and I am looking
    > at each key to see if it can be matched in a string.
    >
    > For example, let's say the current $keyword is "bush" and the $phrase is:
    >
    > "George Bush arrived in San Francisco today from a trip to Russia."
    >
    > Quite simply, the expression $phrase =~ /$keyword/is; would be a positive
    > match on "Bush". If $keyword becomes "U.S." the result not too
    > surprisingly become positive as well ("us" in Russia). Well, this is
    > undesirable so I change the expression to $phrase =~ /\Q$keyword/is; and
    > now there is no longer a match on "U.S." which is the desired result.
    >
    > This is still insufficient, however, because if the $keyword becomes "Cisco"
    > there will be a positive match on "Francisco". Therefore, I'll make some
    > accommodation for leading and trailing characters in the expression in
    > order to isolate the keywords, thus:
    >
    > $phrase =~ /(?:\s|'|"|\()$keyword(?:,|'|"|!|\.|\s|\))/is;
    You probably want to use the \b word boundary zero width assertion.

    $phrase =~ /\b\Q$keyword\E\b/is;
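
A minimal sketch of the suggestion above, run against the phrase and keywords from the question. The helper name has_keyword is hypothetical, and the extra keyword 'george' is added only for illustration.

```perl
use strict;
use warnings;

my $phrase = 'George Bush arrived in San Francisco today from a trip to Russia.';

# Returns 1 when $keyword occurs in $text as a whole word (case-insensitive).
# \Q...\E quotes metacharacters in the interpolated keyword only, so the
# surrounding \b assertions keep their special meaning.
sub has_keyword {
    my ($text, $keyword) = @_;
    return $text =~ /\b\Q$keyword\E\b/i ? 1 : 0;
}

for my $keyword ('bush', 'Cisco', 'U.S.', 'george') {
    printf "%-7s %s\n", $keyword,
        has_keyword($phrase, $keyword) ? 'match' : 'no match';
}
```

Here 'bush' and 'george' match, while 'Cisco' (inside "Francisco") and 'U.S.' do not.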


    John
    --
    use Perl;
    program
    fulfillment
    John W. Krahn Guest

  5. #4

    Re: Regex Head Scratcher

    On 10 Aug 2003 21:53:26 GMT, mooseshoes <mooseshoes@gmx.net> wrote:
    > All:
    >
    > I've run into a dark alley with a regex and could use some assistance.
    >
    > I have a list of keywords as keys in a hash (eg. %keywords) and I am looking
    > at each key to see if it can be matched in a string.
    >
    > For example, let's say the current $keyword is "bush" and the $phrase is:
    >
    > "George Bush arrived in San Francisco today from a trip to Russia."
    >
    > Quite simply, the expression $phrase =~ /$keyword/is; would be a positive
    > match on "Bush". If $keyword becomes "U.S." the result not too
    > surprisingly become positive as well ("us" in Russia). Well, this is
    > undesirable so I change the expression to $phrase =~ /\Q$keyword/is; and
    > now there is no longer a match on "U.S." which is the desired result.
    Small nitpick: "U.S." does not match the "us" in Russia, it matches
    the "ussi" in Russia (if it were going to match "us" it'd grab the one
    in Bush :).
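
The nitpick is easy to check by capturing what the pattern actually matched. This is a small sketch using the phrase and keyword from the question; the variable names $unquoted and $quoted are illustrative.

```perl
use strict;
use warnings;

my $phrase  = 'George Bush arrived in San Francisco today from a trip to Russia.';
my $keyword = 'U.S.';

# Unquoted: each "." is a wildcard, so the pattern reads
# "u, any character, s, any character".
my ($unquoted) = $phrase =~ /($keyword)/i;

# Quoted: \Q...\E turns each "." into a literal dot.
my ($quoted) = $phrase =~ /(\Q$keyword\E)/i;

print 'unquoted: ', defined $unquoted ? "'$unquoted'" : 'no match', "\n";   # 'ussi'
print 'quoted:   ', defined $quoted   ? "'$quoted'"   : 'no match', "\n";   # no match
```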
    >
    > This is still insufficient, however, because if the $keyword becomes "Cisco"
    > there will be a positive match on "Francisco". Therefore, I'll make some
    > accommodation for leading and trailing characters in the expression in
    > order to isolate the keywords, thus:
    >
    > $phrase =~ /(?:\s|'|"|\()$keyword(?:,|'|"|!|\.|\s|\))/is;
    That isn't going to work; take, for example, your $phrase above and
    the keyword "george".
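
A quick way to reproduce the failure, using the pattern from the question unchanged (apart from dropping /s). The helper name delimited_match is hypothetical.

```perl
use strict;
use warnings;

my $phrase = 'George Bush arrived in San Francisco today from a trip to Russia.';

# The delimiter-based pattern from the question. The leading group must
# consume one character, so a keyword at the very start of the string
# can never match.
sub delimited_match {
    my ($text, $keyword) = @_;
    return $text =~ /(?:\s|'|"|\()$keyword(?:,|'|"|!|\.|\s|\))/i ? 1 : 0;
}

print 'george: ', delimited_match($phrase, 'george'), "\n";   # 0: nothing precedes "George"
print 'bush:   ', delimited_match($phrase, 'bush'),   "\n";   # 1: a space on both sides
```

The same limitation applies at the other end: a keyword that is the last thing in the string, with no trailing delimiter, would also be missed.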

    Use \b; in all likelihood it does what you actually want.

    See "perldoc perlre" for details on \b.
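
One caveat worth checking when a keyword itself ends in a non-word character, as "U.S." does: \b only matches between a word character and a non-word character, so there is no boundary between a trailing "." and a following space. The sketch below compares \b with negative lookarounds as one possible alternative. The sample text and both helper names are assumptions made up for this illustration, not something from the thread.

```perl
use strict;
use warnings;

my $text = 'The U.S. economy grew while Francisco watched.';

# \b needs a word character on exactly one side of the position,
# so it cannot match between "." and " ".
sub match_with_b {
    my ($text, $keyword) = @_;
    return $text =~ /\b\Q$keyword\E\b/i ? 1 : 0;
}

# Lookarounds only require that no word character touches the keyword.
sub match_with_lookaround {
    my ($text, $keyword) = @_;
    return $text =~ /(?<!\w)\Q$keyword\E(?!\w)/i ? 1 : 0;
}

for my $keyword ('U.S.', 'Cisco', 'economy') {
    printf "%-8s \\b: %d  lookaround: %d\n",
        $keyword,
        match_with_b($text, $keyword),
        match_with_lookaround($text, $keyword);
}
```

For keywords made only of word characters the two patterns agree; they differ only for keywords such as "U.S." that begin or end with punctuation.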

    Also, the /s modifier is useless here, since all it does is change "." to
    match all characters (instead of all characters bar "\n"). Since you
    don't have any (unescaped) "."s in your regex, /s just serves to confuse
    the reader of the regex (who will look for a dot).

    Again, "perldoc perlre" for details on /s.
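
A small sketch of what /s does and does not change. The two-line sample string is made up for this illustration.

```perl
use strict;
use warnings;

my $text = "line one\nline two";

# Without /s, "." does not match a newline, so the match cannot cross lines.
my $without_s = $text =~ /one.line/  ? 1 : 0;

# With /s, "." matches any character, including the newline.
my $with_s    = $text =~ /one.line/s ? 1 : 0;

# A pattern with no "." behaves the same with or without /s.
my $no_dot_plain = $text =~ /line two/  ? 1 : 0;
my $no_dot_s     = $text =~ /line two/s ? 1 : 0;

print "without /s: $without_s, with /s: $with_s\n";   # 0, 1
print "no dot: $no_dot_plain vs $no_dot_s\n";         # 1 vs 1
```

Note that \s (whitespace) already matches a newline regardless of /s, so line breaks in the text are not a reason to add the modifier.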
    >
    > OK, so this works just fine but now I'm back to the "U.S." problem again and
    > I'd like to stick the \Q back in but clearly this will render the new parts
    > of the expression useless.
    There is \E as well as \Q.

    See "perldoc perlre" again, for details.
    > The question, therefore is how can I treat $keyword as quoted in this
    > context?
    $phrase=~/\b\Q$keyword\E\b/i;

    You could also use:

    $keyword = quotemeta $keyword;

    $phrase=~/\b$keyword\b/i;

    Though you might want to use a different name if you need the original
    later.
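
A sketch of the quotemeta variant that keeps the original keyword around, as suggested above. The variable name $quoted is an arbitrary choice.

```perl
use strict;
use warnings;

my $phrase  = 'George Bush arrived in San Francisco today from a trip to Russia.';
my $keyword = 'U.S.';

# Keep the original keyword intact and store the escaped copy under a new name.
my $quoted = quotemeta $keyword;

print "original: $keyword\n";   # U.S.
print "quoted:   $quoted\n";    # U\.S\.

my $found = $phrase =~ /\b$quoted\b/i ? 1 : 0;
print "found: $found\n";        # 0, since the phrase contains no literal "U.S."
```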

    > One thing I did try was to assign qq($keyword) to a new variable and search
    > on that variable, but this didn't seem to have an impact.
    Why would it?

    "$foo = qq($bar)" is the same as "$foo = $bar" if $bar is a string already.

    Programming by guess is not efficient...
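
The point about qq() can be confirmed in a few lines. This is a sketch with illustrative variable names; quotemeta is shown alongside for contrast, since it is the call that actually escapes metacharacters.

```perl
use strict;
use warnings;

my $bar = 'U.S.';

my $foo    = qq($bar);         # plain interpolation: the same string as $bar
my $quoted = quotemeta $bar;   # escapes regex metacharacters

print $foo eq $bar ? "qq() changed nothing\n" : "qq() changed the string\n";
print "quotemeta gives: $quoted\n";   # U\.S\.
```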

    --
    Sam Holden

    Sam Holden Guest

  6. #5

    Re: Regex Head Scratcher

    On 10 Aug 2003 22:05:12 GMT, mooseshoes <mooseshoes@gmx.net> wrote:
    > Please use this as the expression in question in the previous post:
    >
    > $phrase =~ /(?:\s|'|"|\()?$keyword(?:,|'|"|!|\.|\s|\))?/is;
    That matches exactly the same set of strings as

    $phrase =~ /$keyword/i;

    does. Putting elements that can match the empty string at either
    end of a regex will not change the strings it matches (it may cause it
    to capture different parts of the string, but you aren't doing
    any capturing).
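
A sketch that compares the two patterns on a few keywords. The helper names and the extra keywords 'george' and 'moose' are illustrative only.

```perl
use strict;
use warnings;

my $phrase = 'George Bush arrived in San Francisco today from a trip to Russia.';

# The revised pattern from the question: both delimiter groups are optional.
sub with_optional_groups {
    my ($text, $keyword) = @_;
    return $text =~ /(?:\s|'|"|\()?$keyword(?:,|'|"|!|\.|\s|\))?/i ? 1 : 0;
}

# The bare keyword with no delimiters at all.
sub bare {
    my ($text, $keyword) = @_;
    return $text =~ /$keyword/i ? 1 : 0;
}

for my $keyword ('bush', 'cisco', 'george', 'moose') {
    printf "%-7s optional groups: %d  bare: %d\n",
        $keyword,
        with_optional_groups($phrase, $keyword),
        bare($phrase, $keyword);
}
```

Both columns agree for every keyword, including the unwanted match of 'cisco' inside "Francisco".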

    [snip "previous post"]

    --
    Sam Holden

    Sam Holden Guest

  7. #6

    Re: Regex Head Scratcher

    Sam (and John if you're listening):

    Thank you for your helpful remarks.

    You both came up with the same solution (great minds think alike?) and I now
    fully understand both the error of my ways and why the proposed solution
    is the best approach. Despite the fact that perlretut and Wall's bible go
    to bed with me each night, discovering perlre will be very helpful,
    as perlretut is light on both \b and \E.

    Regarding /s, I didn't mention earlier that the phrases were actually
    sub-phrases of HTML pages converted to text. Having had trouble with
    line breaks in previous experiences with these strings, I had left in the
    /s, but I can probably remove it at this point.

    And yes, I am occasionally guilty of what I call programming "flailing"
    which is a bad practice of inserting code with only a vague notion of what
    the result may be. I think I can attribute this to spending many years in
    the marketing departments of large companies. ;) I generally do catch
    myself, however, as I do prefer to know what is going on.

    Cheers,

    Moose




    mooseshoes Guest

  8. #7

    Re: Regex Head Scratcher

    On 11 Aug 2003 00:14:20 GMT, mooseshoes <mooseshoes@gmx.net> wrote:
    > Sam (and John if you're listening):
    >
    > Thank you for your helpful remarks.
    >
    > You both came up with the same solution (great minds think alike?) and I now
    > fully understand both the errors of my ways and why the proposed solution
    > is the best approach. Despite the fact that perlretut and Wall's bible go
    > to bed with me each night, discovering perlre will be a very helpful
    > resource as perlretut is light on both \b and \E.
    More the normal way of performing such a match than great minds...

    Of course if you haven't come across \b and \E, you aren't going to
    know the "normal" way.
    >
    > And yes, I am occasionally guilty of what I call programming "flailing"
    > which is a bad practice of inserting code with only a vague notion of what
    > the result may be. I think I can attribute this to spending many years in
    > the marketing departments of large companies. ;) I generally do catch
    > myself, however, as I do prefer to know what is going on.
    I think everyone "flails" at times, though with perl the documentation is
    of an amazingly high quality and hence there is little need to. Trial and
    error can on occasion be a useful learning method - as long as you
    take the time to learn why the things which failed failed, and why the
    things which worked worked.

    [snip quote of entire article]

    You really shouldn't do that. Many of the most experienced and helpful
    people here don't like it, and ignore posts from people who keep doing
    it. Taking the time to trim the quoted text to only what is necessary
    to give context to the reader will make your life easier later.

    See the Posting Guidelines, which are posted here frequently or on the web
    at http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html, for
    some other useful tips on making the most of this newsgroup.

    --
    Sam Holden
    Sam Holden Guest

  9. #8

    Re: Regex Head Scratcher

    mooseshoes wrote:
    > All:
    >
    > I've run into a dark alley with a regex and could use some assistance.
    >
    > I have a list of keywords as keys in a hash (eg. %keywords) and I am looking
    > at each key to see if it can be matched in a string.
    <SNIP>

    Other posts dealt with the issues you were having with your regular
    expression.

    However, one thing to keep in mind is that as your list of keywords
    gets longer, using a regex to match against them in the string
    gets less and less efficient. In other words, the regex approach
    does not seem to scale well IF you are searching for multiple words.

    It usually only takes a limited number of keywords (two in some cases)
    to iterate through before it becomes more efficient to split the string
    and then check the resulting list against the keywords hash.

    Something like:

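    # Split on whitespace; ' ' also skips leading and repeated whitespace.
    my @list = split ' ', $string;

    for (@list) {
        $keywords{$_} or next;    # skip words that are not keys in %keywords
        print "found one ( $_ )\n";
    }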
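
As written, the lookup above only finds words that equal a hash key exactly, so "Bush" would miss a key stored as "bush" and "Russia." would miss "russia". A minimal sketch of one way to normalize each word first, assuming the keys are stored as single lowercase words; the sample keys are illustrative.

```perl
use strict;
use warnings;

# Assumption: keys are single lowercase words.
my %keywords = map { $_ => 1 } qw(bush cisco russia);
my $string   = 'George Bush arrived in San Francisco today from a trip to Russia.';

my @found;
for my $token (split ' ', $string) {
    my $word = lc $token;
    $word =~ s/^\W+|\W+$//g;    # trim leading and trailing punctuation
    push @found, $word if $keywords{$word};
}

print "found: @found\n";        # found: bush russia
```

Because whole words are compared, "cisco" is not found inside "Francisco". Keys that contain spaces or end in punctuation (such as "U.S.") would need the same normalization applied to the keys, or a different approach.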

    You might want to benchmark whatever you come up with and see for
    yourself.
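
A rough sketch of such a benchmark, using cmpthese from the core Benchmark module. The keyword list, the sub names by_regex and by_split, and the iteration count are all arbitrary assumptions; real numbers depend on the data, so treat the output as indicative only.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $string   = 'George Bush arrived in San Francisco today from a trip to Russia.';
my %keywords = map { $_ => 1 } qw(bush russia moose perl regex alley hash);

# One regex match per keyword.
sub by_regex {
    my @hits = grep { $string =~ /\b\Q$_\E\b/i } keys %keywords;
    @hits = sort @hits;
    return @hits;
}

# One split, then one hash lookup per word.
sub by_split {
    my %seen;
    for my $token (split ' ', $string) {
        (my $word = lc $token) =~ s/^\W+|\W+$//g;
        $seen{$word} = 1 if $keywords{$word};
    }
    my @hits = sort keys %seen;
    return @hits;
}

# Run each approach 5000 times and print a comparison table.
cmpthese(5_000, {
    regex => sub { my @hits = by_regex() },
    split => sub { my @hits = by_split() },
});
```

To see how each approach scales, grow %keywords and rerun; both subs return the same hits for this sample, so the comparison is like for like.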

    What I've found is that variations of the above will benchmark about
    the same and regex time increases dramatically as the number of keys
    increases.

    Just a thought....

    s.

    Steve May Guest
