Matching invalid characters in a URL


  1. #1

    Default Matching invalid characters in a URL

    I'm trying to throw out URLs with any invalid characters in them, like
    '@'. According to http://www.ietf.org/rfc/rfc1738.txt :
    Thus, only alphanumerics, the special characters "$-_.+!*'(),", and
    reserved characters used for their reserved purposes may be used
    unencoded within a URL.

    I'd like to throw out a URL like
    'http://jncicancerspectrum.oupjournals.org/cgi/content/full/jnci;91/3/252'
    (even though this one works perfectly fine. Go figure.). I've tried:
    if ($url =~ /^[^A-Za-z0-9$-_.+!*'(),]+$/) { # if there are any invalid URL characters in the string
        # Remember, special regex characters lose their meaning inside []
        print "Invalid character in URL at line $.: $url\n";
        next;
    }

    According to my Camel, special regex characters are supposed to lose
    their special functioning inside []. Yet, that obviously isn't true for
    '-' used to separate the start and end of a range. I thought the fourth
    '-' at '$-' was probably indicating a range, so I tried to escape it by
    preceding it with a backslash or '\Q', but both gave strange errors about
    uninitialized values in concatenation.

    Any suggestions? Thanks for your help and thoughts.

    -Kevin Zembower

    -----
    E. Kevin Zembower
    Unix Administrator
    Johns Hopkins University/Center for Communications Programs
    111 Market Place, Suite 310
    Baltimore, MD 21202
    410-659-6139
    Kevin Zembower Guest

  3. #2

    Default Re: Matching invalid characters in a URL

    > Any suggestions? Thanks for your help and thoughts.

    It is much easier to define the set all chars must be in than not. Use
    the =! which is the complement of all characters matched by =~.
    Alternatively, I believe there is a c option you can use.
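
    If the "c option" here means the /c modifier of tr///, a minimal sketch
    of counting disallowed characters could look like the following. The
    helper name count_invalid_chars is hypothetical, and the allowed set is
    simply the one quoted from the RFC in the original question.

```perl
use strict;
use warnings;

# Hypothetical helper: counts the characters in $url that fall outside
# the allowed set. tr/// does not interpolate variables, so "$" needs no
# escape; the dash is escaped so it is not read as a range.
sub count_invalid_chars {
    my ($url) = @_;
    # /c complements the search list. With an empty replacement list and
    # no /d, tr only counts matches and leaves $url unchanged.
    return ($url =~ tr/A-Za-z0-9$\-_.+!*'(),//c);
}

print count_invalid_chars("abc.def"), "\n";   # no disallowed characters
print count_invalid_chars("a;b/c"), "\n";     # ";" and "/" are disallowed
```

    A non-zero count means the URL would be thrown out under this rule.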

    -Dan

    Dan Anderson Guest

  4. #3

    Default Re: Matching invalid characters in a URL


    > > Any suggestions? Thanks for your help and thoughts.
    >
    > It is much easier to define the set all chars must be in than not. Use
    > the =! which is the complement of all characters matched by =~.
    > Alternatively, I believe there is a c option you can use.
    >
    > -Dan
    That (I presume) should be !~ rather than =!, since !~ is the complement
    of =~ (just as != is the complement of ==). To include a literal dash in
    a character class [] without making it a range, place it as the first or
    last character in the class, or escape it with a backslash. To include a
    literal caret ^, do NOT place it as the first character (as that negates
    the class).
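
    A small sketch of those placement rules, using made-up one-character
    test strings (the variable names are only for illustration):

```perl
use strict;
use warnings;

my $dash_first  = qr/[-a-c]/;   # literal dash, plus a, b, c
my $dash_range  = qr/[a-c]/;    # range only: a, b, c
my $caret_later = qr/[a^]/;     # literal a or literal caret
my $caret_first = qr/[^a]/;     # negated class: anything except a

print "dash_first matches '-'\n"  if '-' =~ $dash_first;
print "dash_range rejects '-'\n"  if '-' !~ $dash_range;
print "caret_later matches '^'\n" if '^' =~ $caret_later;
print "caret_first rejects 'a'\n" if 'a' !~ $caret_first;
```

    Here !~ is simply the negation of =~, so each line prints only when the
    stated behaviour holds.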

    The other problem I would see is in more complex URLs, for instance @
    can be used to separate the authentication portion of a URL from the
    rest, a colon can indicate a port, and your example where semi-colons
    can be used to separate key/value pairs in the query string. You can
    likely catch 99% of bad URLs, just depends on how important that other
    1% is....

    Good luck,

    http://danconia.org
    Wiggins D Anconia Guest

  5. #4

    Default Re: Matching invalid characters in a URL


    > I'm trying to throw out URLs with any invalid characters in them, like
    > '@'. According to http://www.ietf.org/rfc/rfc1738.txt :
    > Thus, only alphanumerics, the special characters "$-_.+!*'(),", and
    > reserved characters used for their reserved purposes may be used
    > unencoded within a URL.
    >
    > I'd like to throw out a URL like
    > 'http://jncicancerspectrum.oupjournals.org/cgi/content/full/jnci;91/3/252'
    > (even though this one works perfectly fine. Go figure.). I've tried:
    > if ($url =~ /^[^A-Za-z0-9$-_.+!*'(),]+$/) { # if there are any invalid URL characters in the string
    >     # Remember, special regex characters lose their meaning inside []
    >     print "Invalid character in URL at line $.: $url\n";
    >     next;
    > }
    >
    > According to my Camel, special regex characters are supposed to lose
    > their special functioning inside []. Yet, that obviously isn't true for
    > '-' used to separate the start and end of a range. I thought the fourth
    > '-' at '$-' was probably indicating a range, so I tried to escape it by
    > preceding it with a backslash or '\Q', but both gave strange errors about
    > uninitialized values in concatenation.
    >
    > Any suggestions? Thanks for your help and thoughts.
    >
    Did you mean to leave out the characters that the RFC says may be
    reserved for some schemes?

    "The characters ";", "/", "?", ":", "@", "=" and "&" are the characters
    which may be reserved for special meaning within a scheme."

    They should be in the class as well, since you are negating it, right?
    Just trying to understand completely so I don't throw you off with any
    dumb remarks...
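
    If the reserved characters were added to the class, a sketch might look
    like this. The helper name has_disallowed_char is hypothetical, and "%"
    (used for encoded characters) is deliberately left out, which is an
    assumption on my part rather than anything stated in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $url contains a character outside
# A-Za-z0-9, $-_.+!*'(), and the reserved characters ; / ? : @ = &
# "$" and "@" are escaped so Perl does not try to interpolate them,
# and "-" is escaped so it is not read as a range.
sub has_disallowed_char {
    my ($url) = @_;
    return $url =~ m{[^A-Za-z0-9\$\-_.+!*'(),;/?:\@=&]} ? 1 : 0;
}

my @urls = (
    'http://jncicancerspectrum.oupjournals.org/cgi/content/full/jnci;91/3/252',
    'http://example.com/a b',    # made-up URL containing a space
);
printf "%d %s\n", has_disallowed_char($_), $_ for @urls;
```

    With this class the URL from the original question passes, because ";"
    ":" and "/" are now allowed.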

    http://danconia.org

    Wiggins D Anconia Guest

  6. #5

    Default Re: Matching invalid characters in a URL

    On Fri, 2004-01-09 at 16:54, Wiggins d Anconia wrote:
    > > > Any suggestions? Thanks for your help and thoughts.
    > >
    > > It is much easier to define the set all chars must be in than not. Use
    > > the =! which is the complement of all characters matched by =~.
    > > Alternatively, I believe there is a c option you can use.
    > >
    > > -Dan
    >
    > That (I presume) should be !~ instead of != to complement =~ as opposed
    > to ==. When trying to include a dash in a character class (and not make
    > it a range), [], place it as the first character in the class, when
    > including a carat ^ do NOT place it as the first character (as that
    > negates the class).
    Ooops... My apologies. I was typing too quick and without much
    caffeine. :-(

    -Dan

    Dan Anderson Guest

  7. #6

    Default Re: Matching invalid characters in a URL

    Thank you all for some first thoughts and clarifying questions.

    I'm trying to discard any URL with any character that is not an upper- or lower-case
    letter, digit, or the characters $-_.+!*'(), . I realize that some other characters can be
    used in special circumstances, but I don't have to allow for any of these in my program.

    I thought that my perl statement:
    if ($url =~ /^[^A-Za-z0-9$-_.+!*'(),]+$/) { # if there are any invalid URL characters in the string
        # Remember, special regex characters lose their meaning inside []
        print "Invalid character in URL at line $.: $url\n";
        next;
    }
    is saying:
    if the variable $url contains any characters not in the set [A-Za-z0-9$-_.+!*'(),], print "Invalid ..."

    So, I think I need help in two areas: do I have my logic backwards because I'm trying to match any
    character in a variable, and how do I write the match statement to do what I want?

    Thanks, again, for all your help and suggestions.

    -Kevin

    Kevin Zembower Guest

  8. #7

    Default RE: Matching invalid characters in a URL

    KEVIN ZEMBOWER <KZEMBOWE@jhuccp.org> wrote:
    :
    : I'm trying to discard any URL with any character that is not
    : an upper- or lower-case letter, digit, or the characters
    : $-_.+!*'(), . I realize that some other characters can be
    : used in special circumstances, but I don't have to allow for
    : any of these in my program.
    :
    : I thought that my perl statement:
    : if ($url =~ /^[^A-Za-z0-9$-_.+!*'(),]+$/) {
    : print "Invalid character in URL at line $.: $url\n";
    : next;
    : }
    : is saying:
    : if the variable $url contains any characters not in the set
    : [A-Za-z0-9$-_.+!*'(),]+$/), print "Invalid ..."

    No. It is saying if ALL characters are invalid ...

    Ignore the character class and look at the rest. There
    is no room for a valid character:

    /
      ^                          # start at the beginning of the string
      [^A-Za-z0-9$-_.+!*'(),]+
      $                          # end at the end of the string
    /

    Your anchors are dragging you down. You want to find the
    first invalid character. After that it doesn't matter. This
    should be fine.

    /[^A-Za-z0-9$-_.+!*'(),]/
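
    To see the difference the anchors make, here is a minimal sketch with
    three made-up strings. Note that the dollar and dash are backslash-escaped
    in this sketch (an addition of mine) so that only the effect of the
    anchors is being shown.

```perl
use strict;
use warnings;

# Anchored: matches only when EVERY character is outside the allowed set.
my $anchored   = qr/^[^A-Za-z0-9\$\-_.+!*'(),]+$/;
# Unanchored: matches as soon as ANY one character is outside the set.
my $unanchored = qr/[^A-Za-z0-9\$\-_.+!*'(),]/;

for my $s ('abc', 'abc;def', ';;;') {
    printf "%-8s anchored=%d unanchored=%d\n",
        $s,
        ($s =~ $anchored   ? 1 : 0),
        ($s =~ $unanchored ? 1 : 0);
}
```

    Only the unanchored pattern flags 'abc;def'; the anchored one needs a
    string made up entirely of invalid characters, such as ';;;'.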


    HTH,

    Charles K. Clarkson
    --
    Head Bottle Washer,
    Clarkson Energy Homes, Inc.
    Mobile Home Specialists
    254 968-8328

    Charles K. Clarkson Guest

  9. #8

    Default Re: Matching invalid characters in a URL

    Kevin Zembower wrote:
    >
    > Thank you all for some first thoughts and clarifying questions.
    >
    > I'm trying to discard any URL with any character that is not an upper- or lower-case
    > letter, digit, or the characters $-_.+!*'(), . I realize that some other characters can be
    > used in special circumstances, but I don't have to allow for any of these in my program.
    >
    > I thought that my perl statement:
    > if ($url =~ /^[^A-Za-z0-9$-_.+!*'(),]+$/) { #if there are any invalid URL characters in the string
    > # Remember, special regex characters lose their meaning inside []
    > print "Invalid character in URL at line $.: $url\n";
    > next;
    > }
    > is saying:
    > if the variable $url contains any characters not in the set [A-Za-z0-9$-_.+!*'(),]+$/), print "Invalid ..."
    >
    > So, I think I need help in two areas; Do I have my logic backwards because I'm trying to match any
    > character in a variable, and, How do I write the match statement to do what I want.
    Hi Kevin.

    Take note of Charles' points, but also note that Perl is trying to expand
    the built-in variable $- into your regex. This is almost certainly zero
    unless you're using formats, so you're including the digit zero for a second
    time instead of dollar and dash.

    If you escape the dollar and code

    [A-Za-z0-9\$-_.+!*'(),]

    instead, then your class will include all characters from dollar up to
    underscore. So you need to escape both dollar and dash:

    if ($url =~ /[^A-Za-z0-9\$\-_.+!*'(),]/) {
        # Remember MOST special regex characters lose their meaning inside []
        print "Invalid character in URL at line $.: $url\n";
        next;
    }
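
    A rough sketch that puts the escaped, negated class into a helper and
    compares it with the unescaped version. The name has_invalid_char and
    the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $url contains any character outside
# A-Za-z0-9 and $-_.+!*'(), with "$" and "-" escaped as literals.
sub has_invalid_char {
    my ($url) = @_;
    return $url =~ /[^A-Za-z0-9\$\-_.+!*'(),]/ ? 1 : 0;
}

# Unescaped, "$-" is interpolated as the built-in variable $- before
# the regex is compiled, so "$" and "-" never make it into the class.
my $unescaped = qr/[^A-Za-z0-9$-_.+!*'(),]/;
print "unescaped pattern compiled as: $unescaped\n";

for my $s ('cost$5', 'a-b', 'a;b') {
    printf "%-8s escaped=%d unescaped=%d\n",
        $s, has_invalid_char($s), ($s =~ $unescaped ? 1 : 0);
}
```

    Printing the compiled pattern is a quick way to check what Perl actually
    interpolated into the class.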

    HTH,

    Rob


    Rob Dixon Guest
