Kevin Zembower #1
Matching invalid characters in a URL
I'm trying to throw out URLs with any invalid characters in them, like
'@'. According to [url]http://www.ietf.org/rfc/rfc1738.txt[/url] :
Thus, only alphanumerics, the special characters "$-_.+!*'(),", and
reserved characters used for their reserved purposes may be used
unencoded within a URL.
I'd like to throw out a URL like
'http://jncicancerspectrum.oupjournals.org/cgi/content/full/jnci;91/3/252'
(even though this one works perfectly fine. Go figure.). I've tried:
if ($url =~ /^[^A-Za-z0-9$-_.+!*'(),]+$/) { # if there are any invalid URL characters in the string
    # Remember, special regex characters lose their meaning inside []
    print "Invalid character in URL at line $.: $url\n";
    next;
}
According to my Camel, special regex characters are supposed to lose
their special functioning inside []. Yet, that obviously isn't true for
'-' used to separate the start and end of a range. I thought the fourth
'-' at '$-' was probably indicating a range, so I tried to escape it by
preceding it with a backslash or '\Q' but both gave strange errors about
uninitiated strings in concatenations.
Any suggestions? Thanks for your help and thoughts.
-Kevin Zembower
-----
E. Kevin Zembower
Unix Administrator
Johns Hopkins University/Center for Communications Programs
111 Market Place, Suite 310
Baltimore, MD 21202
410-659-6139
Kevin Zembower Guest
-
Dan Anderson #2
Re: Matching invalid characters in a URL
> Any suggestions? Thanks for your help and thoughts.
It is much easier to define the set all chars must be in than not. Use
the =! which is the complement of all characters matched by =~.
Alternatively, I believe there is a c option you can use.
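A minimal sketch of this "define the allowed set" idea, assuming the allowed set is exactly the alphanumerics plus $-_.+!*'(), from the RFC excerpt, and assuming the "c option" means the complement modifier of tr///. The sub names is_clean and count_bad are made up for illustration, and the sample strings are fragments rather than full URLs, because ':' and '/' are outside this set.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 when every character of the string is in
# the allowed set, 0 otherwise. The dollar sign and the dash are escaped so
# that neither variable interpolation nor a range is triggered.
sub is_clean {
    my ($str) = @_;
    return $str =~ /\A[A-Za-z0-9\$\-_.+!*'(),]+\z/ ? 1 : 0;
}

# Hypothetical helper: counts the characters that are NOT in the allowed set.
# tr/// never interpolates variables, so "$" and "_" are plain characters
# here, and a dash at the very end of the list is a literal dash.
sub count_bad {
    my ($str) = @_;
    return ($str =~ tr/A-Za-z0-9$_.+!*'(),-//c);
}

for my $str ('abc.def', 'jnci;91') {
    printf "%-8s clean=%d bad=%d\n", $str, is_clean($str), count_bad($str);
}
```

To reject a string with the negated binding operator instead, something like `next if $str !~ /\A[A-Za-z0-9\$\-_.+!*'(),]+\z/;` expresses the same test.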
-Dan
Dan Anderson Guest
-
Wiggins D Anconia #3
Re: Matching invalid characters in a URL
> > Any suggestions? Thanks for your help and thoughts.
> It is much easier to define the set all chars must be in than not. Use
> the =! which is the complement of all characters matched by =~.
> Alternatively, I believe there is a c option you can use.
>
> -Dan
That (I presume) should be !~ instead of != to complement =~ as opposed
to ==. When trying to include a dash in a character class (and not make
it a range), [], place it as the first character in the class; when
including a caret ^ do NOT place it as the first character (as that
negates the class).
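As a small illustration of the placement rules just described (the variable names are made up for this sketch):

```perl
use strict;
use warnings;

# A dash placed first in the class is a literal dash, not a range operator.
my $dash_or_digit = qr/[-0-9]/;

# A caret anywhere but the first position is a literal caret.
my $caret_or_digit = qr/[0-9^]/;

# A leading caret negates the class; the dash right after it is still literal.
my $not_dash_or_digit = qr/[^-0-9]/;

for my $char ('-', '^', '5', 'x') {
    printf "%s  dash_or_digit=%s  caret_or_digit=%s  not_dash_or_digit=%s\n",
        $char,
        ($char =~ $dash_or_digit     ? 'yes' : 'no'),
        ($char =~ $caret_or_digit    ? 'yes' : 'no'),
        ($char =~ $not_dash_or_digit ? 'yes' : 'no');
}
```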
The other problem I would see is in more complex URLs, for instance @
can be used to separate the authentication portion of a URL from the
rest, a colon can indicate a port, and your example where semi-colons
can be used to separate key/value pairs in the query string. You can
likely catch 99% of bad URLs, just depends on how important that other
1% is....
Good luck,
[url]http://danconia.org[/url]
Wiggins D Anconia Guest
-
Wiggins D Anconia #4
Re: Matching invalid characters in a URL
> I'm trying to throw out URLs with any invalid characters in them, like
> '@'. According to [url]http://www.ietf.org/rfc/rfc1738.txt[/url] :
> Thus, only alphanumerics, the special characters "$-_.+!*'(),", and
> reserved characters used for their reserved purposes may be used
> unencoded within a URL.
>
> I'd like to throw out a URL like
> 'http://jncicancerspectrum.oupjournals.org/cgi/content/full/jnci;91/3/252'
> (even though this one works perfectly fine. Go figure.). I've tried:
> if ($url =~ /^[^A-Za-z0-9$-_.+!*'(),]+$/) { # if there are any invalid URL characters in the string
>     # Remember, special regex characters lose their meaning inside []
>     print "Invalid character in URL at line $.: $url\n";
>     next;
> }
>
> According to my Camel, special regex characters are supposed to lose
> their special functioning inside []. Yet, that obviously isn't true for
> '-' used to separate the start and end of a range. I thought the fourth
> '-' at '$-' was probably indicating a range, so I tried to escape it by
> preceding it with a backslash or '\Q' but both gave strange errors about
> uninitiated strings in concatenations.
>
> Any suggestions? Thanks for your help and thoughts.
>
Did you mean to leave out those characters the RFC mentions are reserved
for some schemes,
"The characters ";", "/", "?", ":", "@", "=" and "&" are the characters
which may be reserved for special meaning within a scheme."
They should be in the class as well, since you are negating it right?
Just trying to understand completely so I don't throw you off with any
dumb remarks...
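If the reserved characters were meant to be accepted too, the negated class could be widened along these lines. This is only a sketch under that assumption; the sub name first_invalid is hypothetical, and the class still rejects everything else, including '%', so percent-encoded URLs would be flagged.

```perl
use strict;
use warnings;

# Negated class: alphanumerics, the RFC 1738 specials $-_.+!*'(), and the
# reserved characters ; / ? : @ = & quoted above. "$" and "@" are escaped to
# prevent variable interpolation, "-" to prevent a range. The m{} / qr{}
# delimiters avoid having to escape "/".
my $invalid_char = qr{[^A-Za-z0-9\$\-_.+!*'(),;/?:\@=&]};

# Hypothetical helper: returns the first invalid character, or undef if
# every character of the URL is in the allowed set.
sub first_invalid {
    my ($url) = @_;
    return $url =~ /($invalid_char)/ ? $1 : undef;
}

for my $url (
    'http://jncicancerspectrum.oupjournals.org/cgi/content/full/jnci;91/3/252',
    'http://example.com/a b',
) {
    my $bad = first_invalid($url);
    print defined $bad ? "Invalid character '$bad' in $url\n" : "OK: $url\n";
}
```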
[url]http://danconia.org[/url]
Wiggins D Anconia Guest
-
Dan Anderson #5
Re: Matching invalid characters in a URL
On Fri, 2004-01-09 at 16:54, Wiggins d Anconia wrote:
> > > Any suggestions? Thanks for your help and thoughts.
> > It is much easier to define the set all chars must be in than not. Use
> > the =! which is the complement of all characters matched by =~.
> > Alternatively, I believe there is a c option you can use.
> >
> > -Dan
> That (I presume) should be !~ instead of != to complement =~ as opposed
> to ==. When trying to include a dash in a character class (and not make
> it a range), [], place it as the first character in the class; when
> including a caret ^ do NOT place it as the first character (as that
> negates the class).
Ooops... My apologies. I was typing too quick and without much
caffeine. :-(
-Dan
Dan Anderson Guest
-
Kevin Zembower #6
Re: Matching invalid characters in a URL
Thank you all for some first thoughts and clarifying questions.
I'm trying to discard any URL with any character that is not an upper- or lower-case
letter, digit, or the characters $-_.+!*'(), . I realize that some other characters can be
used in special circumstances, but I don't have to allow for any of these in my program.
I thought that my perl statement:
if ($url =~ /^[^A-Za-z0-9$-_.+!*'(),]+$/) { # if there are any invalid URL characters in the string
    # Remember, special regex characters lose their meaning inside []
    print "Invalid character in URL at line $.: $url\n";
    next;
}
is saying:
if the variable $url contains any characters not in the set [A-Za-z0-9$-_.+!*'(),], print "Invalid ..."
So, I think I need help in two areas: do I have my logic backwards because I'm trying to match any
character in a variable, and how do I write the match statement to do what I want?
Thanks, again, for all your help and suggestions.
-Kevin
Kevin Zembower Guest
-
Charles K. Clarkson #7
RE: Matching invalid characters in a URL
KEVIN ZEMBOWER <KZEMBOWE@jhuccp.org> wrote:
:
: I'm trying to discard any URL with any character that is not
: an upper- or lower-case letter, digit, or the characters
: $-_.+!*'(), . I realize that some other characters can be
: used in special circumstances, but I don't have to allow for
: any of these in my program.
:
: I thought that my perl statement:
: if ($url =~ /^[^A-Za-z0-9$-_.+!*'(),]+$/) {
: print "Invalid character in URL at line $.: $url\n";
: next;
: }
: is saying:
: if the variable $url contains any characters not in the set
: [A-Za-z0-9$-_.+!*'(),]+$/), print "Invalid ..."
No. It is saying if ALL characters are invalid ...
Ignore the character class and look at the rest. There
is no room for a valid character:
/
    ^                           # start at the beginning of the string
    [^A-Za-z0-9$-_.+!*'(),]+
    $                           # end at the end of the string
/
Your anchors are dragging you down. You want to find the
first invalid character. After that it doesn't matter. This
should be fine.
/[^A-Za-z0-9$-_.+!*'(),]/
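The difference the anchors make can be seen with a short sketch. The dollar sign and dash are escaped here (an addition to the pattern above) so that the class means exactly what it lists; the variable names and sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Anchored: matches only if the WHOLE string consists of invalid characters.
my $all_invalid = qr/^[^A-Za-z0-9\$\-_.+!*'(),]+$/;

# Unanchored: matches as soon as ONE invalid character is found anywhere.
my $any_invalid = qr/[^A-Za-z0-9\$\-_.+!*'(),]/;

for my $str ('jnci;91', ';;;', 'abc') {
    printf "%-8s anchored=%s unanchored=%s\n",
        $str,
        ($str =~ $all_invalid ? 'match' : 'no match'),
        ($str =~ $any_invalid ? 'match' : 'no match');
}
```

Only the unanchored pattern flags 'jnci;91', because the letters and digits around the semicolon leave the anchored pattern nothing to match from start to end.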
HTH,
Charles K. Clarkson
--
Head Bottle Washer,
Clarkson Energy Homes, Inc.
Mobile Home Specialists
254 968-8328
Charles K. Clarkson Guest
-
Rob Dixon #8
Re: Matching invalid characters in a URL
Hi Kevin.

Kevin Zembower wrote:
>
> Thank you all for some first thoughts and clarifying questions.
>
> I'm trying to discard any URL with any character that is not an upper- or lower-case
> letter, digit, or the characters $-_.+!*'(), . I realize that some other characters can be
> used in special circumstances, but I don't have to allow for any of these in my program.
>
> I thought that my perl statement:
> if ($url =~ /^[^A-Za-z0-9$-_.+!*'(),]+$/) { #if there are any invalid URL characters in the string
> # Remember, special regex characters lose their meaning inside []
> print "Invalid character in URL at line $.: $url\n";
> next;
> }
> is saying:
> if the variable $url contains any characters not in the set [A-Za-z0-9$-_.+!*'(),]+$/), print "Invalid ..."
>
> So, I think I need help in two areas; Do I have my logic backwards because I'm trying to match any
> character in a variable, and, How do I write the match statement to do what I want.
Take note of Charles' points, but also note that Perl is trying to expand
the built-in variable $- into your regex. This is almost certainly zero
unless you're using formats, so you're including the digit zero for a second
time instead of dollar and dash.
If you escape the dollar and code
[A-Za-z0-9\$-_.+!*'(),]
instead, then your class will include all characters from dollar up to
underscore. So you need to escape both dollar and dash:
if ($url =~ /[^A-Za-z0-9\$\-_.+!*'(),]/) {
    # Remember MOST special regex characters lose their meaning inside []
    print "Invalid character in URL at line $.: $url\n";
    next;
}
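A quick way to see why the dash needs escaping as well: with only the dollar escaped, the class holds the range from "$" (0x24) through "_" (0x5F), which sweeps in characters such as ';' and '@'. The variable names below are made up for this sketch.

```perl
use strict;
use warnings;

# Only the dollar escaped: a RANGE from "$" (0x24) through "_" (0x5F).
my $range_class = qr/[\$-_]/;

# Dollar and dash both escaped: three literal characters "$", "-" and "_".
my $literal_class = qr/[\$\-_]/;

for my $char (';', '@', '-', 'a') {
    printf "%s  range=%s  literal=%s\n",
        $char,
        ($char =~ $range_class   ? 'match' : 'no match'),
        ($char =~ $literal_class ? 'match' : 'no match');
}
```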
HTH,
Rob
Rob Dixon Guest




