Chinadian #1
Change relative path to absolute in an HTML file
I need to download an HTML file and either use it locally or serve it
from another host as a mirror page. But the HTML has many relative paths
in href and src attributes. How can I change them all to absolute at once?
How do I use HTML::TokeParser to do the work? It seems to me that
HTML::LinkExtor can get the list of links, but cannot change them in
the HTML file.
I want to change
<a href="/1/1.html">1.html</a>
<a href=1/1.html>1.html</a>
to
<a href="http://1.com/1/1.html">1.html</a>
Any ideas or scripts?
Chinadian Guest
-
Gunnar Hjalmarsson #2
Re: Change relative path to absolute in an HTML file
Chinadian wrote:
> I need to download an html file and either use it locally or use it
> on another host as a mirror page. But the html has many relative
> path for href and src. How could I change them to absolute once?
> How do I use HTML:TokanPaser to do the work? It seems me to
> linkextor can get the list, but can not change links in the html
> file.
>
> I want to change
> <a href="/1/1.html">1.html</a>
> <a href=1/1.html>1.html</a>
> to
> <a href="http://1.com/1/1.html">1.html</a>
When it comes to relative, absolute, etc. URLs, the definitions at
http://www.perldoc.com/perl5.8.0/lib/CGI.html#OBTAINING-THE-SCRIPT'S-URL
may be useful. I would recommend that you study those definitions, and
then reconsider if you really want to change the links and, if so, why.
--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
Gunnar Hjalmarsson Guest
-
Tad McClellan #3
Re: Change relative path to absolute in an HTML file
Chinadian <chinadian@ma.2y.net> wrote:
> I want to change
><a href="/1/1.html">1.html</a>
><a href=1/1.html>1.html</a>
The 2nd one is not HTML, so you are on your own for that one.
> to
><a href="http://1.com/1/1.html">1.html</a>
s#href="#href="http://1.com#g;
Pattern matching won't work right on arbitrary HTML, only on
the HTML that you've shown us. These will break it for instance:
<a href='/1/1.html'>1.html</a>
<a href = "/1/1.html">1.html</a>
<!-- <a href="/1/1.html">1.html</a> -->
<a
href
=
"/1/1.html">1.html</a
>
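A minimal sketch of the point above, using only core Perl. The helper name naive_abs and the case labels are made up for illustration; the substitution itself is the one-liner from this post. It reports which of the variants the substitution touches at all.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: apply the one-line substitution to a copy of the
# input and return the result.
sub naive_abs {
    my ($html) = @_;
    $html =~ s#href="#href="http://1.com#g;
    return $html;
}

# The variants listed above, with made-up labels for the report.
my %case = (
    plain     => qq{<a href="/1/1.html">1.html</a>},
    single    => qq{<a href='/1/1.html'>1.html</a>},
    spaced    => qq{<a href = "/1/1.html">1.html</a>},
    comment   => qq{<!-- <a href="/1/1.html">1.html</a> -->},
    multiline => qq{<a\nhref\n=\n"/1/1.html">1.html</a\n>},
);

for my $name (sort keys %case) {
    my $out    = naive_abs($case{$name});
    my $status = $out eq $case{$name} ? 'unchanged' : 'rewritten';
    print "$name: $status\n";
}
```

The single-quoted, spaced and multi-line links come back unchanged, while the link inside the comment is rewritten even though it is not a live link.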
--
Tad McClellan SGML consulting
tadmc@augustmail.com Perl programming
Fort Worth, Texas
Tad McClellan Guest
-
Chinadian #4
Re: Change relative path to absolute in an HTML file
> Check out URI (http://search.cpan.org/author/GAAS/URI-1.25/URI.pm). The
> new_abs() method changes relative links to absolute, and ignores
> absolute links, something like this:
>
> #!/usr/bin/perl -w
> # rel2abs
> use strict;
> use warnings;
> use URI;
>
> print rel2abs($ARGV[0]);
>
> sub rel2abs {
> my $base = 'http://1.com';
> return URI->new_abs($_[0],$base);
> }
>
> then:
>
> rel2abs /1/1.html prints http://1.com/1/1.html
> rel2abs 1/1.html prints http://1.com/1/1.html
> rel2abs http://www.google.com prints http://www.google.com
>
> HTH - keith
this won't work, because it will replace all of the 1/1.html to
http://1.com/1/1.html, so all /1/1.html will be /http://1.com/1/1.html
in the html file.
here is what i wrote, but it is not working:
$key='abc.jpg'; $base='http://abc.com';
$old=q(<href="http://abc.com/ABC.jpg"> <href="/abc.jpg">
<href="http://www.abc.com/ABC.jpg"> <href=abc.jpg>);
$old =~ s/href=\s*\"?(?!http)\/?(?=.*)\"?/href=$base\//gsi;
print $old;
i should get 4 href=http://abc.com/abc.jpg, but i got this:
<href=http://abc.com/://abc.com/ABC.jpg"> <href="/abc.jpg">
<href=http://abc.com
/://www.abc.com/ABC.jpg"> <href=abc.jpg>
i think the problem is i want to look for href followed by no http,
but it matches the href=, then replace href= with href=$base.
another problem, how do i match the final " to delete the "?
Chinadian Guest
-
Chinadian #5
Re: Change relative path to absolute in an HTML file
tadmc@augustmail.com (Tad McClellan) wrote in message news:<slrnbkuhqr.2s8.tadmc@magna.augustmail.com>...
> Chinadian <chinadian@ma.2y.net> wrote:
> > I want to change
> ><a href="/1/1.html">1.html</a>
> ><a href=1/1.html>1.html</a>
>
> The 2nd one is not HTML, so you are on your own for that one.
>
> > to
> ><a href="http://1.com/1/1.html">1.html</a>
>
> s#href="#href="http://1.com#g;
>
yours is not working, it will change
<a href="http://1.com/1/1.html">1.html</a>
to
<a href="http://1.comhttp://1.com/1/1.html">1.html</a>
Chinadian Guest
-
Tad McClellan #6
Re: Change relative path to absolute in an HTML file
Chinadian <chinadian@ma.2y.net> wrote:
> tadmc@augustmail.com (Tad McClellan) wrote in message news:<slrnbkuhqr.2s8.tadmc@magna.augustmail.com>...
>> Chinadian <chinadian@ma.2y.net> wrote:
>> > I want to change
>> ><a href="/1/1.html">1.html</a>
>> > to
>> ><a href="http://1.com/1/1.html">1.html</a>
>>
>> s#href="#href="http://1.com#g;
>>
> yours is not working, it will change
><a href="http://1.com/1/1.html">1.html</a>
> to
><a href="http://1.comhttp://1.com/1/1.html">1.html</a>
Well yes, because you did not say that your data contained that,
and I failed to correctly read your mind.
You showed us data, my code works on the data you showed us.
If you change the question, you can expect that the answer
will need to change too.
So ask the complete question the first time.
( Regexes are not powerful enough to handle arbitrary HTML,
for that you'd need a real parser.
)
--
Tad McClellan SGML consulting
tadmc@augustmail.com Perl programming
Fort Worth, Texas
Tad McClellan Guest
-
ko #7
Re: Change relative path to absolute in an HTML file
Chinadian wrote:
> this won't work, because it will replace all of the 1/1.html to
> http://1.com/1/1.html, so all /1/1.html will be /http://1.com/1/1.html
> in the html file.
The posted code works, did you read the URI docs?
$uri = URI->new_abs( $str, $base_uri )
This constructs a new absolute URI object. The $str argument can
denote a relative or absolute URI. If relative, then it will be
absolutized using $base_uri as base. The $base_uri must be an
absolute URI.
So whether you pass '1/1.html' or '/1/1.html' as the first argument,
you get the same thing - 'http://1.com/1/1.html'. new_abs() *does not*
parse HTML for you. You need to extract the links using one of the HTML
parsers.
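To make the resolution rule concrete, here is a small sketch, assuming the URI module from CPAN is installed. The helper absolutize and the second base, http://1.com/dir/page.html, are made up for illustration; only URI->new_abs and as_string come from the module. It shows that a path-relative link is resolved against the directory of the base, so '1/1.html' and '/1/1.html' coincide only because the base here sits at the site root.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

# Hypothetical helper: resolve $link against $base, return a plain string.
sub absolutize {
    my ($link, $base) = @_;
    return URI->new_abs($link, $base)->as_string;
}

my @links = ('/1/1.html', '1/1.html', 'http://www.google.com/');

# Two assumed bases: the site root, and a page one directory down.
for my $base ('http://1.com/', 'http://1.com/dir/page.html') {
    print "base $base\n";
    printf "  %-24s => %s\n", $_, absolutize($_, $base) for @links;
}
```

With the second base, '1/1.html' resolves under /dir/ while '/1/1.html' still resolves from the root, so the base should be the URL the page was actually fetched from.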
> here is what i wrote, but it is not working:
>
> $key='abc.jpg'; $base='http://abc.com';
> $old=q(<href="http://abc.com/ABC.jpg"> <href="/abc.jpg">
> <href="http://www.abc.com/ABC.jpg"> <href=abc.jpg>);
>
> $old =~ s/href=\s*\"?(?!http)\/?(?=.*)\"?/href=$base\//gsi;
> print $old;
>
>
> i should get 4 href=http://abc.com/abc.jpg, but i got this:
> <href=http://abc.com/://abc.com/ABC.jpg"> <href="/abc.jpg">
> <href=http://abc.com
> /://www.abc.com/ABC.jpg"> <href=abc.jpg>
>
> i think the problem is i want to look for href followed by no http,
> but it matches the href=, then replace href= with href=$base.
>
> another problem, how do i match the final " to delete the "?
A couple of problems here:
1. $old isn't HTML.
2. As Tad pointed out twice, pattern matching won't work on arbitrary
HTML. Use a parser:
==CODE==
#!/usr/bin/perl -w
use strict;
use HTML::TreeBuilder;
use URI;
my $base_uri = 'http://1.com';
my $test_string=<<_TS_;
<a href="http://1.com/1/1.html"></a>
<a href="/1/1.html"></a>
<a href="1/1.html"></a>
<a href="http://www.google.com"></a>
_TS_
my $root = HTML::TreeBuilder->new();
my $html = $root->parse($test_string);
my @a = $html->look_down('_tag','a');
foreach (@a) {
my $str = $_->attr('href');
my $abs_uri = URI->new_abs($str,$base_uri);
$_->attr('href',$abs_uri);
}
print $_->starttag, "\n" foreach (@a);
==RESULTS==
<a href="http://1.com/1/1.html">
<a href="http://1.com/1/1.html">
<a href="http://1.com/1/1.html">
<a href="http://www.google.com">
Notice that links to outside domains are kept intact. Look at the
HTML::TreeBuilder documentation and HTML::Element (look_down() and
attr() methods). The code only extracts 'href' from A tags, so you'll
have to modify to extract IMG and others.
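One way that modification might look, as a rough sketch rather than a complete solution. It assumes HTML::TreeBuilder and URI are installed; the helper name absolutize_html and the tag-to-attribute table are made up for illustration, and the table is certainly not exhaustive.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;
use URI;

# Assumption: which attribute carries a URL for which tag. Extend as needed.
my %url_attr = (a => 'href', link => 'href', img => 'src', script => 'src');

# Hypothetical helper: return $html with the listed attributes made absolute.
sub absolutize_html {
    my ($html, $base) = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->parse($html);
    $tree->eof;

    # look_down() accepts a code ref; it is called with each element.
    for my $el ($tree->look_down(sub { exists $url_attr{ $_[0]->tag } })) {
        my $attr = $url_attr{ $el->tag };
        my $val  = $el->attr($attr);
        next unless defined $val;    # e.g. <a name="..."> has no href
        $el->attr($attr, URI->new_abs($val, $base)->as_string);
    }

    my $out = $tree->as_HTML;
    $tree->delete;                   # free the tree
    return $out;
}

print absolutize_html(<<'HTML', 'http://1.com/'), "\n";
<a href="/1/1.html">one</a>
<img src="pics/logo.png" alt="logo">
<a href="http://www.google.com/">elsewhere</a>
HTML
```

Note that as_HTML() serializes the whole tree, so the output is wrapped in the html, head and body elements the parser implies, and the markup is normalized rather than preserved byte for byte.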
ko Guest
-
Chinadian #8
Re: Change relative path to absolute in an HTML file
why do you say regular exp does not work with complicated html? my RE
works perfectly changing rel to abs now, here is the code. tell me if
you can find a case it does not work:
where $url is the base
$htmlcode =~ s/href=\s*\"?\/?(?!\s*\"?(http\:\/\/|mailto))/href=\"$url\//gsi;
$htmlcode =~ s/(href=\"[^\>\s\'\"]+)\"?/$+\"/gsi;
Chinadian Guest
-
ko #9
Re: Change relative path to absolute in an HTML file
Chinadian wrote:
> why do you say regular exp does not work with complicated html? my RE
> works perfectly changing rel to abs now, here is the code. tell me if
> you can find a case it does not work:
>
> where $url is the base
>
> $htmlcode =~ s/href=\s*\"?\/?(?!\s*\"?(http\:\/\/|mailto))/href=\"$url\//gsi;
> $htmlcode =~ s/(href=\"[^\>\s\'\"]+)\"?/$+\"/gsi;
Not necessary for me to find a case where it doesn't work. See Tad's
first post and try the regexp on all of the examples he gave.
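For anyone who wants to run that check, a quick sketch in core Perl. The wrapper name regex_abs is made up; the two substitutions are copied from the quoted post, and the inputs are the examples from Tad's first post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical wrapper around the two substitutions quoted above.
sub regex_abs {
    my ($htmlcode, $url) = @_;
    $htmlcode =~ s/href=\s*\"?\/?(?!\s*\"?(http\:\/\/|mailto))/href=\"$url\//gsi;
    $htmlcode =~ s/(href=\"[^\>\s\'\"]+)\"?/$+\"/gsi;
    return $htmlcode;
}

# The examples from Tad's first post.
my @cases = (
    qq{<a href='/1/1.html'>1.html</a>},
    qq{<a href = "/1/1.html">1.html</a>},
    qq{<!-- <a href="/1/1.html">1.html</a> -->},
    qq{<a\nhref\n=\n"/1/1.html">1.html</a\n>},
);

for my $in (@cases) {
    print "in:  $in\nout: ", regex_abs($in, 'http://1.com'), "\n\n";
}
```

The spaced and multi-line links come back untouched, the commented-out link is rewritten, and the single-quoted link ends up with mismatched quotes.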
ko Guest
-
Tad McClellan #10
Re: Change relative path to absolute in an HTML file
Chinadian <chinadian@ma.2y.net> wrote:
> why do you
Who "you"?
> say regular exp does not work with complicated html?
Because regular exp does not work with complicated html.
> my RE
> works perfectly changing rel to abs now,
Then the HTML that you've tried it with is not sufficiently complicated.
Try it with a more complete test suite like the ones shown in
the Perl FAQ.
> here is the code. tell me if
> you can find a case it does not work:
I already showed you several cases where it will not work.
Fix all of those, and post your new code.
Then we'll point out some more cases to go handle.
Fix all of those, and post your new code.
Then we'll point out some more cases to go handle...
Lather. Rinse. Repeat.
We will be able to find deficiencies faster than you can fix them.
> $htmlcode =~ s/(href=\"[^\>\s\'\"]+)\"?/$+\"/gsi;
This code tells me something. It tells me that you don't really
know Perl's regexes very well.
Angle brackets are not special in regexes, they do not need backslashing.
Double quotes are not special in regexes, they do not need backslashing.
Single quotes are not special in regexes, they do not need backslashing.
Double quotes are not special in strings, they do not need
backslashing in the replacement string either.
The /s option changes the meaning of dot, but you don't even have
a dot in your pattern, /s doesn't do anything. Why is it there if
it does not do anything?
$htmlcode =~ s/(href="[^>\s'"]+)"?/$+"/gi; # does the same thing
(but both things are incorrect.)
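A small core-Perl check of the "does the same thing" claim. The helper names and sample strings are made up for illustration; the two substitutions are the ones shown above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helpers wrapping the two spellings of the substitution.
sub with_backslashes {
    my ($s) = @_;
    $s =~ s/(href=\"[^\>\s\'\"]+)\"?/$+\"/gsi;
    return $s;
}

sub without_backslashes {
    my ($s) = @_;
    $s =~ s/(href="[^>\s'"]+)"?/$+"/gi;
    return $s;
}

# Made-up samples: a missing closing quote, a well-formed link,
# and a single-quoted link that neither version matches.
my @samples = (
    qq{<a href="/1/1.html>1.html</a>},
    qq{<a href="/1/1.html">1.html</a>},
    qq{<a href='/1/1.html'>1.html</a>},
);

for my $s (@samples) {
    my ($x, $y) = (with_backslashes($s), without_backslashes($s));
    print(($x eq $y ? 'same' : 'DIFFERENT'), ": $x\n");
}
```

Both spellings give identical results on every sample, including leaving the single-quoted link alone.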
--
Tad McClellan SGML consulting
tadmc@augustmail.com Perl programming
Fort Worth, Texas
Tad McClellan Guest




