Change relative path to absolute in an HTML file


  1. #1

    Change relative path to absolute in an HTML file

    I need to download an HTML file and either use it locally or use it on
    another host as a mirror page. But the HTML has many relative paths in
    href and src attributes. How can I change them all to absolute in one
    go? How do I use HTML::TokeParser to do the work? It seems to me that
    HTML::LinkExtor can get the list of links, but cannot change the links
    in the HTML file.

    I want to change
    <a href="/1/1.html">1.html</a>
    <a href=1/1.html>1.html</a>
    to
    <a href="http://1.com/1/1.html">1.html</a>


    Any ideas or scripts?
    Chinadian Guest

  3. #2

    Re: Change relative path to absolute in an HTML file

    Chinadian wrote:
    > I need to download an HTML file and either use it locally or use it
    > on another host as a mirror page. But the HTML has many relative
    > paths in href and src attributes. How can I change them all to
    > absolute in one go? How do I use HTML::TokeParser to do the work?
    > It seems to me that HTML::LinkExtor can get the list of links, but
    > cannot change the links in the HTML file.
    >
    > I want to change
    > <a href="/1/1.html">1.html</a>
    > <a href=1/1.html>1.html</a>
    > to
    > <a href="http://1.com/1/1.html">1.html</a>
    When it comes to relative, absolute, etc. URLs, the definitions at
    http://www.perldoc.com/perl5.8.0/lib/CGI.html#OBTAINING-THE-SCRIPT'S-URL
    may be useful. I would recommend that you study those definitions, and
    then reconsider if you really want to change the links and, if so, why.

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl

    Gunnar Hjalmarsson Guest

  4. #3

    Re: Change relative path to absolute in an HTML file

    Chinadian <chinadian@ma.2y.net> wrote:
    > I want to change
    ><a href="/1/1.html">1.html</a>
    ><a href=1/1.html>1.html</a>

    The 2nd one is not HTML, so you are on your own for that one.

    > to
    ><a href="http://1.com/1/1.html">1.html</a>

    s#href="#href="http://1.com#g;


    Pattern matching won't work right on arbitrary HTML, only on
    the HTML that you've shown us. These will break it for instance:

    <a href='/1/1.html'>1.html</a>

    <a href = "/1/1.html">1.html</a>

    <!-- <a href="/1/1.html">1.html</a> -->

    <a
    href
    =
    "/1/1.html"
    >1.html</a
    >

    --
    Tad McClellan SGML consulting
    tadmc@augustmail.com Perl programming
    Fort Worth, Texas
    Tad McClellan Guest

  5. #4

    Re: Change relative path to absolute in an HTML file

    > Check out URI (http://search.cpan.org/author/GAAS/URI-1.25/URI.pm). The
    > new_abs() method changes relative links to absolute, and ignores
    > absolute links, something like this:
    >
    > #!/usr/bin/perl -w
    > # rel2abs
    > use strict;
    > use warnings;
    > use URI;
    >
    > print rel2abs($ARGV[0]);
    >
    > sub rel2abs {
    > my $base = 'http://1.com';
    > return URI->new_abs($_[0],$base);
    > }
    >
    > then:
    >
    > rel2abs /1/1.html prints http://1.com/1/1.html
    > rel2abs 1/1.html prints http://1.com/1/1.html
    > rel2abs http://www.google.com prints http://www.google.com
    >
    > HTH - keith
    This won't work, because it will replace every 1/1.html with
    http://1.com/1/1.html, so every /1/1.html will become
    /http://1.com/1/1.html in the HTML file.

    here is what i wrote, but it is not working:

    $key='abc.jpg'; $base='http://abc.com';
    $old=q(<href="http://abc.com/ABC.jpg"> <href="/abc.jpg">
    <href="http://www.abc.com/ABC.jpg"> <href=abc.jpg>);

    $old =~ s/href=\s*\"?(?!http)\/?(?=.*)\"?/href=$base\//gsi;
    print $old;


    i should get 4 href=http://abc.com/abc.jpg, but i got this:
    <href=http://abc.com/://abc.com/ABC.jpg"> <href="/abc.jpg">
    <href=http://abc.com
    /://www.abc.com/ABC.jpg"> <href=abc.jpg>

    i think the problem is i want to look for href followed by no http,
    but it matches the href=, then replace href= with href=$base.

    another problem, how do i match the final " to delete the "?
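
    The symptom described above looks like regex backtracking: because the
    quote in front of the lookahead is optional, the engine can let it match
    nothing and then test (?!http) against '"http...', which passes. A
    minimal sketch with a simplified pattern (not the exact one above; the
    [BASE] marker and the sample input are made up for illustration):

    ```perl
    use strict;
    use warnings;

    my $input = '<a href="http://abc.com/ABC.jpg"> <a href="/abc.jpg">';

    # Optional quote BEFORE the lookahead: the engine backtracks, lets "?
    # match nothing, and (?!http) is then tested against '"http...'.
    (my $loose = $input) =~ s/href=\s*"?(?!http)/href=[BASE]/gi;

    # Lookahead placed before the optional quote and allowing for it, so an
    # absolute link is rejected before the quote is consumed.
    (my $strict = $input) =~ s/href=\s*(?!"?http)"?/href=[BASE]/gi;

    print "$loose\n$strict\n";
    ```

    In the first result both links are rewritten, including the one that was
    already absolute; in the second only the relative link is touched. This
    only explains the lookahead behaviour; it does not make the pattern safe
    for arbitrary HTML.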
    Chinadian Guest

  6. #5

    Re: Change relative path to absolute in an HTML file

    tadmc@augustmail.com (Tad McClellan) wrote in message news:<slrnbkuhqr.2s8.tadmc@magna.augustmail.com>...
    > Chinadian <chinadian@ma.2y.net> wrote:
    >
    > > I want to change
    > ><a href="/1/1.html">1.html</a>
    > ><a href=1/1.html>1.html</a>
    >
    >
    > The 2nd one is not HTML, so you are on your own for that one.
    >
    >
    > > to
    > ><a href="http://1.com/1/1.html">1.html</a>
    >
    >
    > s#href="#href="http://1.com#g;
    >
    yours is not working, it will change
    <a href="http://1.com/1/1.html">1.html</a>
    to
    <a href="http://1.comhttp://1.com/1/1.html">1.html</a>
    Chinadian Guest

  7. #6

    Re: Change relative path to absolute in an HTML file

    Chinadian <chinadian@ma.2y.net> wrote:
    > tadmc@augustmail.com (Tad McClellan) wrote in message news:<slrnbkuhqr.2s8.tadmc@magna.augustmail.com>...
    >> Chinadian <chinadian@ma.2y.net> wrote:
    >>
    >> > I want to change
    >> ><a href="/1/1.html">1.html</a>
    >> > to
    >> ><a href="http://1.com/1/1.html">1.html</a>
    >>
    >>
    >> s#href="#href="http://1.com#g;
    >>
    >
    > yours is not working, it will change
    ><a href="http://1.com/1/1.html">1.html</a>
    > to
    ><a href="http://1.comhttp://1.com/1/1.html">1.html</a>

    Well yes, because you did not say that your data contained that,
    and I failed to correctly read your mind.

    You showed us data, my code works on the data you showed us.

    If you change the question, you can expect that the answer
    will need to change too.

    So ask the complete question the first time.


    ( Regexes are not powerful enough to handle arbitrary HTML,
    for that you'd need a real parser.
    )
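
    As a rough illustration of the parser route, here is a minimal sketch
    built on HTML::TokeParser, assuming the HTML::Parser and URI
    distributions are installed. The helper name absolutize_html and the
    choice to rewrite only href and src are assumptions made for this
    example. Every token is copied through as it appeared in the source,
    except start tags carrying href or src, which are rebuilt:

    ```perl
    use strict;
    use warnings;
    use HTML::TokeParser;
    use HTML::Entities qw(encode_entities);
    use URI;

    # Hypothetical helper: returns $html with href/src resolved against $base.
    sub absolutize_html {
        my ($html, $base) = @_;
        my $parser = HTML::TokeParser->new(\$html) or die "cannot create parser";
        my $out = '';

        while (my $token = $parser->get_token) {
            my $type = $token->[0];
            if ($type eq 'S') {
                my ($tag, $attr, $attrseq, $text) = @{$token}[1 .. 4];
                if (grep { $_ eq 'href' || $_ eq 'src' } @$attrseq) {
                    # Rebuild the start tag, keeping the attribute order.
                    $out .= "<$tag";
                    for my $name (grep { $_ ne '/' } @$attrseq) {
                        my $value = $attr->{$name};
                        $value = URI->new_abs($value, $base)->as_string
                            if $name eq 'href' || $name eq 'src';
                        $out .= sprintf ' %s="%s"', $name,
                            encode_entities($value, '<>&"');
                    }
                    $out .= exists $attr->{'/'} ? ' />' : '>';
                }
                else {
                    $out .= $text;      # start tag without links: original text
                }
            }
            elsif ($type eq 'E' || $type eq 'PI') {
                $out .= $token->[2];    # original text of end tag or PI
            }
            else {
                $out .= $token->[1];    # text, comment, or declaration
            }
        }
        return $out;
    }

    print absolutize_html(
        q{<a href='/1/1.html'>1</a> <!-- <a href="/x">x</a> -->},
        'http://1.com'
    ), "\n";
    ```

    Because comments arrive as their own token type, a commented-out link is
    passed through untouched, and quoting style or whitespace inside the tag
    no longer matters. Rebuilt tags come out with lower-case names and
    double-quoted attributes, which may or may not be acceptable for a
    mirror.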


    --
    Tad McClellan SGML consulting
    tadmc@augustmail.com Perl programming
    Fort Worth, Texas
    Tad McClellan Guest

  8. #7

    Re: Change relative path to absolute in an HTML file

    Chinadian wrote:
    > This won't work, because it will replace every 1/1.html with
    > http://1.com/1/1.html, so every /1/1.html will become
    > /http://1.com/1/1.html in the HTML file.
    The posted code works, did you read the URI docs?

    $uri = URI->new_abs( $str, $base_uri )
    This constructs a new absolute URI object. The $str argument can
    denote a relative or absolute URI. If relative, then it will be
    absolutized using $base_uri as base. The $base_uri must be an
    absolute URI.

    So whether you pass '1/1.html' or '/1/1.html' as the first argument,
    you get the same thing with this base: 'http://1.com/1/1.html'.
    new_abs() *does not* parse HTML for you. You need to extract the links
    using one of the HTML parsers.
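
    One caveat worth illustrating: the two forms resolve to the same URL
    only because the base 'http://1.com' has no path. A minimal sketch,
    assuming the URI module is installed; the second base and the resolve
    helper are made up for this example:

    ```perl
    use strict;
    use warnings;
    use URI;

    # Hypothetical helper: resolve $link against $base, return a plain string.
    sub resolve {
        my ($link, $base) = @_;
        return URI->new_abs($link, $base)->as_string;
    }

    # The second base is illustrative; it has a path component.
    for my $base ('http://1.com', 'http://1.com/dir/page.html') {
        for my $link ('/1/1.html', '1/1.html', 'http://www.google.com') {
            printf "%-27s + %-21s => %s\n", $base, $link, resolve($link, $base);
        }
    }
    ```

    With the second base, '1/1.html' resolves relative to /dir/, while
    '/1/1.html' still resolves from the site root, so the base passed to
    new_abs() should be the URL the page was actually fetched from.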
    > here is what i wrote, but it is not working:
    >
    > $key='abc.jpg'; $base='http://abc.com';
    > $old=q(<href="http://abc.com/ABC.jpg"> <href="/abc.jpg">
    > <href="http://www.abc.com/ABC.jpg"> <href=abc.jpg>);
    >
    > $old =~ s/href=\s*\"?(?!http)\/?(?=.*)\"?/href=$base\//gsi;
    > print $old;
    >
    >
    > i should get 4 href=http://abc.com/abc.jpg, but i got this:
    > <href=http://abc.com/://abc.com/ABC.jpg"> <href="/abc.jpg">
    > <href=http://abc.com
    > /://www.abc.com/ABC.jpg"> <href=abc.jpg>
    >
    > i think the problem is i want to look for href followed by no http,
    > but it matches the href=, then replace href= with href=$base.
    >
    > another problem, how do i match the final " to delete the "?
    A couple of problems here:

    1. $old isn't HTML.
    2. As Tad pointed out twice, pattern matching won't work on arbitrary
    HTML. Use a parser:

    ==CODE==
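    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TreeBuilder;
    use URI;

    my $base_uri = 'http://1.com';

    my $test_string = <<'_TS_';
    <a href="http://1.com/1/1.html"></a>
    <a href="/1/1.html"></a>
    <a href="1/1.html"></a>
    <a href="http://www.google.com"></a>
    _TS_

    # Parse the test string into a tree; eof() signals the end of the input.
    my $root = HTML::TreeBuilder->new();
    $root->parse($test_string);
    $root->eof();

    # Collect every A element in document order.
    my @a = $root->look_down('_tag', 'a');

    foreach my $link (@a) {
        my $str = $link->attr('href');
        next unless defined $str;    # skip anchors without an href
        # new_abs() resolves relative URLs against the base and leaves
        # absolute ones alone.
        $link->attr('href', URI->new_abs($str, $base_uri)->as_string);
    }

    print $_->starttag, "\n" foreach @a;

    $root->delete;    # free the tree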

    ==RESULTS==
    <a href="http://1.com/1/1.html">
    <a href="http://1.com/1/1.html">
    <a href="http://1.com/1/1.html">
    <a href="http://www.google.com">

    Notice that links to outside domains are kept intact. Look at the
    HTML::TreeBuilder documentation and HTML::Element (look_down() and
    attr() methods). The code only extracts 'href' from A tags, so you'll
    have to modify to extract IMG and others.
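
    One way to cover IMG and the other link-bearing tags without listing
    them by hand is HTML::Element's extract_links() method. A minimal
    sketch, assuming HTML::TreeBuilder and URI are installed; the sample
    markup is made up for illustration:

    ```perl
    use strict;
    use warnings;
    use HTML::TreeBuilder;
    use URI;

    my $base = 'http://1.com';    # illustrative base
    my $html = join "\n",
        '<a href="/1/1.html">one</a>',
        '<img src="img/logo.png">',
        '<a href="http://www.google.com">two</a>';

    my $tree = HTML::TreeBuilder->new_from_content($html);

    # extract_links() returns an array ref of [$url, $element, $attr, $tag]
    # entries, one per link attribute found in the tree.
    for my $link (@{ $tree->extract_links }) {
        my ($url, $element, $attr) = @$link;
        $element->attr($attr, URI->new_abs($url, $base)->as_string);
    }

    # Collect the rewritten URLs, then print the whole document.
    my @rewritten = map { $_->[0] } @{ $tree->extract_links };
    print "$_\n" for @rewritten;
    print $tree->as_HTML, "\n";

    $tree->delete;    # free the tree
    ```

    Note that as_HTML() serialises the parsed tree, so the output is a
    normalised document (with html, head and body elements added) rather
    than a byte-for-byte copy of the input with only the links changed.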


    ko Guest

  9. #8

    Re: Change relative path to absolute in an HTML file

    why do you say regular exp does not work with complicated html? my RE
    works perfectly changing rel to abs now, here is the code. tell me if
    you can find a case it does not work:

    where $url is the base

    $htmlcode =~ s/href=\s*\"?\/?(?!\s*\"?(http\:\/\/|mailto))/href=\"$url\//gsi;
    $htmlcode =~ s/(href=\"[^\>\s\'\"]+)\"?/$+\"/gsi;
    Chinadian Guest

  10. #9

    Re: Change relative path to absolute in an HTML file

    Chinadian wrote:
    > why do you say regular exp does not work with complicated html? my RE
    > works perfectly changing rel to abs now, here is the code. tell me if
    > you can find a case it does not work:
    >
    > where $url is the base
    >
    > $htmlcode =~ s/href=\s*\"?\/?(?!\s*\"?(http\:\/\/|mailto))/href=\"$url\//gsi;
    > $htmlcode =~ s/(href=\"[^\>\s\'\"]+)\"?/$+\"/gsi;
    Not necessary for me to find a case where it doesn't work. See Tad's
    first post and try the regexp on all of the examples he gave.
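
    For anyone who wants to run that check, here is a minimal sketch that
    applies the two quoted substitutions to three of Tad's examples plus one
    extra input. The wrapper name regex_rewrite, the base value and the
    https://example.com/ link are assumptions added for illustration:

    ```perl
    use strict;
    use warnings;

    my $url = 'http://1.com';    # illustrative base

    # Hypothetical wrapper around the two substitutions quoted above.
    sub regex_rewrite {
        my ($htmlcode) = @_;
        $htmlcode =~ s/href=\s*\"?\/?(?!\s*\"?(http\:\/\/|mailto))/href=\"$url\//gsi;
        $htmlcode =~ s/(href=\"[^\>\s\'\"]+)\"?/$+\"/gsi;
        return $htmlcode;
    }

    my @cases = (
        q{<a href='/1/1.html'>1.html</a>},             # single quotes
        q{<a href = "/1/1.html">1.html</a>},           # spaces around =
        q{<!-- <a href="/1/1.html">1.html</a> -->},    # commented-out link
        q{<a href="https://example.com/">x</a>},       # https, not excluded
    );

    printf "%s\n  => %s\n", $_, regex_rewrite($_) for @cases;
    ```

    With these inputs the single-quoted link comes out malformed, the link
    with spaces around = stays relative, the commented-out link is
    rewritten, and the https link gets the base prepended.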

    ko Guest

  11. #10

    Re: Change relative path to absolute in an HTML file

    Chinadian <chinadian@ma.2y.net> wrote:
    > why do you

    Who "you"?

    > say regular exp does not work with complicated html?

    Because regular exp does not work with complicated html.

    > my RE
    > works perfectly changing rel to abs now,

    Then the HTML that you've tried it with is not sufficiently complicated.

    Try it with a more complete test suite like the ones shown in
    the Perl FAQ.

    > here is the code. tell me if
    > you can find a case it does not work:

    I already showed you several cases where it will not work.

    Fix all of those, and post your new code.

    Then we'll point out some more cases to go handle.

    Fix all of those, and post your new code.

    Then we'll point out some more cases to go handle...

    Lather. Rinse. Repeat.

    We will be able to find deficiencies faster than you can fix them.

    > $htmlcode =~ s/(href=\"[^\>\s\'\"]+)\"?/$+\"/gsi;

    This code tells me something. It tells me that you don't really
    know Perl's regexes very well.

    Angle brackets are not special in regexes, they do not need backslashing.

    Double quotes are not special in regexes, they do not need backslashing.

    Single quotes are not special in regexes, they do not need backslashing.

    Double quotes are not special in the replacement part of s/// either,
    so they do not need backslashing there.

    The /s option changes the meaning of dot, but you don't even have
    a dot in your pattern, /s doesn't do anything. Why is it there if
    it does not do anything?


    $htmlcode =~ s/(href="[^>\s'"]+)"?/$+"/gi; # does the same thing

    (but both things are incorrect.)
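
    The equivalence claim is easy to check. A minimal sketch that runs both
    forms of the substitution over a few made-up inputs and compares the
    results:

    ```perl
    use strict;
    use warnings;

    # Made-up inputs for illustration.
    my @samples = (
        '<a href="http://1.com/1/1.html>1.html</a>',     # missing closing quote
        '<a href="http://1.com/1/1.html">1.html</a>',    # already quoted
        '<A HREF="x.html>x</A>',                         # upper case, via /i
    );

    my @results;
    for my $sample (@samples) {
        # Original form, with the unnecessary backslashes and /s.
        (my $escaped = $sample) =~ s/(href=\"[^\>\s\'\"]+)\"?/$+\"/gsi;
        # Simplified form.
        (my $plain = $sample) =~ s/(href="[^>\s'"]+)"?/$+"/gi;

        push @results, [$escaped, $plain];
        my $verdict = $escaped eq $plain ? 'same:' : 'differ:';
        print "$verdict $plain\n";
    }
    ```

    Both forms add the missing closing quote and otherwise leave the samples
    alone, which says nothing about whether either is correct on real HTML.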


    --
    Tad McClellan SGML consulting
    tadmc@augustmail.com Perl programming
    Fort Worth, Texas
    Tad McClellan Guest
