parsing HTML - PERL Beginners

  1. #1

    parsing HTML

    I am trying to build an HTML editor for use with my HTML::Mason site. I intend
    for it to support nested tables, SPANs, and anchors. I am looking for a module
    that can help me parse existing HTML (custom or generated by my scripts) into a
    tree structure similar to:

    my $html = [ { tag => 'table', id => 'maintable', width => 300, content =>
        [ { tag => 'tr', content =>
            [
                { tag => 'td', width => 200, content => "some content" },
                { tag => 'td', width => 100, content => "more content" }
            ]
        } ]
    } ]; # Not tested, but you get the idea

    which would correspond to the following HTML:

    <table id="maintable" width="300">
    <tr>
    <td width="200">some content</td>
    <td width="100">more content</td>
    </tr>
    </table>

    Once I have the data in the tree, I can easily modify it and transform it back
    into HTML. Is there a module that can help make this easier or should I go about
    this differently?
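
For illustration only, here is a minimal sketch of the "transform it back into HTML" step for a structure shaped like the one above. The helper name tree_to_html is hypothetical and not from any module. It assumes that every key other than 'tag' and 'content' is an attribute, and that 'content' is either a plain string or an array ref of nodes and strings. It does no escaping and does not special-case empty elements such as <br>.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not part of any module):
# serialise a list of nodes back into an HTML string.
sub tree_to_html {
    my ($nodes) = @_;
    my $html = '';
    foreach my $node (@$nodes) {
        if (!ref($node)) {              # plain text between tags
            $html .= $node;
            next;
        }
        # Every key except 'tag' and 'content' is treated as an attribute.
        my $attrs = join '',
            map  { qq( $_="$node->{$_}") }
            grep { $_ ne 'tag' && $_ ne 'content' }
            sort keys %$node;
        my $content = $node->{content};
        my $inner = ref($content)     ? tree_to_html($content)
                  : defined($content) ? $content
                  :                     '';
        $html .= "<$node->{tag}$attrs>$inner</$node->{tag}>";
    }
    return $html;
}

my $html = [ { tag => 'table', id => 'maintable', width => 300, content => [
    { tag => 'tr', content => [
        { tag => 'td', width => 200, content => "some content" },
        { tag => 'td', width => 100, content => "more content" },
    ] },
] } ];

print tree_to_html($html), "\n";
```

Attributes are emitted in sorted key order because a Perl hash does not remember insertion order. If the original attribute order matters, the structure would need to store it separately.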

    --
    Andrew Gaffney
    Network Administrator
    Skyline Aeronautics, LLC.
    636-357-1548

    Andrew Guest

  2. #2

    Re: parsing HTML

    On 7/21/2004 10:42 PM, Andrew Gaffney wrote:
     

    HTML::Parser doesn't build a tree, but you can use it to build one if
    necessary. However, you might find building a tree is not necessary.
    And this is less memory intensive.

    Then there is HTML::Tree.
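
As a rough sketch of the HTML::Tree route, the following uses HTML::TreeBuilder (part of the HTML::Tree distribution) to parse the sample table, change an attribute, and print the result. The input string and the "widen by 50" edit are made up for illustration. Note that TreeBuilder builds a tree of HTML::Element objects and adds the implied html, head, and body elements, so the output is a full document rather than just the table.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder;   # part of the HTML::Tree distribution

my $source = <<'HTML';
<table id="maintable" width="300">
<tr>
<td width="200">some content</td>
<td width="100">more content</td>
</tr>
</table>
HTML

my $tree = HTML::TreeBuilder->new_from_content($source);

# Find every <td> element and widen it by 50 (an arbitrary example edit).
foreach my $td ($tree->look_down(_tag => 'td')) {
    $td->attr(width => $td->attr('width') + 50);
}

my @widths = map { $_->attr('width') } $tree->look_down(_tag => 'td');
print "@widths\n";

print $tree->as_HTML, "\n";   # serialise the modified tree
$tree->delete;                # release the tree when finished with it
```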

    Regards,
    Randy.


    Randy Guest

  3. #3

    Re: parsing HTML

    Randy W. Sims wrote: 
    >
    > HTML::Parser doesn't build a tree, but you can use it to build one if
    > necessary. However, you might find building a tree is not necessary.
    > And this is less memory intensive.
    >
    > Then there is HTML::Tree.

    I'd rather generate a structure similar to what I have above instead of having a
    large tree of class objects that takes up more RAM and is probably slower. How
    would I go about generating a structure such as that above using HTML::Parser?

    --
    Andrew Gaffney
    Network Administrator
    Skyline Aeronautics, LLC.
    636-357-1548

    Andrew Guest

  4. #4

    Re: parsing HTML

    On 7/21/2004 11:24 PM, Andrew Gaffney wrote:

    [snip]
     

    Parsers like HTML::Parser scan a document and upon encountering certain
    tokens fire off events. In the case of HTML::Parser, events are fired
    when encountering a start tag, the text between tags, and at the end
    tag. If you have an arbitrarily deep document structure like HTML, you
    can store the structure using a stack:

    #!/usr/bin/perl
    package SampleParser;

    use strict;

    use HTML::Parser;
    use base qw(HTML::Parser);

    # Called for each start tag: print it indented by the current depth,
    # then push a marker so nested tags are indented one level deeper.
    sub start {
        my($self, $tagname, $attr, $attrseq, $origtext) = @_;
        my $stack = $self->{_stack};
        my $depth = $stack ? @$stack : 0;
        print ' ' x $depth, "<$tagname>\n";
        push @{$self->{_stack}}, ' ';
    }

    # Called for each end tag: pop the marker first so the end tag lines
    # up with its start tag.
    sub end {
        my($self, $tagname, $origtext) = @_;
        pop @{$self->{_stack}};
        my $stack = $self->{_stack};
        my $depth = $stack ? @$stack : 0;
        print ' ' x $depth, "</$tagname>\n";
    }

    1;

    package main;

    use strict;
    use warnings;

    my $p = SampleParser->new();
    $p->parse_file(\*DATA);

    __DATA__
    <html>
    <head>
    <title>Title</title>
    </head>
    <body>
    The body.
    </body>
    </html>


    Randy Guest

  5. #5

    Re: parsing HTML

    Randy W. Sims wrote:
    >
    > [snip]

    Thanks. In the time it took you to put that together, I came up with the
    following to figure out how HTML::Parser works. I'll use your code to expand
    upon it.

    #!/usr/bin/perl

    use strict;
    use warnings;

    use HTML::Parser ();

    # Receives the tag name and a hash ref of attributes.
    sub start {
        print "start ";
        foreach my $arg (@_) {
            if(ref($arg) eq 'HASH') {
                foreach my $key(keys %{$arg}) {
                    print " $key - $arg->{$key}\n";
                }
            } else {
                print "$arg\n";
            }
        }
    }

    sub end {
        print "end ";
        foreach(@_) {
            print "$_\n";
        }
    }

    sub text {
        my $text = shift;

        chomp $text;
        print " text - '$text'\n" if($text ne '');
    }

    my $p = HTML::Parser->new( api_version => 3,
        start_h => [\&start, "tagname, attr"],
        end_h => [\&end, "tagname"],
        text_h => [\&text, "dtext"],
        marked_sections => 1 ); # Not sure what this does

    $p->parse_file("test.html");

    The above gives me the expected output for the sample HTML I provided before.

    --
    Andrew Gaffney
    Network Administrator
    Skyline Aeronautics, LLC.
    636-357-1548

    Andrew Guest

  6. #6

    Re: parsing HTML

    Andrew Gaffney wrote: 
    >>
    >> [snip]
    >> 
    >>
    >> Parsers like HTML::Parser scan a document and upon encountering
    >> certain tokens fire off events. In the case of HTML::Parser, events
    >> are fired when encountering a start tag, the text between tags, and at
    >> the end tag. If you have an arbitrarily deep document structure like
    >> HTML, you can store the structure using a stack:

    <SNIP>
     

    <SNIP>

    Here is my current working code. Please take a look at it and see if there are
    any obvious (or not so obvious) problems. I thought this would end up being far
    more difficult.

    parsehtml.pl
    ============
    #!/usr/bin/perl

    use strict;
    use warnings;

    use HTML::Parser ();

    my $htmltree = [ { tag => 'document', content => [] } ];
    my $node = $htmltree->[0]->{content};   # content list currently being filled
    my @prevnodes = ($htmltree);            # stack of enclosing content lists

    sub start {
        my $tagname = shift;
        my $attr = shift;
        my $newnode = {};

        $newnode->{tag} = $tagname;
        foreach my $key(keys %{$attr}) {
            $newnode->{$key} = $attr->{$key};
        }
        $newnode->{content} = [];
        push @prevnodes, $node;
        push @{$node}, $newnode;
        $node = $newnode->{content};   # descend into the new element
    }

    sub end {
        my $tagname = shift;

        $node = pop @prevnodes;        # climb back to the parent's content list
    }

    sub text {
        my $text = shift;

        chomp $text;
        if($text ne '') {
            push @{$node}, $text;
        }
    }

    my $p = HTML::Parser->new( api_version => 3,
        start_h => [\&start, "tagname, attr"],
        end_h => [\&end, "tagname"],
        text_h => [\&text, "dtext"] );

    $p->parse_file("test.html");

    use Data::Dumper;
    print Dumper $htmltree;

    test.html
    =========
    <table id="maintable" width="300">
    <tr>
    <td width="200">some content</td>
    <td width="100">more content</td>
    </tr>
    </table>
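
Once a structure like this exists, editing it is ordinary array and hash manipulation. The sketch below is only an illustration: find_nodes is a hypothetical helper, and the $htmltree literal is a hand-written stand-in with the shape the script above is expected to build from test.html (content always an array ref, text stored as plain strings).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: collect every node with the given tag name, depth first.
sub find_nodes {
    my ($nodes, $tagname) = @_;
    my @found;
    foreach my $node (@$nodes) {
        next unless ref($node) eq 'HASH';          # skip text strings
        push @found, $node if $node->{tag} eq $tagname;
        push @found, find_nodes($node->{content}, $tagname);
    }
    return @found;
}

# Hand-written stand-in for the parsed form of test.html (an assumption).
my $htmltree = [ { tag => 'document', content => [
    { tag => 'table', id => 'maintable', width => '300', content => [
        { tag => 'tr', content => [
            { tag => 'td', width => '200', content => ['some content'] },
            { tag => 'td', width => '100', content => ['more content'] },
        ] },
    ] },
] } ];

# Widen every cell by 50, editing the nodes in place.
my @cells = find_nodes($htmltree, 'td');
$_->{width} += 50 for @cells;

print join(', ', map { "$_->{width}: $_->{content}[0]" } @cells), "\n";
```

Because find_nodes returns references to the nodes themselves, changes made through @cells show up in $htmltree. One thing to watch with this layout is that an HTML attribute named "tag" or "content" would overwrite the structural keys of the same name.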

    --
    Andrew Gaffney
    Network Administrator
    Skyline Aeronautics, LLC.
    636-357-1548

    Andrew Guest

  7. #7

    Re: parsing HTML

    Andrew Gaffney wrote: 

    <snip code>

    Looks good to me. Once you get used to the idea of event-based parsing,
    storing context information on a stack, it's really simple, and even
    fun. Another nice thing is once you've mastered one (HTML::Parser),
    you've mastered them all (Pod::Parser, XML::Parser, etc.).
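
To illustrate how the pattern carries over, here is a small sketch of the same stack idea using XML::Parser. The sample XML string and the @stack and @lines variables are invented for the example. XML::Parser passes the underlying Expat object as the first argument to each handler, followed by the element name and, for start tags, the attribute name/value pairs.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use XML::Parser;

my @stack;    # names of the elements currently open
my @lines;    # indented outline, one entry per start or end tag

my $p = XML::Parser->new(Handlers => {
    Start => sub {
        my ($expat, $tagname, %attr) = @_;
        push @lines, (' ' x scalar(@stack)) . "<$tagname>";
        push @stack, $tagname;
    },
    End => sub {
        my ($expat, $tagname) = @_;
        pop @stack;
        push @lines, (' ' x scalar(@stack)) . "</$tagname>";
    },
});

$p->parse('<doc><title>Title</title><body>The body.</body></doc>');
print "$_\n" for @lines;
```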

    Regards,
    Randy.
    Randy Guest
