Splitting up an XML File


  1. #1

    Splitting up an XML File

    I have an XML file that looks like this:

    <root>
    <economist publications="true" >
    <name>
    <first>John</first>
    <last>Doe</last>
    </name>
    <keywords>
    <keyword>Foo</keyword>
    <keyword>Bar</keyword>
    </keywords>
    <title>Indian Chief</title>
    </economist>

    <economist publications="true" >
    <name>
    <first>Jane</first>
    <last>Smith</last>
    </name>
    <keywords>
    <keyword>More Foo</keyword>
    <keyword>More Bar</keyword>
    </keywords>
    <title>President</title>
    </economist>
    </root>

    But the actual file has about 100 <economist> elements.
    I need to write some Perl code to parse this XML file and
    write out 100 smaller XML files, each file corresponding to one
    <economist> element.

    So in my example, I'd write 2 smaller files, one that
    looks like this:
    <economist publications="true" >
    <name>
    <first>John</first>
    <last>Doe</last>
    </name>
    <keywords>
    <keyword>Foo</keyword>
    <keyword>Bar</keyword>
    </keywords>
    <title>Indian Chief</title>
    </economist>

    and one that looks like this:
    <economist publications="true" >
    <name>
    <first>Jane</first>
    <last>Smith</last>
    </name>
    <keywords>
    <keyword>More Foo</keyword>
    <keyword>More Bar</keyword>
    </keywords>
    <title>President</title>
    </economist>

    There are some nested elements in the real file, so I think
    XML::Simple won't work for this.

    Any ideas about how I can do this? I don't need to do any processing
    (at least not now) - just reading and writing smaller chunks.

    Thanks!
    JAG Guest

  3. #2

    Re: Splitting up an XML File

    JAG <jeffg@programmer.net> wrote:
    > But the actual file has about 100 <economist> elements.
    > I need to write some Perl code to parse this XML file and
    > write out 100 smaller XML files, each file corresponding to one
    > <economist> element.
    > There are some nested elements in the real file,

    I will assume that <economist> is NOT nested, and that the
    start/end tags are on lines by themselves.

    > Any ideas about how I can do this?

    # strip non-<economist> stuff at top of file
    $/ = "<economist>\n";
    while ( <> ) { # read one <economist> element per loop iteration
    # open file, output $_ to file, close file.
    }


    --
    Tad McClellan SGML consulting
    tadmc@augustmail.com Perl programming
    Fort Worth, Texas
    Tad McClellan Guest

  4. #3

    Re: Splitting up an XML File

    Tad McClellan <tadmc@augustmail.com> wrote:
    > $/ = "<economist>\n";

    Oops! That should have been:

    $/ = "</economist>\n";
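
    Filled out, the record-separator idea might look like the sketch below. It keeps the assumptions stated earlier (<economist> is never nested and each end tag is on a line by itself). The sub names and the record.N file names are only illustrative; they do not come from the posts above.

```perl
use strict;
use warnings;

# Split XML text from a filehandle into one chunk per <economist> element.
# Assumes <economist> is never nested and that each end tag is followed
# directly by a newline.
sub read_economists {
    my ($fh) = @_;
    local $/ = "</economist>\n";    # read up to and including each end tag
    my @records;
    while ( my $chunk = <$fh> ) {
        # The final read holds only the trailing </root>; skip it.
        next unless $chunk =~ /<economist\b/;
        # Drop anything before the start tag (<root>, blank lines).
        $chunk =~ s/\A.*?(?=<economist\b)//s;
        push @records, $chunk;
    }
    return @records;
}

# Write each chunk to $dir/record.0, $dir/record.1, ... (hypothetical names).
sub write_records {
    my ( $dir, @records ) = @_;
    my $n = 0;
    for my $record (@records) {
        my $path = "$dir/record." . $n++;
        open my $out, '>', $path or die "Cannot write $path: $!";
        print {$out} $record;
        close $out or die "Cannot close $path: $!";
    }
    return $n;
}

# Usage: perl split.pl big.xml
if (@ARGV) {
    open my $in, '<', $ARGV[0] or die "Cannot read $ARGV[0]: $!";
    my $count = write_records( '.', read_economists($in) );
    close $in;
    print "wrote $count files\n";
}
```

    This never parses the XML, so it is only as reliable as the line-layout assumption. If an end tag is followed by trailing whitespace, or the tags share a line, a real parser is the safer choice.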


    --
    Tad McClellan SGML consulting
    tadmc@augustmail.com Perl programming
    Fort Worth, Texas
    Tad McClellan Guest

  5. #4

    Re: Splitting up an XML File

    jeffg@programmer.net (JAG) wrote in message news:<6b40b6b9.0309171009.20d66b6a@posting.google.com>...
    > I have an XML file that looks like this:
    >
    <snip />
    >
    > But the actual file has about 100 <economist> elements.
    > I need to write some Perl code to parse this XML file and
    > write out 100 smaller XML files, each file corresponding to one
    > <economist> element.
    >
    > So in my example, I'd write 2 smaller files, one that
    > looks like this:
    <snip />
    >
    > There are some nested elements in the real file, so I think
    > XML::Simple won't work for this.
    >
    > Any ideas about how I can do this? I don't need to do any processing
    > (at least not now) - just reading and writing smaller chunks.
    >
    This uses one of my favorite modules, XML::XPath:

    [trwww@waveright trwww]$ perl
    use warnings;
    use strict;
    use XML::XPath;
    use IO::File;

    my($xp) = XML::XPath->new( xml => join('', <DATA>) );
    my($nodeset) = $xp->find( '/root/economist' );

    my($ext) = 0;

    foreach my $record ( $nodeset->get_nodelist() ) {
        IO::File->new('> record.'.$ext++)->print($record->toString());
    }

    __DATA__
    <root>
    <economist publications="true" >
    <name>
    <first>John</first>
    <last>Doe</last>
    </name>
    <keywords>
    <keyword>Foo</keyword>
    <keyword>Bar</keyword>
    </keywords>
    <title>Indian Chief</title>
    </economist>

    <economist publications="true" >
    <name>
    <first>Jane</first>
    <last>Smith</last>
    </name>
    <keywords>
    <keyword>More Foo</keyword>
    <keyword>More Bar</keyword>
    </keywords>
    <title>President</title>
    </economist>
    </root>
    Ctrl-D
    [trwww@waveright trwww]$ ls -l
    total 24
    drwxr-xr-x 3 trwww trwww 4096 Aug 17 19:00 apps
    drwx------ 3 trwww trwww 4096 Sep 16 20:49 Desktop
    drwxr-xr-x 3 trwww trwww 4096 Aug 18 16:50 misc
    drwxrwxr-x 3 trwww trwww 4096 Sep 6 19:00 public_html
    -rw-rw-r-- 1 trwww trwww 297 Sep 17 22:56 record.0
    -rw-rw-r-- 1 trwww trwww 306 Sep 17 22:56 record.1
    [trwww@waveright trwww]$ cat record.0
    <economist publications="true">
    <name>
    <first>John</first>
    <last>Doe</last>
    </name>
    <keywords>
    <keyword>Foo</keyword>
    <keyword>Bar</keyword>
    </keywords>
    <title>Indian Chief</title>
    </economist>[trwww@waveright trwww]$ cat record.1
    <economist publications="true">
    <name>
    <first>Jane</first>
    <last>Smith</last>
    </name>
    <keywords>
    <keyword>More Foo</keyword>
    <keyword>More Bar</keyword>
    </keywords>
    <title>President</title>
    </economist>[trwww@waveright trwww]$
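
    For a real input file rather than an inline __DATA__ section, a variation along these lines may be more convenient. It is a sketch only: split_file, its two arguments, and the record.N naming are illustrative, and it assumes the same /root/economist layout as the sample. It also checks that each output file opened, which the one-line IO::File call does not.

```perl
use strict;
use warnings;
use XML::XPath;

# Write each /root/economist element of $xml_file to $out_dir/record.N
# and return the number of files written.
sub split_file {
    my ( $xml_file, $out_dir ) = @_;
    my $xp = XML::XPath->new( filename => $xml_file );
    my $n  = 0;
    foreach my $record ( $xp->find('/root/economist')->get_nodelist ) {
        my $path = "$out_dir/record." . $n++;
        open my $out, '>', $path or die "Cannot write $path: $!";
        print {$out} $record->toString, "\n";    # add the missing final newline
        close $out or die "Cannot close $path: $!";
    }
    return $n;
}

# Usage: perl split_file.pl economists.xml output_dir
if ( @ARGV == 2 ) {
    printf "%d files written\n", split_file(@ARGV);
}
```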

    Todd W.
    trwww Guest

  6. #5

    Re: Splitting up an XML File

    toddrw69@excite.com (trwww) wrote in message news:<d81ecffa.0309171902.596dfa99@posting.google.com>...
    > jeffg@programmer.net (JAG) wrote in message news:<6b40b6b9.0309171009.20d66b6a@posting.google.com>...
    > > I have an XML file that looks like this:
    > >
    > <snip />
    > >
    > > But the actual file has about 100 <economist> elements.
    > > I need to write some Perl code to parse this XML file and
    > > write out 100 smaller XML files, each file corresponding to one
    > > <economist> element.
    > >
    > > So in my example, I'd write 2 smaller files, one that
    > > looks like this:
    > <snip />
    > >
    > > There are some nested elements in the real file, so I think
    > > XML::Simple won't work for this.
    > >
    > > Any ideas about how I can do this? I don't need to do any processing
    > > (at least not now) - just reading and writing smaller chunks.
    > >
    >
    > This uses one of my favorite modules, XML::XPath:
    >
    > <snip />
    >
    > Todd W.

    Thanks! This works beautifully.
    Now, here are two more things.

    Instead of naming the files record.[0..n], I want each
    output file to have the name of the person.
    So these two files would be named Jane.Smith and John.Doe

    Also, within each <economist> element, there is now an element
    called <work> that contains other elements. I need each of these
    <work> elements to be written to its own file called lastname_work
    and not in the first output file.

    So for this XML file:

    <root>
    <economist publications="true" >
    <name>
    <first>John</first>
    <last>Doe</last>
    </name>
    <keywords>
    <keyword>Foo</keyword>
    <keyword>Bar</keyword>
    </keywords>
    <title>Indian Chief</title>
    <work>
    <title>Title 1</title>
    <content>Some Content</content>
    </work>
    </economist>

    <economist publications="true" >
    <name>
    <first>Jane</first>
    <last>Smith</last>
    </name>
    <keywords>
    <keyword>More Foo</keyword>
    <keyword>More Bar</keyword>
    </keywords>
    <title>President</title>
    <work>
    <title>Title 2</title>
    <content>Some More Content</content>
    </work>
    </economist>
    </root>

    So this would produce the same two files your original code produced,
    but named John.Doe and Jane.Smith and also without the <work> element.
    Instead of printing the work element in this file, it should be printed
    in its own file, in this case, called Smith_work and Doe_work.

    Thanks again.
    JAG Guest

  7. #6

    Re: Splitting up an XML File

    jeffg@programmer.net (JAG) wrote in message news:<6b40b6b9.0309180733.6cdcb2c8@posting.google.com>...
    > toddrw69@excite.com (trwww) wrote in message news:<d81ecffa.0309171902.596dfa99@posting.google.com>...
    > > jeffg@programmer.net (JAG) wrote in message news:<6b40b6b9.0309171009.20d66b6a@posting.google.com>...
    > > > I have an XML file that looks like this:
    > > >
    <snip />
    > > >
    > > > There are some nested elements in the real file, so I think
    > > > XML::Simple won't work for this.
    > > >
    > > > Any ideas about how I can do this? I don't need to do any processing
    > > > (at least not now) - just reading and writing smaller chunks.
    > > >
    > >
    > > This uses one of my favorite modules, XML::XPath:
    > >
    <snip />
    >
    > Thanks! This works beautifully.
    of course =0)
    > Now, here are two more things.
    No thank you.
    >
    > Instead of naming the files record.[0..n], I want each
    > output file to have the name of the person.
    > So these two files would be named Jane.Smith and John.Doe
    >
    > Also, within each <economist> element, there is now an element
    <snip />
    >
    > So this would produce the same two files your original code produced,
    > but named John.Doe and Jane.Smith and also without the <work> element.
    > Instead of printing the work element in this file, it should be printed
    > in its own file, in this case, called Smith_work and Doe_work.
    >
    I replied to your post to show you and CLPM lurkers how easy
    XML::XPath is to use.

    If you need a consultant, email me off-list at sendwade@hotmail.com

    Otherwise, read the XML::XPath documentation. What you propose above
    is trivial to implement with XPath.
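
    For readers following along, one way the request could be approached is sketched below. It is not code from this thread. It uses relative XPath expressions with the <economist> node as context, and it takes a shortcut: the <work> element is cut out of the serialized record as a string, which assumes a child's toString() is an exact substring of its parent's toString() (plausible for the sample data, which has no namespaces). The sub name split_records is illustrative.

```perl
use strict;
use warnings;
use XML::XPath;

# Return a hash of file name => file content for every <economist>.
# File names follow the request above: First.Last and Last_work.
sub split_records {
    my ($xml) = @_;
    my $xp = XML::XPath->new( xml => $xml );
    my %files;
    foreach my $record ( $xp->find('/root/economist')->get_nodelist ) {
        # Relative paths are evaluated with $record as the context node.
        my $first = '' . $xp->findvalue( 'name/first', $record );
        my $last  = '' . $xp->findvalue( 'name/last',  $record );

        my $person = $record->toString;
        foreach my $work ( $xp->find( 'work', $record )->get_nodelist ) {
            my $work_xml = $work->toString;
            # Assumption: the serialized <work> appears verbatim inside the
            # serialized record. Fail loudly if that does not hold.
            $person =~ s/\Q$work_xml\E\s*//
                or die "could not locate <work> in record for $first $last";
            $files{"${last}_work"} .= $work_xml . "\n";
        }
        $files{"$first.$last"} = $person . "\n";
    }
    return %files;
}

# Usage: perl split_named.pl economists.xml
if (@ARGV) {
    open my $in, '<', $ARGV[0] or die "Cannot read $ARGV[0]: $!";
    my %files = split_records( do { local $/; <$in> } );
    close $in;
    for my $name ( sort keys %files ) {
        open my $out, '>', $name or die "Cannot write $name: $!";
        print {$out} $files{$name};
        close $out or die "Cannot close $name: $!";
    }
}
```

    Two economists with the same last name would share one Last_work file here, and names are used as file names without any sanitizing, so real data may need a different naming scheme.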

    Todd W.
    trwww Guest
