Removing duplicate elements from an XML file

  1. #1

    Removing duplicate elements from an XML file

    Is there an easy way of removing duplicate elements in an XML file? For
    example if I have three elements like:


    <incident id="GDOT-INC-252421" level="3" status="active">
    <type id="8">Incident</type>
    <location>
    <description>Southbound SR 53 AT MARBLE HILL (SINKHOLE MP
    25.3)</description>
    </location>
    </incident>
    <incident id="GDOT-INC-252421" level="3" status="active">
    <type id="8">Incident</type>
    <location>
    <description>Southbound SR 53</description>
    </location>
    </incident>
    <incident id="GDOT-INC-252421" level="3" status="active">
    <type id="8">Incident</type>
    <location>
    <description>Southbound SR 53 AT MARBLE HILL (SINKHOLE MP
    25.3)</description>
    </location>
    </incident>

    Elements 1 and 3 are completely identical. I would like to remove element 3
    while leaving element 1 and 2 intact. Note that the <description> within
    <location> for element 2 is different from the others.

    I can figure out a way of doing this by using the incident id tag as the
    delimiter and doing string comparisons but somehow feel that others would
    have faced this problem earlier and there would be a nicer way of doing it.

    Thanks in advance!


    Angshuman Guin Guest

  2. #2

    Re: Removing duplicate elements from an XML file

    Angshuman Guin wrote:
    > Is there an easy way of removing duplicate elements in an XML file? For
    > example if I have three elements like:
    >
    >
    [...]
    >
    > Elements 1 and 3 are completely identical. I would like to remove element
    > 3 while leaving element 1 and 2 intact. Note that the <description> within
    > <location> for element 2 is different from the others.
    >
    > I can figure out a way of doing this by using the incident id tag as the
    > delimiter and doing string comparisons but somehow feel that others would
    > have faced this problem earlier and there would be a nicer way of doing
    > it.
    Here is a rather simplistic solution using XML::Twig and Digest::MD5. For
    each incident element, the code computes the MD5 digest of the element's
    XML and stores it; an element whose digest has already been seen is
    removed. Note that, as XML::Twig drops non-significant whitespace, the 2
    elements don't need to be formatted exactly the same way.

    #!/usr/bin/perl -w
    use strict;
    use XML::Twig;
    use Digest::MD5 qw(md5);

    # options, no processing done here at all
    my @tags= qw(incident); # you can have one or more tags here
    my @files= ( "incidents.xml"); # several files also allowed

    # create handlers for each of the tags to check for duplicate
    my %handlers= map { $_ => \&check_duplicate } @tags;


    foreach my $file (@files)
      { (my $outfile= $file)=~ s{\.xml$}{.cleaned.xml}; # generate output file name
        open( OUT, '>', $outfile) or die "cannot create output file $outfile: $!";
        my $twig= XML::Twig->new( twig_roots   => \%handlers,
                                  pretty_print => 'indented'
                                )
                           ->parsefile( $file);
        # don't forget to flush, or the end of the file will be missing
        $twig->flush( \*OUT);
        close OUT;
      }

    { my %md5; # md5 => 1, basically memoizes md5 for elements

      sub check_duplicate
        { my( $t, $elt)= @_;
          my $elt_text= $elt->sprint; # get the complete text, including tags
          my $md5= md5($elt_text);
          if( $md5{$md5})
            { $elt->delete; } # if md5 already found, remove element
          else
            { $md5{$md5}=1; # store md5
              $t->flush( \*OUT); # flush to limit memory usage
            }
        }
    }
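
For readers who want to try the fingerprint-and-skip idea without installing XML::Twig, here is a minimal sketch that uses only core modules. It is not the code above: a regular expression stands in for the XML parser, which only holds up if incident elements are never nested and never contain the literal text </incident> inside comments or CDATA. The helper name dedupe_incidents and the shortened sample data are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: returns the input with repeated <incident> blocks
# removed, keeping the first occurrence of each.
sub dedupe_incidents {
    my ($xml) = @_;
    my %seen;    # fingerprint => count of incidents seen with that digest

    $xml =~ s{(<incident\b.*?</incident>)\s*}{
        my $block = $1;
        # Naive normalization (an assumption): collapse whitespace runs so
        # that differences in line wrapping do not change the digest.
        (my $normalized = $block) =~ s/\s+/ /g;
        $seen{ md5_hex($normalized) }++ ? '' : "$block\n";
    }gse;

    return $xml;
}

# Shortened sample: the first and third incidents are identical.
my $sample = <<'XML';
<incidents>
<incident id="GDOT-INC-252421"><description>Southbound SR 53 AT MARBLE HILL</description></incident>
<incident id="GDOT-INC-252421"><description>Southbound SR 53</description></incident>
<incident id="GDOT-INC-252421"><description>Southbound SR 53 AT MARBLE HILL</description></incident>
</incidents>
XML

print dedupe_incidents($sample);
```

On this sample the third incident is dropped and the first two are kept. For real data a parser-based approach such as the XML::Twig script above is the safer choice, since the regex knows nothing about XML syntax.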


    I also have a style comment: it is weird in XML to have an attribute named
    id... which is not a real ID (both incident and type have one of those). It
    is not completely evil, it just doesn't fit with the usual XML conventions.

    __
    Michel Rodriguez
    Perl & XML
    http://xmltwig.com
    Michel Rodriguez Guest
