Angshuman Guin #1
Removing duplicate elements from an XML file
Is there an easy way of removing duplicate elements in an XML file? For
example, if I have three elements like:
<incident id="GDOT-INC-252421" level="3" status="active">
<type id="8">Incident</type>
<location>
<description>Southbound SR 53 AT MARBLE HILL (SINKHOLE MP
25.3)</description>
</location>
</incident>
<incident id="GDOT-INC-252421" level="3" status="active">
<type id="8">Incident</type>
<location>
<description>Southbound SR 53</description>
</location>
</incident>
<incident id="GDOT-INC-252421" level="3" status="active">
<type id="8">Incident</type>
<location>
<description>Southbound SR 53 AT MARBLE HILL (SINKHOLE MP
25.3)</description>
</location>
</incident>
Elements 1 and 3 are completely identical. I would like to remove element 3
while leaving elements 1 and 2 intact. Note that the <description> within
<location> for element 2 is different from the others.
I can figure out a way of doing this by using the incident id tag as the
delimiter and doing string comparisons but somehow feel that others would
have faced this problem earlier and there would be a nicer way of doing it.
Thanks in advance!
Angshuman Guin Guest
Michel Rodriguez #2
Re: Removing duplicate elements from an XML file
Angshuman Guin wrote:
> Is there an easy way of removing duplicate elements in an XML file? For
> example, if I have three elements like:
>
> [...]
>
> Elements 1 and 3 are completely identical. I would like to remove element
> 3 while leaving element 1 and 2 intact. Note that the <description> within
> <location> for element 2 is different from the others.
>
> I can figure out a way of doing this by using the incident id tag as the
> delimiter and doing string comparisons but somehow feel that others would
> have faced this problem earlier and there would be a nicer way of doing
> it.
Here is a rather simplistic solution using XML::Twig and Digest::MD5.
Basically, for each incident element the code computes the MD5 of the XML
and stores it; elements whose MD5 has already been seen are removed. Note
that since XML::Twig drops non-significant whitespace, the 2 elements don't
need to be formatted exactly the same way.
#!/usr/bin/perl -w
use strict;
use XML::Twig;
use Digest::MD5 qw(md5);

# options, no processing done here at all
my @tags= qw(incident);         # you can have one or more tags here
my @files= ( "incidents.xml");  # several files also allowed

# create handlers for each of the tags to check for duplicates
my %handlers= map { $_ => \&check_duplicate } @tags;

foreach my $file (@files)
  { # generate output file name
    (my $outfile= $file)=~ s{\.xml$}{.cleaned.xml};
    # 3-argument open, so odd characters in the file name are harmless
    open( OUT, '>', $outfile) or die "cannot create output file $outfile: $!";
    my $twig= XML::Twig->new( twig_roots   => \%handlers,
                              pretty_print => 'indented'
                            )
                       ->parsefile( $file);
    # don't forget to flush, or the end of the file will be missing
    $twig->flush( \*OUT);
    close OUT;
  }

{ my %md5; # md5 => 1, remembers the md5 of every element seen so far
  sub check_duplicate
    { my( $t, $elt)= @_;
      my $elt_text= $elt->sprint;   # get the complete text, including tags
      my $md5= md5($elt_text);
      if( $md5{$md5})
        { $elt->delete; }           # if md5 already found, remove element
      else
        { $md5{$md5}=1;             # store md5
          $t->flush( \*OUT);        # flush to limit memory usage
        }
    }
}
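To see the digest idea on its own, here is a minimal sketch that needs only core modules. It is not the XML::Twig code above: normalize() and dedupe() are hypothetical helpers, and the whitespace collapsing is only a crude stand-in for what a real parser does (it also squeezes whitespace inside text content, which may or may not be what you want).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: crude whitespace normalization, NOT a real XML
# normalizer. It only makes differently formatted copies compare equal.
sub normalize {
    my ($xml) = @_;
    $xml =~ s/\s+/ /g;        # runs of whitespace become a single space
    $xml =~ s/>\s+</></g;     # drop whitespace between adjacent tags
    $xml =~ s/^\s+|\s+$//g;   # trim both ends
    return $xml;
}

# Hypothetical helper: keep the first fragment for each digest, in order.
sub dedupe {
    my @fragments = @_;
    my %seen;                 # digest => number of times seen so far
    return grep { !$seen{ md5_hex( normalize($_) ) }++ } @fragments;
}

# Shortened versions of the three incidents from the question;
# the third is the first one again, just laid out differently.
my @incidents = (
    '<incident id="GDOT-INC-252421"><description>Southbound SR 53 AT MARBLE HILL</description></incident>',
    '<incident id="GDOT-INC-252421"><description>Southbound SR 53</description></incident>',
    qq{<incident id="GDOT-INC-252421">\n  <description>Southbound SR 53 AT MARBLE HILL</description>\n</incident>},
);

my @unique = dedupe(@incidents);
print scalar(@unique), " unique incidents\n";   # prints "2 unique incidents"
```

Storing digests rather than the full element text keeps the lookup table small when the elements are large, at the cost of a (very unlikely) collision.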
I also have a style comment: it is weird in XML to have an attribute named
id... which is not a real ID (both incident and type have one of those). It
is not completely evil, it just doesn't fit with the usual XML conventions.
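To make that concrete: an attribute declared with type ID in a DTD must have a value that is an XML name and that is unique in the document. The sketch below checks the id values from the sample against both rules. The helper names are made up for illustration, and the regex is a simplified, ASCII-only approximation of the XML Name rule, not the full definition.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Simplified, ASCII-only approximation of the XML Name rule
# (an assumption for illustration; the real rule allows more characters).
sub looks_like_xml_name {
    my ($value) = @_;
    return $value =~ /^[A-Za-z_:][A-Za-z0-9_:.\-]*$/ ? 1 : 0;
}

# Return the values that occur more than once, sorted as strings.
sub repeated_values {
    my %count;
    $count{$_}++ for @_;
    return sort { $a cmp $b } grep { $count{$_} > 1 } keys %count;
}

# id values in document order from the sample: incident, type, incident, ...
my @ids = ( "GDOT-INC-252421", "8", "GDOT-INC-252421", "8", "GDOT-INC-252421", "8" );

for my $id ( repeated_values(@ids) ) {
    printf "%s: repeated%s\n", $id,
        looks_like_xml_name($id) ? "" : ", and not a valid name";
}
```

Both values are repeated, and "8" starts with a digit, so neither attribute could be declared as a real ID.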
__
Michel Rodriguez
Perl & XML
http://xmltwig.com
Michel Rodriguez Guest




