-
Andrew Hughes #1
complex data file parsing
I am trying to make sense of a comma delimited log file in which multiple
lines make up 1 record. Here is an example:
A,W29073,Thu Apr 05 15:25:08 2001
B,W29073,Scott,S,ser@aq.com,249 Tah Ave,,Sth San Francisco,CA,~US,55555-5555
P,W29073,
X,W29073,Company Name,A,Department Name,San Francisco 00),Purchase Order
Number,254
S,W29073,UPS Next Day Air,Scott S,2 Tah Ave,,Sth San
Francisco,CA,~US,55555-5555
I,W29073,AVHQ_101090lfbl,6.000,$28.50,$171.00,,,,1 .00,,2,0
I,W29073,AVHQ_101090xlfbl,4.000,$28.50,$114.00,,,, 1.00,,3,0
T,W29073,$285.00,,,,$53.09,$338.09,,10.00,
A,W29101,Wed Apr 11 07:43:33 2001
B,W29101,harold,m,HMA@masnc.net,10 wind ridge parkway,,Atlanta,GA,~US,55555
P,W29101,
X,W29101,Company Name,,Department Name,,Purchase Order Number,10252
S,W29101,UPS Regular Ground,harold m,10 wind ridge
parkway,,Atlanta,GA,~US,55555
I,W29101,ADV_Carb-Natxxl,1.000,$16.50,$16.50,,,,1.50,,4
T,W29101,$17.50,,7.000,$1.23,$9.28,$28.01,,1.50,
A,W29116,Thu Apr 12 11:42:21 2001
B,W29116,test,test,test@test.com,test,,test,GA,~US ,11111
P,W29116,Credit,Offline,Visa,4444444444444444,04/04,,,,
X,W29116,Company Name,,Department Name,,Purchase Order Number,
S,W29116,UPS Regular Ground,test test,test,,test,GA,~US,11111
I,W29116,ADV_1601,1.000,$14.00,$14.00,,,,1.50,,3
T,W29116,$14.00,,7.000,$0.98,$9.94,$24.92,,1.50,
Here's what I know:
I am trying to get a list of email addresses for people who have ordered
products that begin with ADV
I know that the second field is the order number that ties all of the lines
for one order together.
I know that each block always starts with an A in the first position of the
first line and ends with a T in the last position of the last line.
I know that the second line starts with a B, and the data in the 5th space
on this line is the e-mail address, which is what I ultimately want.
However,...
I am trying to get a list of email addresses for people who have ordered
products that begin with ADV. These can appear in multiple I lines.
Therefore you can never predict how many lines make up 1 order block.
I can handle all of the pieces when a file has each complete record on its
own line. The problem is that these records are split across multiple lines,
and the number of lines can increase based on how many line items (I rows)
there are on the order.
Can anyone offer me some direction? Should I try to leave these lines
separate? Should I try to start each line with A and then put each of the
subsequent lines end to end until I hit another "A?"
Thanks,
Andrew
Andrew Hughes
Insider's Advantage
Webmaster
Phone: (404) 575-6389
Fax: (404) 575-6374
"Online ordering is now available. Visit http://insidersadvantage.com for
details."
-
Wolf Blaum #2
Re: complex data file parsing
Hi,

> I know that each block always starts with an A in the first position of
> the first line and ends with a T in the last position of the last line.

Isn't it a T in the first position of the last row of the set?

> I know that the second line starts with a B, and the data in the 5th space
> on this line is the e-mail address, which is what I ultimately want.
> However,...

Only line with a B in the beginning in the set?

> I am trying to get a list of email addresses for people who have ordered
> products that begin with ADV. These can appear in multiple I lines.
> Therefore you can never predict how many lines make up 1 order block.

What about:

#!/usr/bin/perl
use strict;
use warnings;

my @email;
open(FH, '<', 'complex.txt') or die "Cannot open complex.txt: $!";
local $/ = "\nA,";                   # make "\nA," the record separator
while (<FH>) {                       # read the next record into $_
    my @fields = split /,|\n/, $_;   # split at , or \n
    my $b_index;                     # undef again for every new record
    for (my $i=0; $i<=$#fields; $i++){
        if ($fields[$i] eq "B") {$b_index=$i; next;}
        elsif ($fields[$i] =~ /^ADV_/) {push @email, $fields[$b_index+4]; last;}
    }
}
close(FH);

Works on the sample you provided.
$/ (see perlvar) is the record separator, usually \n.
If T really were the last char in the last row of the set, you could use
"T\n" as $/.
The way I do it assumes that the first and only the first line of each set
begins with an A (and falsely puts that A at the end of the previous record,
but that doesn't matter for the aim here, does it?).
The push assumes that there are always exactly 5 records between B and email
and that this is the only line with a B in the record (and that it comes
before the lines with ADV_).
A lot of assumptions.
I'm sure there are better ways to do that - might be a start, though.

> "Online ordering is now available. Visit http://insidersadvantage.com for
> details."

Uh, given your question, I better don't, eh?

Good luck, Wolf
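
One possible alternative to changing $/, sketched here only as an illustration: read the file line by line and use the order number in the second field to tie the lines of an order together, as described in the original question. The sub name adv_emails is hypothetical, and the /^W\d+$/ check is an assumption taken from the sample (order numbers there look like W29073); it is used to skip the wrapped continuation lines. The sample is shortened and read from an in-memory filehandle so the sketch is self-contained.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the e-mail addresses of all orders that
# contain at least one I line whose product code starts with ADV.
sub adv_emails {
    my ($fh) = @_;
    my (%email_for, %has_adv, @order_ids);
    while (my $line = <$fh>) {
        chomp $line;
        my ($type, $order, @rest) = split /,/, $line;
        # Assumption: real lines carry an order number like W29073 in field 2;
        # anything else is treated as a wrapped continuation line and skipped.
        next unless defined $order && $order =~ /^W\d+$/;
        if ($type eq 'A') {
            push @order_ids, $order;          # remember the order of the blocks
        }
        elsif ($type eq 'B') {
            $email_for{$order} = $rest[2];    # 5th field of the B line
        }
        elsif ($type eq 'I' && defined $rest[0] && $rest[0] =~ /^ADV/) {
            $has_adv{$order} = 1;             # this order has an ADV product
        }
    }
    return map  { $email_for{$_} }
           grep { $has_adv{$_} && defined $email_for{$_} } @order_ids;
}

# Shortened copy of the sample data, for illustration only.
my $sample = <<'END';
A,W29073,Thu Apr 05 15:25:08 2001
B,W29073,Scott,S,ser@aq.com,249 Tah Ave,,Sth San Francisco,CA,~US,55555-5555
I,W29073,AVHQ_101090lfbl,6.000,$28.50,$171.00,,,,1.00,,2,0
T,W29073,$285.00,,,,$53.09,$338.09,,10.00,
A,W29101,Wed Apr 11 07:43:33 2001
B,W29101,harold,m,HMA@masnc.net,10 wind ridge parkway,,Atlanta,GA,~US,55555
I,W29101,ADV_Carb-Natxxl,1.000,$16.50,$16.50,,,,1.50,,4
T,W29101,$17.50,,7.000,$1.23,$9.28,$28.01,,1.50,
END

open(my $fh, '<', \$sample) or die "Cannot open in-memory sample: $!";
my @emails = adv_emails($fh);
close $fh;
print "$_\n" for @emails;
```

Because each line is matched by its own type letter and order number, this version does not depend on the B line coming before the I lines, and a stray field that happens to equal "B" elsewhere in the record is not mistaken for the start of the B line.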
-
Andrew Hughes #3
RE: complex data file parsing
Thanks for the information. That was much more than I expected.
You are right about the T line. That was a typo. The T is in the first
position of the last line of each order block.
As far as your follow-up question on the B lines, "Only line with a B in the
beginning in the set?", I'm not sure if I understand. If you mean that there
will only be 1 line per order (set of lines A-T) with a B in the first
position, you are correct.
Also, as far as your assumption, "The way I do it assumes that the first and
only the first line of each set begins with an A (and falsely puts that A at
the end of the previous record, but that doesn't matter for the aim here,
does it?)", I'm not sure what you mean by this either. However, it sounds
like you have it correct. Lines that indicate the beginning of an order
block will only ever start with an A in the first position.
Finally, the final assumption, that "The push assumes that there are always
exactly 5 records between B and email and that this is the only line with a
B in the record (and that it comes before the lines with ADV_)". I think
that this is correct. An example line is
"B,W29116,test,test,test@test.com," The positions are 0,1,2,3,4, so that
equals 5, and it will ALWAYS be five. Finally, the B line will ALWAYS come
before the ADV_ lines. This appears to be correct, judging by the fact that
the output of the script is e-mail addresses.
I tested the script, and I was able to output e-mail addresses. However,
using the data that I posted, it does not quite output exactly what I need.
Based on the script that you sent me (I added the line "print @email" to
view the output):

for (my $i=0; $i<=$#fields; $i++){
    if ($fields[$i] eq "B") {$b_index=$i; next;}
    elsif ($fields[$i] =~ /^ADV_.*/) {push @email, $fields[$b_index+4]; last;}
    print @email;

and on this sample of order.csv:

A,W29073,Thu Apr 05 15:25:08 2001
B,W29073,Scott,S,ser@aq.com,249 Tah Ave,,Sth San Francisco,CA,~US,55555-5555
P,W29073,
X,W29073,Company Name,A,Department Name,San Francisco 00),Purchase Order
Number,254
S,W29073,UPS Next Day Air,Scott S,2 Tah Ave,,Sth San
Francisco,CA,~US,55555-5555
I,W29073,AVHQ_101090lfbl,6.000,$28.50,$171.00,,,,1 .00,,2,0
I,W29073,AVHQ_101090xlfbl,4.000,$28.50,$114.00,,,, 1.00,,3,0
T,W29073,$285.00,,,,$53.09,$338.09,,10.00,
A,W29101,Wed Apr 11 07:43:33 2001
B,W29101,harold,m,HMA@masnc.net,10 wind ridge parkway,,Atlanta,GA,~US,55555
P,W29101,
X,W29101,Company Name,,Department Name,,Purchase Order Number,10252
S,W29101,UPS Regular Ground,harold m,10 wind ridge
parkway,,Atlanta,GA,~US,55555
I,W29101,ADV_Carb-Natxxl,1.000,$16.50,$16.50,,,,1.50,,4
T,W29101,$17.50,,7.000,$1.23,$9.28,$28.01,,1.50,
A,W29116,Thu Apr 12 11:42:21 2001
B,W29116,test,test,test@test.com,test,,test,GA,~US ,11111
P,W29116,Credit,Offline,Visa,4444444444444444,04/04,,,,
X,W29116,Company Name,,Department Name,,Purchase Order Number,
S,W29116,UPS Regular Ground,test test,test,,test,GA,~US,11111
I,W29116,ADV_1601,1.000,$14.00,$14.00,,,,1.50,,3
T,W29116,$14.00,,7.000,$0.98,$9.94,$24.92,,1.50,
I would expect to see:
HMA@masnc.nettest@test.com
However, I see:
HMA@masnc.netHMA@masnc.netHMA@masnc.netHMA@masnc.netHMA@masnc.netHMA@masnc.n
etHMA@masnc.netHMA@masnc.netHMA@masnc.netHMA@masnc.netHMA@masnc.netHMA@masnc
.netHMA@masnc.netHMA@masnc.netHMA@masnc.netHMA@masnc.netHMA@masnc.netHMA@mas
nc.netHMA@masnc.netHMA@masnc.netHMA@masnc.netHMA@masnc.netHMA@masnc.netHMA@m
asnc.netHMA@masnc.netHMA@masnc.netHMA@masnc.netHMA@masnc.netHMA@masnc.netHMA
@masnc.netHMA@masnc.netHMA@masnc.netHMA@masnc.netHMA@masnc.netHMA@masnc.netH
MA@masnc.netHMA@masnc.netHMA@masnc.netHMA@masnc.netHMA@masnc.netHMA@masnc.ne
tHMA@masnc.netHMA@masnc.net
What is going wrong? Am I trying to view the output incorrectly?
Thanks for any additional direction.
Andrew
-
Wolf Blaum #4
Re: complex data file parsing
Hi,

> As far as your follow-up question on the B lines, "Only line with a B in
> the beginning in the set?", I'm not sure if I understand. If you mean that
> there will only be 1 line per order (set of lines A-T) with a B in the
> first position, you are correct.

Yes, that's what I meant.
Sorry about my laziness. Additionally, I get to correct all my embarrassing
typos...

> Also, as far as your assumption, "The way I do it assumes that the first
> and only the first line of each set begins with an A (and falsely puts that
> A at the end of the previous record, but that doesn't matter for the aim
> here, does it?)", I'm not sure what you mean by this either. However, it
> sounds like you have it correct. Lines that indicate the beginning of an
> order block will only ever start with an A in the first position.

Well, what that $/="\nA," does is change the amount of data the while
(<FH>) reads into $_.
Usually that is a line - in your case, the change of $/ gets it to read a
whole order into $_: from A,.... to T,..... end of line here. That's what you
need. However, I cheat: it actually reads from A,... to T,.... \nA, into $_,
so even though the (A,) belongs to the next record, it ends up in the
previous one. That's kind of wrong, given your record structure, but does
not matter for the purpose you described. See the print $_ in the code below.
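
If the stray "A," at the end of each record ever does matter, one way to tidy it up is sketched below. This is only an illustration, not part of the script in this thread: it relies on chomp removing whatever $/ is currently set to, and it uses made-up three-line orders (W1, W2) read from an in-memory filehandle.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up miniature data: two orders of three lines each.
my $data = "A,W1,first\nB,W1,x\nT,W1,done\nA,W2,second\nB,W2,y\nT,W2,done\n";
open(my $fh, '<', \$data) or die "Cannot open in-memory data: $!";

my @records;
{
    local $/ = "\nA,";               # same record separator as in the thread
    while (my $record = <$fh>) {
        chomp $record;               # removes a trailing "\nA," (the current $/)
        $record =~ s/\n\z//;         # the last record ends in a plain newline instead
        # Every record after the first lost its leading "A," to the one before it.
        $record = "A,$record" unless $record =~ /^A,/;
        push @records, $record;
    }
}
close $fh;

print "[$_]\n" for @records;
```

After this, each element of @records starts with its own A line and ends with its T line, so the records match the A-to-T block structure described in the question.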

> Finally, the final assumption, that "The push assumes that there are always
> exactly 5 records between B and email and that this is the only line with a
> B in the record (and that it comes before the lines with ADV_)". I think
> that this is correct.

Well, good :)

> I tested the script, and I was able to output e-mail addresses. However,
> using the data that I posted, it does not quite output exactly what I need.
> Based on this sample of order.csv and the script that you sent me (I added
> the line "print @email" to view the output):
>
> for (my $i=0; $i<=$#fields; $i++){
> if ($fields[$i] eq "B") {$b_index=$i; next;}
> elsif ($fields[$i] =~ /^ADV_.*/) {push @email, $fields[$b_index+4];
> last;}
1> print @email;
> ):
>
> What is going wrong? Am I trying to view the output incorrectly?

The line marked 1 is still inside the for loop. So you print all emails seen
so far for every field the split gave you.
Code with more debug in the right place:
---------------------------
#!/usr/bin/perl
use strict;
use warnings;

my @email;
open(FH, '<', 'complex.txt') or die "Cannot open complex.txt: $!";
local $/ = "\nA,";                   # make "\nA," the record separator
while (<FH>) {                       # read the next record into $_
    print "This record holds:\n$_ \n";
    my @fields = split /,|\n/, $_;   # split at , or \n
    my $b_index;                     # undef again for every new record
    for (my $i=0; $i<=$#fields; $i++){
        if ($fields[$i] eq "B") {$b_index=$i; next;}
        elsif ($fields[$i] =~ /^ADV_/) {push @email, $fields[$b_index+4]; last;}
    } # end for
    print "End of record.\n\n";
} # end while
close(FH);
print "@email\n"; # last line in script
-----------------------------
On my box that prints the 2 emails you wanted.
I hope I didn't get something totally screwed up.
Let me know if that does it or not. Thx,
Wolf
-
Andrew Hughes #5
RE: complex data file parsing
Thanks so much. I've been tinkering around with this all afternoon. I
think that it is there. I'm going to mess around with it more over the
weekend.
I'll let you know how it goes.
Thanks so much, Wolf!
Andrew