complex data file parsing


  1. #1

    complex data file parsing

    I am trying to make sense of a comma delimited log file in which multiple
    lines make up 1 record. Here is an example:

    A,W29073,Thu Apr 05 15:25:08 2001
    B,W29073,Scott,S,ser@aq.com,249 Tah Ave,,Sth San Francisco,CA,~US,55555-5555
    P,W29073,
    X,W29073,Company Name,A,Department Name,San Francisco 00),Purchase Order Number,254
    S,W29073,UPS Next Day Air,Scott S,2 Tah Ave,,Sth San Francisco,CA,~US,55555-5555
    I,W29073,AVHQ_101090lfbl,6.000,$28.50,$171.00,,,,1.00,,2,0
    I,W29073,AVHQ_101090xlfbl,4.000,$28.50,$114.00,,,,1.00,,3,0
    T,W29073,$285.00,,,,$53.09,$338.09,,10.00,
    A,W29101,Wed Apr 11 07:43:33 2001
    B,W29101,harold,m,HMA@masnc.net,10 wind ridge parkway,,Atlanta,GA,~US,55555
    P,W29101,
    X,W29101,Company Name,,Department Name,,Purchase Order Number,10252
    S,W29101,UPS Regular Ground,harold m,10 wind ridge parkway,,Atlanta,GA,~US,55555
    I,W29101,ADV_Carb-Natxxl,1.000,$16.50,$16.50,,,,1.50,,4
    T,W29101,$17.50,,7.000,$1.23,$9.28,$28.01,,1.50,
    A,W29116,Thu Apr 12 11:42:21 2001
    B,W29116,test,test,test@test.com,test,,test,GA,~US,11111
    P,W29116,Credit,Offline,Visa,4444444444444444,04/04,,,,
    X,W29116,Company Name,,Department Name,,Purchase Order Number,
    S,W29116,UPS Regular Ground,test test,test,,test,GA,~US,11111
    I,W29116,ADV_1601,1.000,$14.00,$14.00,,,,1.50,,3
    T,W29116,$14.00,,7.000,$0.98,$9.94,$24.92,,1.50,

    Here's what I know:

    I am trying to get a list of e-mail addresses for people who have ordered
    products that begin with ADV.

    I know that the second field is the order number that ties all of the lines
    for one order together.

    I know that each block always starts with an A in the first position of the
    first line and ends with a T in the last position of the last line.

    I know that the second line starts with a B, and the data in the 5th field
    on this line is the e-mail address, which is what I ultimately want.
    However...

    The products that begin with ADV can appear in multiple I lines, so you can
    never predict how many lines make up one order block.

    I could handle all of this if each complete record were on its own line.
    The problem is that the records are split across multiple lines, and the
    number of lines grows with the number of line items (I rows) on the order.

    Can anyone offer me some direction? Should I try to leave these lines
    separate? Should I try to start each line with A and then put each of the
    subsequent lines end to end until I hit another "A"?

    Thanks,
    Andrew

    Andrew Hughes
    Insider's Advantage
    Webmaster
    Phone: (404) 575-6389
    Fax: (404) 575-6374

    "Online ordering is now available. Visit http://insidersadvantage.com for
    details."
    Andrew Hughes Guest

  3. #2

    Re: complex data file parsing

    Hi,
    > I know that each block always starts with an A in the first position of
    > the first line and ends with a T in the last position of the last line.
    Isn't it a T in the first position of the last row of the set?
    > I know that the second line starts with a B, and the data in the 5th field
    > on this line is the e-mail address, which is what I ultimately want.
    > However...
    Is that the only line with a B at the beginning in the set?
    > I am trying to get a list of e-mail addresses for people who have ordered
    > products that begin with ADV. These can appear in multiple I lines.
    > Therefore you can never predict how many lines make up 1 order block.
    What about:

    #!/usr/bin/perl
    use strict;
    use warnings;
    my @email;

    open(FH, '<', 'complex.txt') or die "Cannot open complex.txt: $!";

    local $/ = "\nA,";    # make "\nA," the record separator

    while (<FH>) {                         # read the next record
        my @fields = split /,|\n/, $_;     # split at , or \n
        my $b_index;                       # undefined again for every new record
        for (my $i = 0; $i <= $#fields; $i++) {
            if ($fields[$i] eq "B") { $b_index = $i; next; }
            elsif (defined $b_index && $fields[$i] =~ /^ADV_/) {
                push @email, $fields[$b_index + 4];    # e-mail sits 4 fields after the B
                last;
            }
        }
    }
    close(FH);

    This works on the sample you provided.

    $/ (see perlvar) is the input record separator, usually \n.

    If T really were the last character in the last row of the set, you could
    use "T\n" as $/.
    The way I do it assumes that the first, and only the first, line of each set
    begins with an A (and it falsely puts that A at the end of the previous
    record, but that doesn't matter for the aim here, does it?).
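
    To see where a multi-character $/ actually cuts the input, here is a
    minimal sketch. It reads from an in-memory string rather than complex.txt,
    and the two tiny orders in $data are made up for illustration only. With
    $/ set, chomp removes the trailing separator, so the stray "\nA," can be
    dropped from the end of each record.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two tiny made-up orders, just enough to show where the records are cut.
my $data = "A,W1,Mon\nB,W1,x\nT,W1,\nA,W2,Tue\nB,W2,y\nT,W2,\n";

# Read from an in-memory string instead of a real file.
open(my $fh, '<', \$data) or die "Cannot open in-memory file: $!";

my @records;
{
    local $/ = "\nA,";      # record separator, limited to this block
    while (my $record = <$fh>) {
        chomp $record;      # chomp strips a trailing $/, i.e. the stray "\nA,"
        push @records, $record;
    }
}
close $fh;

print scalar(@records), " records\n";
print "[$_]\n" for @records;
```

    Note that every record after the first loses its leading "A," because the
    separator is consumed together with the previous record. The last record
    keeps its final newline, since it does not end in "\nA,".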


    The push assumes that there are always exactly 5 fields from the B through
    the e-mail, and that this is the only line with a B in the record (and that
    it comes before the lines with ADV_).

    A lot of assumptions.
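
    One way to lean on fewer of these assumptions is to leave $/ alone, read
    the file line by line, and use the order number in the second field to tie
    lines together. The following is only a sketch: the sub name adv_emails is
    hypothetical, the sample lines are shortened from the data in this thread,
    and it still assumes the B line of an order comes before its I lines.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the e-mail of every order that has at least one I line whose
# product code starts with "ADV". Takes the log lines as a list.
sub adv_emails {
    my @lines = @_;
    my (%email_for, %seen, @result);

    for my $line (@lines) {
        chomp $line;
        my ($type, $order, @rest) = split /,/, $line;
        next unless defined $order;          # skip blank lines

        if ($type eq 'B') {
            $email_for{$order} = $rest[2];   # 5th field of the B line
        }
        elsif ($type eq 'I' && defined $rest[0] && $rest[0] =~ /^ADV/) {
            next if $seen{$order}++;         # report each order only once
            push @result, $email_for{$order} if defined $email_for{$order};
        }
    }
    return @result;
}

# Shortened sample lines taken from the data posted in this thread.
my @sample = (
    'A,W29073,Thu Apr 05 15:25:08 2001',
    'B,W29073,Scott,S,ser@aq.com,249 Tah Ave,,Sth San Francisco,CA,~US,55555-5555',
    'I,W29073,AVHQ_101090lfbl,6.000,$28.50,$171.00,,,,1.00,,2,0',
    'T,W29073,$285.00,,,,$53.09,$338.09,,10.00,',
    'A,W29101,Wed Apr 11 07:43:33 2001',
    'B,W29101,harold,m,HMA@masnc.net,10 wind ridge parkway,,Atlanta,GA,~US,55555',
    'I,W29101,ADV_Carb-Natxxl,1.000,$16.50,$16.50,,,,1.50,,4',
    'T,W29101,$17.50,,7.000,$1.23,$9.28,$28.01,,1.50,',
    'A,W29116,Thu Apr 12 11:42:21 2001',
    'B,W29116,test,test,test@test.com,test,,test,GA,~US,11111',
    'I,W29116,ADV_1601,1.000,$14.00,$14.00,,,,1.50,,3',
    'T,W29116,$14.00,,7.000,$0.98,$9.94,$24.92,,1.50,',
);

print "$_\n" for adv_emails(@sample);
```

    Lines whose first field is neither B nor I are simply ignored, so a
    fragment left over from a wrapped line is harmless unless it happens to
    start with "B," or "I,".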

    I'm sure there are better ways to do that, but it might be a start.
    > "Online ordering is now available. Visit http://insidersadvantage.com for
    > details."
    Uh, judging from your question, I'd better not, eh?

    Good luck, Wolf

    Wolf Blaum Guest

  4. #3

    RE: complex data file parsing

    Thanks for the information. That was much more than I expected.

    You're right about the T line. That was a typo. The T is in the first
    position of the last line of each order block.

    As for your follow-up question on the B lines, "only line with a B in the
    beginning in set?", I'm not sure I understand. If you mean that there
    will only be 1 line per order (set of lines A-T) with a B in the first
    position, you are correct.

    Also, as for your assumption, "The way I do it assumes that the first, and
    only the first, line of each set begins with an A (and it falsely puts that
    A at the end of the previous record, but that doesn't matter for the aim
    here, does it?)", I'm not sure what you mean by this either. However, it
    sounds like you have it correct. Lines that indicate the beginning of an
    order block will only ever start with an A in the first position.

    Finally, the last assumption, that "The push assumes that there are always
    exactly 5 fields from the B through the e-mail, and that this is the only
    line with a B in the record (and that it comes before the lines with
    ADV_)": I think that this is correct. An example line is
    "B,W29116,test,test,test@test.com,". The positions are 0,1,2,3,4, so that
    equals 5, and it will ALWAYS be five. Finally, the B line will ALWAYS come
    before the ADV_ lines. This appears to be correct, judging by the fact that
    the output of the script is e-mail addresses.

    I tested the script, and I was able to output e-mail addresses. However,
    using the data that I posted, it does not quite output exactly what I need.
    Based on this sample of order.csv and the script that you sent me (I added
    the line "print @email" to view the output):

    for (my $i=0; $i<=$#fields; $i++){
    if ($fields[$i] eq "B") {$b_index=$i; next;}
    elsif ($fields[$i] =~ /^ADV_.*/) {push @email, $fields[$b_index+4];
    last;}
    print @email;
    ):

    A,W29073,Thu Apr 05 15:25:08 2001
    B,W29073,Scott,S,ser@aq.com,249 Tah Ave,,Sth San Francisco,CA,~US,55555-5555
    P,W29073,
    X,W29073,Company Name,A,Department Name,San Francisco 00),Purchase Order Number,254
    S,W29073,UPS Next Day Air,Scott S,2 Tah Ave,,Sth San Francisco,CA,~US,55555-5555
    I,W29073,AVHQ_101090lfbl,6.000,$28.50,$171.00,,,,1.00,,2,0
    I,W29073,AVHQ_101090xlfbl,4.000,$28.50,$114.00,,,,1.00,,3,0
    T,W29073,$285.00,,,,$53.09,$338.09,,10.00,
    A,W29101,Wed Apr 11 07:43:33 2001
    B,W29101,harold,m,HMA@masnc.net,10 wind ridge parkway,,Atlanta,GA,~US,55555
    P,W29101,
    X,W29101,Company Name,,Department Name,,Purchase Order Number,10252
    S,W29101,UPS Regular Ground,harold m,10 wind ridge parkway,,Atlanta,GA,~US,55555
    I,W29101,ADV_Carb-Natxxl,1.000,$16.50,$16.50,,,,1.50,,4
    T,W29101,$17.50,,7.000,$1.23,$9.28,$28.01,,1.50,
    A,W29116,Thu Apr 12 11:42:21 2001
    B,W29116,test,test,test@test.com,test,,test,GA,~US,11111
    P,W29116,Credit,Offline,Visa,4444444444444444,04/04,,,,
    X,W29116,Company Name,,Department Name,,Purchase Order Number,
    S,W29116,UPS Regular Ground,test test,test,,test,GA,~US,11111
    I,W29116,ADV_1601,1.000,$14.00,$14.00,,,,1.50,,3
    T,W29116,$14.00,,7.000,$0.98,$9.94,$24.92,,1.50,

    I would expect to see:

    HMA@masnc.nettest@test.com

    However, I see:

    HMA@masnc.netHMA@masnc.netHMA@masnc.netHMA@masnc.netHMA@masnc.netHMA@masnc.n
    etHMA@masnc.netHMA@masnc.netHMA@masnc.netHMA@masnc.netHMA@masnc.netHMA@masnc
    .netHMA@masnc.netHMA@masnc.netHMA@masnc.netHMA@masnc.netHMA@masnc.netHMA@mas
    nc.netHMA@masnc.netHMA@masnc.netHMA@masnc.netHMA@masnc.netHMA@masnc.netHMA@m
    asnc.netHMA@masnc.netHMA@masnc.netHMA@masnc.netHMA@masnc.netHMA@masnc.netHMA
    @masnc.netHMA@masnc.netHMA@masnc.netHMA@masnc.netHMA@masnc.netHMA@masnc.netH
    MA@masnc.netHMA@masnc.netHMA@masnc.netHMA@masnc.netHMA@masnc.netHMA@masnc.ne
    tHMA@masnc.netHMA@masnc.net

    What is going wrong? Am I trying to view the output incorrectly?

    Thanks for any additional direction.

    Andrew



    Andrew Hughes Guest

  5. #4

    Re: complex data file parsing

    Hi,
    > As for your follow-up question on the B lines, "only line with a B in
    > the beginning in set?", I'm not sure I understand. If you mean that
    > there will only be 1 line per order (set of lines A-T) with a B in the
    > first position, you are correct.
    Yes, that's what I meant.
    Sorry about my laziness. Additionally, I get to correct all my embarrassing
    typos...
    > Also, as for your assumption, "The way I do it assumes that the first, and
    > only the first, line of each set begins with an A (and it falsely puts
    > that A at the end of the previous record, but that doesn't matter for the
    > aim here, does it?)", I'm not sure what you mean by this either. However,
    > it sounds like you have it correct. Lines that indicate the beginning of
    > an order block will only ever start with an A in the first position.
    Well, what that $/ = "\nA," does is change the amount of data that
    while (<FH>) reads into $_.
    Usually that is one line. In your case, the change of $/ gets it to read a
    whole order into $_: from A,... to T,... and the end of that line. That's
    what you need. However, I cheat: it actually reads from A,... to T,...\nA,
    into $_, so even though the (A,) belongs to the next record, it ends up in
    the previous one. That's kind of wrong, given your record structure, but it
    does not matter for the purpose you described. See the print of $_ in the
    code below.
    > Finally, the last assumption, that "The push assumes that there are always
    > exactly 5 fields from the B through the e-mail, and that this is the only
    > line with a B in the record (and that it comes before the lines with
    > ADV_)": I think that this is correct.
    Well, good :)
    > I tested the script, and I was able to output e-mail addresses. However,
    > using the data that I posted, it does not quite output exactly what I need.
    > Based on this sample of order.csv and the script that you sent me (I added
    > the line "print @email" to view the output):
    >
    > for (my $i=0; $i<=$#fields; $i++){
    > if ($fields[$i] eq "B") {$b_index=$i; next;}
    > elsif ($fields[$i] =~ /^ADV_.*/) {push @email, $fields[$b_index+4];
    > last;}
    1> print @email;
    > ):
    >
    > What is going wrong? Am I trying to view the output incorrectly?
    The line marked 1 is still inside the for loop, so you print all the e-mails
    collected so far once for every field the split gave you.
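
    The effect can be reproduced in isolation. The sketch below uses two
    made-up records that are already split into fields (the addresses are
    placeholders, not data from this thread). It collects what a print inside
    the for loop would emit and compares it with a single print after all the
    loops have finished.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two made-up records, already split into fields. Only the first one
# contains a product that starts with ADV_.
my @records = (
    [qw(A W1 B W1 ann a ann@example.com I W1 ADV_1 T W1)],
    [qw(A W2 B W2 bob b bob@example.com I W2 XYZ_9 T W2)],
);

my @email;
my $inside = '';    # stands in for the output of "print @email" inside the loop

for my $fields (@records) {
    my $b_index;
    for my $i (0 .. $#$fields) {
        if ($fields->[$i] eq 'B') { $b_index = $i; next; }
        elsif ($fields->[$i] =~ /^ADV_/) {
            push @email, $fields->[$b_index + 4];
            last;
        }
        $inside .= join('', @email);    # runs once per remaining field
    }
}

my $after = join('', @email);           # what one print after the loops gives

print "inside the loop: $inside\n";
print "after the loops: $after\n";
```

    The address collected from the first record is emitted again for almost
    every field of the second record, which matches the long run of repeated
    addresses in the output above.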

    Code with more debug in the right place:

    ---------------------------

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @email;
    open(FH, '<', 'complex.txt') or die "Cannot open complex.txt: $!";

    local $/ = "\nA,";    # make "\nA," the record separator

    while (<FH>) {                         # read the next record
        print "This record holds:\n$_ \n";

        my @fields = split /,|\n/, $_;     # split at , or \n
        my $b_index;                       # undefined again for every new record
        for (my $i = 0; $i <= $#fields; $i++) {
            if ($fields[$i] eq "B") { $b_index = $i; next; }
            elsif (defined $b_index && $fields[$i] =~ /^ADV_/) {
                push @email, $fields[$b_index + 4];
                last;
            }
        } # end for

        print "End of record.\n\n";
    } # end while
    close(FH);

    print "@email\n";    # last line in script

    -----------------------------

    On my box that prints the 2 e-mails you wanted.
    I hope I didn't get something totally wrong.

    Let me know whether that does it or not. Thanks,
    Wolf



    Wolf Blaum Guest

  6. #5

    RE: complex data file parsing

    Thanks so much. I've been tinkering around with this all afternoon. I
    think that it is there. I'm going to mess around with it more over the
    weekend.

    I'll let you know how it goes.

    Thanks so much, Wolf!

    Andrew


    Andrew Hughes Guest
