Make this into a script to parse?

  1. #1

    Make this into a script to parse?

    I'm back to dealing with the main issue of a badly formatted file being
    brought down from an archaic system and needing to be cleaned up before
    being passed to another user or a database table. I have the code
    below, which pulls the whole file in and parses it line by line. The
    problem is still that when the code is done parsing the file, the file
    still has a ton of white space left in it.

    What I would like to do, when I first open the file (in another piece of
    this massive script), is tell it to run a subroutine on each piece
    that does the same thing as the code below. Unfortunately, I am not sure
    how to do this.

    This piece I DO have:
    sub cleanup{

    use strict;

    my $file = "info/bad.sql";
    my $newfile = "info/inventory.sql";
    my $line;

    open (OLDFILE, "< $file");
    open (NEWFILE, "> $newfile");
    while ($line = <OLDFILE>) {
    $line =~ s/^ //mg;
    $line =~ s/ $//mg;
    $line =~ s/\t/|/mg;
    $line =~ s/\s+/ /mg;
    $line =~ s/^\s*//mg;
    $line =~ s/\s*$//mg;
    $line =~ s/\s*$//mg;
    ### The following lines mod the files to reflect inches and feet
    $line =~ s/(?<=\d)"/in. /mg;
    $line =~ s/(?<=\d)'/ft. /mg;
    $line =~ s/^\s+//mg;
    $line =~ s/\s+$//mg;
    # $line =~ s/\s*\|\s*//mg;
    ### $line =~ s/ |/|/mg;
    ### $line =~ s/| /|/mg;

    print NEWFILE "$line\n";
    }
    close OLDFILE;
    close NEWFILE;

    print "$newfile has now been created\n";
    }

    The first pass of the code, which puts a piece of the array of data into
    another location further back in the line:
    sub MySQL_id_data
    {
    $database_file = "info/salesa1";
    open(INF,$database_file) or dienice("Can't open $database_file: $!\n");
    @grok = <INF>;
    close(INF);
    $file1 = "info/salesa1-data";
    open (FILE, ">$file1") || die "Can't write to $file1 : error $!\n";
    $inv = 1;

    foreach $i (@grok)
    {
    chomp($i);

    ($item_num,$item_desc,$b1,$b2,$b3,$b4,$cc,$vn,$qoh,$qc,$qor,$bc,$sc,$yp)
    = split(/\|/,$i);
    print FILE
    "$inv|$item_num|$item_desc|$b1|$b2|$b3|$b4|$cc|$vn|$qoh|$qc|$qor|$bc|$item_num|$sc|$yp\n";
    $inv++;
    }
    close FILE;
    }


    HELP!!

    Thanks,
    Robert

    Lone Wolf Guest

  3. #2

    Re: Make this into a script to parse?

    On Feb 4, Lone Wolf said:
    >I'm back to dealing with the main issue of a badly formatted file being
    >brought down from an archaic system and needing to be cleaned up before
    >being passed to another user or a database table. I have the code
    >below, which pulls the whole file in and parses it line by line. The
    >problem is still that when the code is done parsing the file, the file
    >still has a ton of white spaces left in it.
    > open (OLDFILE, "< $file");
    > open (NEWFILE, "> $newfile");
    > while ($line = <OLDFILE>) {
    > $line =~ s/^ //mg;
    > $line =~ s/ $//mg;
    > $line =~ s/\t/|/mg;
    > $line =~ s/\s+/ /mg;
    > $line =~ s/^\s*//mg;
    > $line =~ s/\s*$//mg;
    > $line =~ s/\s*$//mg;
    These regexes (above and below) have NO need for the /m modifier, and only
    a few of them have any need for the /g modifier.

    $line =~ s/^\s+//; # remove leading spaces
    $line =~ s/\s+$//; # remove trailing spaces
    $line =~ tr/\t/|/; # change all \t's to |'s
    $line =~ tr/ //s; # squash multiple spaces into one space

    Those four lines (two regexes, two transliterations) do what the seven
    lines above them do.
    > $line =~ s/(?<=\d)"/in. /mg;
    > $line =~ s/(?<=\d)'/ft. /mg;
    Still don't need the /m modifier.
    > $line =~ s/^\s+//mg;
    > $line =~ s/\s+$//mg;
    The first one is totally useless, and the second is only needed because
    it's possible $line now ends in "in. ", which means the trailing space
    should be removed. The solution, then, is to do the two \d regexes FIRST,
    and THEN do the other regexes.
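
    The ordering described above could be packaged as a small subroutine,
    roughly like the sketch below. The name clean_line and the sample line
    are assumptions made for illustration, not anything from the real data.
    The extra substitution around the pipes is also an addition: note that
    in the commented-out attempts an unescaped | means alternation, and an
    empty replacement deletes the pipe itself.

```perl
use strict;
use warnings;

# clean_line is a hypothetical helper name, not something from the thread.
sub clean_line {
    my ($line) = @_;
    chomp $line;
    $line =~ s/(?<=\d)"/in. /g;    # 12" becomes 12in. (plus a space)
    $line =~ s/(?<=\d)'/ft. /g;    # 6'  becomes 6ft. (plus a space)
    $line =~ tr/\t/|/;             # tabs become field separators
    $line =~ tr/ //s;              # squeeze runs of spaces into one
    $line =~ s/\s*\|\s*/|/g;       # drop whitespace around each pipe
    $line =~ s/^\s+//;             # leading whitespace
    $line =~ s/\s+$//;             # trailing whitespace, e.g. after "in. "
    return $line;
}

# Made-up sample line with tabs, stray spaces, and inch/foot marks.
print clean_line(qq{  12"\t  foo   bar \t6' \n}), "\n";
```

    Because the unit substitutions run first, the space they append is
    cleaned up by the later whitespace steps instead of needing a second pass.
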
    ># $line =~ s/\s*\|\s*//mg;
    >### $line =~ s/ |/|/mg;
    >### $line =~ s/| /|/mg;
    Are those not needed, or commented out because they're not working
    properly?
    > print NEWFILE "$line\n";
    > }
    > close OLDFILE;
    > close NEWFILE;
    >
    > print "$newfile has now been created\n";
    >}
    >sub MySQL_id_data {
    > $database_file = "info/salesa1";
    > open(INF,$database_file) or dienice("Can't open $database_file: $!\n");
    > @grok = <INF>;
    > close(INF);
    There's no reason to slurp a file into an array. Just loop over the lines
    of the file like you have with the while loop above.
    > $file1 = "info/salesa1-data";
    > open (FILE, ">$file1") || die "Can't write to $file1 : error $!\n";
    > $inv = 1;
    >
    > foreach $i (@grok) {
    > chomp($i);
    >
    >($item_num,$item_desc,$b1,$b2,$b3,$b4,$cc,$vn,$qoh,$qc,$qor,$bc,$sc,$yp)
    >= split(/\|/,$i);
    > print FILE
    >"$inv|$item_num|$item_desc|$b1|$b2|$b3|$b4|$cc|$vn|$qoh|$qc|$qor|$bc|$item_num|$sc|$yp\n";
    > $inv++;
    > }
    Oh good God. Do you know what that for loop is DOING?

    for each element in @grok:
    remove the newline
    split it on pipes into some variables
    print $inv, those variables with pipes in between, and add a newline

    That is terribly insane.
    > close FILE;
    >}
    Here's my rewrite:

    sub MySQL_id_data {
    my $db_file = "info/salesa1";
    my $info_file = "$db_file-data";

    open DB, "< $db_file" or dienice("can't open $db_file: $!");
    open INFO, "> $info_file" or dienice("can't write $info_file: $!");
    print INFO "$.|$_" while <DB>;
    close INFO;
    close DB;
    }

    --
    Jeff "japhy" Pinyan japhy@pobox.com http://www.pobox.com/~japhy/
    RPI Acacia brother #734 http://www.perlmonks.org/ http://www.cpan.org/
    <stu> what does y/// stand for? <tenderpuss> why, yansliterate of course.
    [ I'm looking for programming work. If you like my work, let me know. ]

    Jeff 'Japhy' Pinyan Guest

  4. #3

    Re: Make this into a script to parse?

    For Quality purposes, Lone Wolf's mail on Thursday 05 February 2004 00:52
    may have been monitored or recorded as:
    > I'm back to dealing with the main issue of a badly formatted file being
    > brought down from an archaic system and needing to be cleaned up before
    > being passed to another user or a database table. I have the code
    I assume by saying you are back that you are talking of your thread from 12/17:
    "get rid of whitesace around pipes??".
    > below, which pulls the whole file in and parses it line by line. The
    > problem is still that when the code is done parsing the file, the file
    > still has a ton of white spaces left in it.
    did you try something like
    my @fields = split /\s*\|\s*/, $line;
    as suggested by James, Jeff and Randy?
    Why didn't it work? The problem still looks pretty much the same, doesn't it?
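
    For reference, here is a minimal sketch of how that split behaves. The
    sample line is made up, since no real data has been posted yet. The
    pattern only eats whitespace next to a pipe, so the outer ends of the
    line are trimmed separately before splitting.

```perl
use strict;
use warnings;

# Hypothetical sample line; the real data layout is an assumption.
my $line = "  A100  | 3\" hose clamp |  12 |  \n";

chomp $line;
$line =~ s/^\s+//;    # split cannot strip whitespace before the first field
$line =~ s/\s+$//;    # ... or after the last one
my @fields = split /\s*\|\s*/, $line, -1;    # -1 keeps trailing empty fields

print join('|', @fields), "\n";
```

    Without the -1 limit, split silently drops empty fields at the end of
    the line, which would change the column count for some records.
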
    > What I would like to do is when I first open the file (another piece of
    > this massive script) is tell it to just run a sub program on each piece
    > that does the same thing as the stuff below, unfortunately I am not sure
    > of the way to do this.
    Frankly, after a while of looking at your code I'm still not sure what you
    want to do. That might be due to my ignorance, but you would really help me
    (and I guess others too) understand if you could post some sample data
    before it goes into your program and a line showing how you expect it to
    look after it has been processed by your code. I guess that would make it
    easier to figure out what goes where and how.

    Wolf

    Wolf Blaum Guest

  5. #4

    RE: Make this into a script to parse?

    I tried the my @fields and I did not get it to work, probably because my
    coding skills have not improved enough lately to be worthy of perl.
    Thank goodness I never said I had perfect code, because I would
    definitely be lying.

    I attached 2 files, one the beginning data, the other the .sql file that
    I load into MySQL database. The files are about 3000 lines before and
    after so I cut out the first 30 lines and put them in the files to the
    list.

    What I need to figure out is how to make a sub call that when I pull in
    the file will remove all extraneous white space. Something I can copy
    into another Perl program to parse another set of files (ARGH!). I've
    learned not to tell the bosses I can write a script to handle the errors
    of the salesmen. I currently use a back piece of PHP coding to handle
    the extra spaces in the pages that use the data, but for another project
    I can't use that work-around.

    I know I can do something along the lines of:
    (from an HTML generating page with a sort)

    foreach $i (sort ByName @grok)
    {
    chomp($i);
    ($type,$description,$parts,$numb) = split(/\|/,$i);
    print <<INFO2;

    <tr><td>$type</td><td>$description</td><td>$parts</td><td>$numb</td></tr>
    INFO2
    }

    The sub program:
    sub ByName {
    @a = split(/\|/,$a);
    @b = split(/\|/,$b);
    $a[1] cmp $b[1];
    }

    But I am still not sure how to make the $i go through, and it is
    probably something simple I am missing.

    Thanks!!
    Robert

    Lone Wolf Guest

  6. #5

    Re: Make this into a script to parse?

    On Wed, 4 Feb 2004, Jeff 'japhy' Pinyan wrote:

    <snip>
    > >
    > > foreach $i (@grok) {
    > > chomp($i);
    > >
    > >($item_num,$item_desc,$b1,$b2,$b3,$b4,$cc,$vn,$qoh,$qc,$qor,$bc,$sc,$yp)
    > >= split(/\|/,$i);
    > > print FILE
    > >"$inv|$item_num|$item_desc|$b1|$b2|$b3|$b4|$cc|$vn|$qoh|$qc|$qor|$bc|$item_num|$sc|$yp\n";
    > > $inv++;
    > > }
    >
    > Oh good God. Do you know what that for loop is DOING?
    >
    > for each element in @grok:
    > remove the newline
    > split it on pipes into some variables
    > print $inv, those variables with pipes in between, and add a newline
    >
    > That is terribly insane.
    Jeff, the input and output lines are not identical. The output line
    prefixes $inv at the front and inserts $item_num between $bc and $sc. I
    don't know why $item_num is repeated. Granted, I think a more
    efficient construct might be:

    # $a captures fields 2-12 including their trailing pipe
    my ($item_num,$a,$b) = $i =~ /(.*?)\|((?:.*?\|){11})(.*)/;
    print FILE "$inv|$item_num|$a$item_num|$b\n";

    I think that I have that right. Well, assuming that the original is
    correct.


    --
    Maranatha!
    John McKown

    John McKown Guest

  7. #6

    Re: Make this into a script to parse?

    On Feb 4, John McKown said:
    >On Wed, 4 Feb 2004, Jeff 'japhy' Pinyan wrote:
    >
    >> > foreach $i (@grok) {
    >> > chomp($i);
    >> >
    >> >($item_num,$item_desc,$b1,$b2,$b3,$b4,$cc,$vn,$qoh,$qc,$qor,$bc,$sc,$yp)
    >> >= split(/\|/,$i);
    >> > print FILE
    >> >"$inv|$item_num|$item_desc|$b1|$b2|$b3|$b4|$cc|$vn|$qoh|$qc|$qor|$bc|$item_num|$sc|$yp\n";
    >> > $inv++;
    >> > }
    >>
    >> Oh good God. Do you know what that for loop is DOING?
    >> That is terribly insane.
    >
    >Jeff, The input and output lines are not identical. The output line
    >prefixes $inv at the front and inserts $item_num between $bc and $sc. I
    >don't know why $item_num is repeated. Granted that I think a more
    >efficient construct might be:
    Bah, I missed that. Then I'd use split(), but just use an array.

    while (<IN>) {
    local $" = "|";
    my @fields = split /\|/;
    print OUT "$.|@fields[0..11,0,12..13]";
    }

    But this begs the question: WHY does item_num have to be used TWICE in the
    SAME line of data? This smells of poor coding on the other side. It's
    still ugly.
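
    To see what that slice does, here is a small self-contained sketch. The
    14-field record is made up (the values are just the variable names from
    the original loop), and $inv stands in for $. because no real filehandle
    is being read here.

```perl
use strict;
use warnings;

# Hypothetical 14-field record; the values mirror the original variable names.
my $line = join('|', qw(i100 desc b1 b2 b3 b4 cc vn qoh qc qor bc sc yp)) . "\n";
my $inv  = 1;    # stands in for $. when reading from a real filehandle

chomp $line;
my @fields = split /\|/, $line, -1;    # -1 keeps trailing empty fields

my $out;
{
    local $" = '|';    # separator used when an array is interpolated
    # Fields 0..11, then field 0 (item_num) again, then fields 12 and 13.
    $out = "$inv|@fields[0..11,0,12..13]";
}
print "$out\n";
```

    This variant chomps first and adds the newline back when printing, so
    the line ending does not ride along inside the last field.
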

    Jeff 'Japhy' Pinyan Guest

  8. #7

    Re: Make this into a script to parse?

    For Quality purposes, Lone Wolf's mail on Thursday 05 February 2004 04:23
    may have been monitored or recorded as:

    Hi
    > Thank goodness I never said I had perfect code, because I would
    > definitely be lying.
    No worries - I post code to get feedback. That's the whole idea of learning it.
    > I attached 2 files, one the beginning data, the other the .sql file that
    > I load into MySQL database. The files are about 3000 lines before and
    > after so I cut out the first 30 lines and put them in the files to the
    > list.
    Ok - then, again: Do not read these files into mem at once unless you really
    have to (which should be close to never).

    here is a script that uses your given data:

    ---snip---
    #!/usr/bin/perl

    use strict;
    use warnings;
    my (@fields, $lng);

    opendir INDIR, "./sql" or die "Can't open dir with before files: $!";

    foreach my $infile (grep {!/^\./} readdir INDIR) {
        # read all the files in your home/sql dir
        # read only files that do not start with a .
        my ($i, $rec) = (0, 0);

        open INFILE, "<./sql/$infile" or die "Can't open $infile: $!";
        open OUTFILE, ">./${infile}.out"
            or die "Can't open ${infile}.out at home: $!";
        while (<INFILE>) {
            $rec++;
            chomp;
            @fields = split /\s*\|\s*/, $_;
            $fields[0] =~ s/^\s+//;
            # there is probably a way to get rid of the leading spaces in the
            # first entry using split, I just couldn't think of any.

            $lng = @fields unless $lng;    # set $lng for first record
            # poor quality control of your input data: check if all records
            # have the same number of fields, or print and skip the record.
            print "The following record: $rec has ", scalar @fields,
                  " fields as compared to $lng fields in the first record!",
                  " Skip. : $_\n"
                and next unless $lng == @fields;
            $i++;
            print OUTFILE $i;
            print OUTFILE "|$_" foreach (@fields);
            print OUTFILE "|$fields[0]\n";    # your trailing ID
        }
        close INFILE;
        close OUTFILE;
        print "Read $rec records from ./sql/$infile and printed $i into ./${infile}.out\n";
    }
    closedir INDIR;
    ---snap---

    A couple of hints:

    The script reads all files in the sql subdir of your home dir and produces the
    corresponding filename.out in your homedir.

    The split splits as written by Jeff et al.
    I couldn't think of a better way to strip the leading spaces from the first
    field.
    Does anyone have better suggestions?

    You end up with a final \n in each outfile.

    You rewrite it into a sub by substituting the line
    foreach my $infile (grep {!/^\./} readdir INDIR) {
    with

    sub whatever{
    ....
    foreach my $infile (@_) {

    and call the sub with
    &whatever ("file1", "file2", ...);

    Of course you may want to change the open statements too, if you don't have
    your infiles in ./sql
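
    Put together, a copy-and-paste sub along those lines might look like the
    sketch below. The name clean_file and its two path arguments are
    assumptions for illustration. It uses lexical filehandles and three-argument
    open instead of the barewords above, and it skips blank lines rather than
    comparing field counts.

```perl
use strict;
use warnings;

# clean_file is a hypothetical name; it takes an input path and an output path
# and returns the number of records written.
sub clean_file {
    my ($in_path, $out_path) = @_;
    open my $in,  '<', $in_path  or die "Can't open $in_path: $!";
    open my $out, '>', $out_path or die "Can't write $out_path: $!";
    my $count = 0;
    while (my $line = <$in>) {
        chomp $line;
        $line =~ s/^\s+//;    # whitespace before the first field
        $line =~ s/\s+$//;    # whitespace after the last field
        my @fields = split /\s*\|\s*/, $line, -1;
        next unless @fields;  # skip blank lines
        $count++;
        # running count, the cleaned fields, then the first field repeated
        print {$out} join('|', $count, @fields, $fields[0]), "\n";
    }
    close $in;
    close $out or die "Can't close $out_path: $!";
    return $count;
}
```

    A call would then be something like
    my $written = clean_file("sql/somefile", "somefile.out");
    where both file names are placeholders.
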

    Hope that gets you started, Wolf






    Wolf Blaum Guest

  9. #8

    Re: Make this into a script to parse?

    For Quality purposes, wolf blaum's mail on Thursday 05 February 2004 06:07
    may have been monitored or recorded as:
    > The script reads all files in the sql subdir of your home dir and produces
    > the corresponding filename.out in your homedir.
    Shame on me: of course it reads all the files in the sql subdir of the
    CURRENT DIR, not the home dir. Use $ENV{HOME} (or glob "~/sql") if you want
    your homedir, since open and opendir do not expand ~ themselves.

    Well, I've been here a while...

    Something else I forgot: why do you need the count at the beginning of the
    line? I hope not as a unique (primary) key for the DB table you feed that
    into. There should be an AUTO_INCREMENT in your DB for that.
    And talking about DBs:
    According to the 3rd rule of normalisation as outlined by E. F. Codd of IBM
    in the 1970s (not that I was around at that time...):

    "An entity is said to be in 3rd normal form if it is already in 2nd normal
    form and no nonidentifying attributes are dependent on any other
    nonidentifying attributes."

    The repeat of a value like $fields[0] clearly violates this rule.
    See www.databasejournal.com/sqletc/article.php/1428511
    on DB design.

    Good night, wolf

    Wolf Blaum Guest

  10. #9

    Re: Make this into a script to parse?

    John McKown wrote:
    >
    > my ($item_num,$a,$b) = $i =~ /(.*?)\|((?:.*?\|){11})(.*)/;
    > print FILE "$inv|$item_num|$a$item_num|$b\n";
    >
    > I think that I have that right. Well, assuming that the original is
    > correct.
    No John,

    If you are using $a and $b as variables in any context other than the sort
    built-in function, then you do not have it right. Choose meaningful variable
    names.
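
    As a rough illustration of the point: $a and $b are the package variables
    that sort sets for each comparison, so a comparator should only read them
    and keep its own working values in lexicals. The records and the name
    by_description below are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical pipe-delimited records: type|description|parts|number
my @records = ("bolt|zinc plated|4|17", "nut|brass|2|9", "washer|nylon|8|3");

# sort() sets $a and $b for each comparison; the comparator reads them
# and copies the field it needs into meaningfully named lexicals.
sub by_description {
    my $left  = (split /\|/, $a)[1];
    my $right = (split /\|/, $b)[1];
    return $left cmp $right;
}

my @sorted = sort by_description @records;
print "$_\n" for @sorted;
```

    Outside a comparator like this, any other names will do, for example
    $middle and $rest for the two captured chunks of the line.
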

    Joseph

    R. Joseph Newton Guest
