Michael S. Robeson II #1
formatting the loop
Hi all!
Well, based on the input I have received from everyone thus far I have
been able to cobble the following code together (See below for the
input and output of this script).
Anyway, though it works great I am having a tough time trying to figure
out WHY it works. I am especially having trouble with the line: "next
unless s/^\s*(\S+)//" in relation to the while loop it is in.
Basically, I do not understand how the script is differentiating the
">bob" line in the input from the lines of "agactgatcg" (again see
input and output at bottom). I know that the "$/" has something to do
with this, but I am not sure how or why it works.
I hate to sound like a dummy, but if anyone can help me understand WHAT
the script is doing in the "while loop" I would really appreciate it. I
think if I can understand the mechanics behind this script it will only
help me my future understanding of writing PERL scripts. Especially,
when it comes to regular expressions and loops. Heck, if there is a
better way to do certain parts of this let me know! Also, special
thanks to James Gray for the help thus far!! Till then, I'll be
wracking my head with my PERL books!
The working script:
_________
#!/usr/bin/perl
use warnings;
use strict;
print "Enter the path of the INFILE to be processed:\n";
# For example "rotifer.txt" or "../Desktop/Folder/rotifer.txt"
chomp (my $infile = <STDIN>);
open(INFILE, $infile)
or die "Can't open INFILE for input: $!";
print "Enter in the path of the OUTFILE:\n";
# For example "rotifer_out.txt" or "../Desktop/Folder/rotifer_out.txt"
chomp (my $outfile = <STDIN>);
open(OUTFILE, ">$outfile")
or die "Can't open OUTFILE for input: $!";
print "Enter in the LENGTH you want the sequence to be:\n";
my ( $len ) = <STDIN> =~ /(\d+)/ or die "Invalid length parameter";
print OUTFILE "R 1 $len\n\n\n\n"; # The top of the file.
$/ = '>'; # Set input operator
while ( <INFILE> ) {
chomp;
next unless s/^\s*(\S+)//;
my $name = $1;
my @char = ( /[a-z]/ig, ( '-' ) x $len )[ 0 .. $len - 1 ];
my $sequence = join( ' ', @char);
$sequence =~ tr/Tt/Uu/;
print OUTFILE " $sequence $name\n";
}
close INFILE;
close OUTFILE;
___________
Again this script is to convert the following data existing as either
single line or multiline sequence data:
### input type 1 ###
>bob
atcgactagcatcgatcg
acacgtacgactagcac
>fred
actgactacgatcgaca
acgcgcgatacggcat
#####
or (as I posted originally)
### input type 2 ###
>bob
atcgactagcatcgatcgacacgtacgactagcac
>fred
actgactacgatcgacaacgcgcgatacggcat
#####
###output##
## Note that the T's are converted to U's in the output! ##
R 1 42
a u c g a c u a g c a u c g a u c g a c a c g u a c g a c u a g c a c
- - - - - - - bob
a c u g a c u a c g a u c g a c a a c g c g c g a u a c g g c a u - -
- - - - - - - fred
####
Michael S. Robeson II Guest
-
James Edward Gray II #2
Re: formatting the loop
On Feb 11, 2004, at 1:27 PM, Michael S. Robeson II wrote:
[snip]
> Anyway, though it works great I am having a tough time trying to
> figure out WHY it works.

See comments below, in the code.

[snip]

> I think if I can understand the mechanics behind this script it will
> only help me my future understanding of writing PERL scripts.

Perl. The language you are learning is called Perl, not PERL. :)

[snip]

> The working script:
> _________
>
> #!/usr/bin/perl
>
> use warnings;
> use strict;
>
> print "Enter the path of the INFILE to be processed:\n";
>
> # For example "rotifer.txt" or "../Desktop/Folder/rotifer.txt"
>
> chomp (my $infile = <STDIN>);
>
> open(INFILE, $infile)
> or die "Can't open INFILE for input: $!";
>
> print "Enter in the path of the OUTFILE:\n";
>
> # For example "rotifer_out.txt" or "../Desktop/Folder/rotifer_out.txt"
>
> chomp (my $outfile = <STDIN>);
>
> open(OUTFILE, ">$outfile")
> or die "Can't open OUTFILE for input: $!";
>
> print "Enter in the LENGTH you want the sequence to be:\n";
> my ( $len ) = <STDIN> =~ /(\d+)/ or die "Invalid length parameter";
>
>
> print OUTFILE "R 1 $len\n\n\n\n"; # The top of the file.
>
> $/ = '>'; # Set input operator

Here's most of the magic. This sets Perl's input separator to a >
character. That means that <INFILE> won't return a sequence of
characters ending in a \n like it usually does, but a sequence of
characters ending in a >. It basically jumps name to name, in other
words.

> while ( <INFILE> ) {
> chomp;

chomp() will remove the trailing >.

> next unless s/^\s*(\S+)//;
> my $name = $1;

Well, if we're reading name to name, the thing right a the beginning of
our sequence is going to be a name, right? The above removes the name,
and saves it for later use.

> my @char = ( /[a-z]/ig, ( '-' ) x $len )[ 0 .. $len - 1 ];

If I may, yuck! This builds up a list of all the A-Za-z characters in
the string, adds a boat load of extra - characters, trims the whole
list to the length you want and stuffs all that inside @char. It's
also receives a rank of "awful", on the James Gray Scale of
Readability. ;)

> my $sequence = join( ' ', @char);

join() the sequence on spaces.

> $sequence =~ tr/Tt/Uu/;

Convert formats.

> print OUTFILE " $sequence $name\n";

Send it out.

> }
>
>
> close INFILE;
> close OUTFILE;

Hope that helps.

James
James Edward Gray II Guest
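
The effect of setting $/ to '>' can be seen in isolation with a small sketch. It is not part of the script above: the sample data, the read_chunks name, the in-memory filehandle and the use of "local" are all assumptions made for illustration. It only prints the chunks that <FILEHANDLE> hands back once records end at '>' instead of "\n".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data in the thread's input format.
my $data = ">bob\natcgactagc\nacacgtacga\n>fred\nactgactacg\n";

# Read $text in records that end at each '>' and return the chomped records.
sub read_chunks {
    my ($text) = @_;
    open( my $fh, '<', \$text ) or die "Can't open in-memory file: $!";
    local $/ = '>';    # records now end at '>' instead of "\n"
    my @chunks;
    while ( my $chunk = <$fh> ) {
        chomp $chunk;    # chomp removes a trailing '>' (the current $/)
        push @chunks, $chunk;
    }
    close $fh;
    return @chunks;
}

my @chunks = read_chunks($data);
for my $i ( 0 .. $#chunks ) {
    ( my $shown = $chunks[$i] ) =~ s/\n/\\n/g;    # make newlines visible
    print "chunk $i: '$shown'\n";
}
```

With this data the first chunk is empty (nothing precedes the first '>'), and each later chunk starts with a name, then a newline, then the sequence lines.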
-
Rob Dixon #3
Re: formatting the loop
James Edward Gray II wrote:
> > $/ = '>'; # Set input operator
>
> Here's most of the magic.

Exactly. If you don't believe in magic, don't write in Perl:
most people don't.
Rob
Rob Dixon Guest
-
Michael S. Robeson II #4
Re: formatting the loop
See comments below.
On Feb 11, 2004, at 2:55 PM, James Edward Gray II wrote:
> On Feb 11, 2004, at 1:27 PM, Michael S. Robeson II wrote:
>
> [snip]
>
>> Anyway, though it works great I am having a tough time trying to
>> figure out WHY it works.
>
> See comments below, in the code.
>
> [snip]
>
>> I think if I can understand the mechanics behind this script it will
>> only help me my future understanding of writing PERL scripts.
>
> Perl. The language you are learning is called Perl, not PERL. :)
>

Hehe, thanks. :-)

> [snip]
>

[snip]

>> $/ = '>'; # Set input operator
>
> Here's most of the magic. This sets Perl's input separator to a >
> character. That means that <INFILE> won't return a sequence of
> characters ending in a \n like it usually does, but a sequence of
> characters ending in a >. It basically jumps name to name, in other
> words.
>
>> while ( <INFILE> ) {
>> chomp;
>
> chomp() will remove the trailing >.

OK that makes pretty good sense. I understand that now, I hope. See
next comment.

>> next unless s/^\s*(\S+)//;
>> my $name = $1;
>
> Well, if we're reading name to name, the thing right a the beginning
> of our sequence is going to be a name, right? The above removes the
> name, and saves it for later use.

OK, I think this is were my problem is. That is how does it know that
the characters as in "bob" or "fred" are the names and not mistaking
the sequence of letters "agtcaccgatg" to be placed in memory ($name).
Basically I am reading the following:

next unless s/^\s*(\S+)//;

as "Go to the next line unless you see a line with zero or more
whitespace characters followed by one or more non-whitespace characters
and save the non-whitespace characters in memory." If this is correct
then how can perl tell the difference between the lines containing
"bob" or "fred" (and put then in memory) and the "acgatctagc" (and not
put these in memory) because both lines of data seem to fit the
expression pattern to me. I think it has something to do with how perl
is reading through the file that makes this work?

So, there is something I am "missing", not noticing or realizing here.
Maybe I've been staring at the code for far to long and should take a
break! :-)

>> my @char = ( /[a-z]/ig, ( '-' ) x $len )[ 0 .. $len - 1 ];
>
> If I may, yuck! This builds up a list of all the A-Za-z characters in
> the string, adds a boat load of extra - characters, trims the whole
> list to the length you want and stuffs all that inside @char. It's
> also receives a rank of "awful", on the James Gray Scale of
> Readability. ;)
>

Yeah, I need to clean that up a bit!

[snip]
-Mike
Michael S. Robeson II Guest
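
For anyone puzzling over the array slice quoted above, here is the same idea split into two steps. This is only a sketch: pad_or_trim is a made-up helper name and the sample strings are invented, but the list-plus-dashes-then-slice logic is the one in the original line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pad a list of characters with '-' up to $len, or cut it down to $len.
# (pad_or_trim is a hypothetical name, used here only for illustration.)
sub pad_or_trim {
    my ( $len, @chars ) = @_;

    # Same idea as ( LIST, ( '-' ) x $len )[ 0 .. $len - 1 ]:
    my @padded = ( @chars, ('-') x $len );    # letters followed by $len dashes
    return @padded[ 0 .. $len - 1 ];          # keep only the first $len items
}

my @short = 'acgt'       =~ /[a-z]/ig;    # list of single letters
my @long  = 'acgtacgtac' =~ /[a-z]/ig;

print join( ' ', pad_or_trim( 6, @short ) ), "\n";    # padded with dashes
print join( ' ', pad_or_trim( 6, @long ) ),  "\n";    # cut to six letters
```

Appending $len dashes is always enough padding, which is why the slice never runs off the end of the list, however short the sequence is.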
-
James Edward Gray II #5
Re: formatting the loop
On Feb 11, 2004, at 2:35 PM, Michael S. Robeson II wrote:
>>> next unless s/^\s*(\S+)//;
>>> my $name = $1;
>>
>> Well, if we're reading name to name, the thing right a the beginning
>> of our sequence is going to be a name, right? The above removes the
>> name, and saves it for later use.
>
> OK, I think this is were my problem is. That is how does it know that
> the characters as in "bob" or "fred" are the names and not mistaking
> the sequence of letters "agtcaccgatg" to be placed in memory ($name).
> Basically I am reading the following:
>
> next unless s/^\s*(\S+)//;
>
> as "Go to the next line

Not line. We're not reading lines anymore. We're reading chunks of
characters ending in a >, remember?

> unless you see a line with zero or more whitespace characters followed
> by one or more non-whitespace characters

Not quite. ^ matches at the beginning of our chunk, not the beginning
of a line. It's "unless you start with zero or more whitespace
characters, followed by one or more non-whitespace characters..."
Those "one or more non-whitespace characters" are going to be the name
at the beginning. There's also going to be a \n (a whitespace
character) at the end of that name, to keep it from going into the
sequence.

> and save the non-whitespace characters in memory."

In my English, it reads, "Unless you can rip a name off the front of
this chunk, skip it." ;) So the only time it ever does any skipping,
is if the whole chunk is whitespace (or nothing), which would keep it
from finding a name. I imagine this only skips the very first read,
which probably won't have anything interesting between the front of the
file and the first > character.

> If this is correct then how can perl tell the difference between the
> lines containing "bob" or "fred" (and put then in memory) and the
> "acgatctagc" (and not put these in memory) because both lines of data
> seem to fit the expression pattern to me. I think it has something to
> do with how perl is reading through the file that makes this work?

Yes, it's reading > to >. Also, ^ matches at the beginning of a
string, not a line, by default.

That's why I gave you the paragraph version earlier today. I thought
it was a little easier to follow. ;)

> So, there is something I am "missing", not noticing or realizing here.
> Maybe I've been staring at the code for far to long and should take a
> break! :-)

Definitely. Have a break. It clears the mind. Come back refreshed
and reread this message until you break through the fog.

Or just ask more questions and I'll try again. :D
James
James Edward Gray II Guest
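
The "rip a name off the front of this chunk" reading can be checked with a short sketch. The chunks below are hand-written stand-ins for what the while loop would see after chomp, and split_name is a hypothetical helper, not something from the script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Try to strip a leading name from one chunk.
# Returns (name, rest) on success, or an empty list if no name is found.
# (split_name is a hypothetical helper name.)
sub split_name {
    my ($chunk) = @_;

    # ^ anchors at the start of the whole string (there is no /m flag),
    # \s* skips leading whitespace, (\S+) captures up to the next whitespace.
    return unless $chunk =~ s/^\s*(\S+)//;
    my $name = $1;
    return ( $name, $chunk );
}

# An empty first chunk, then two chunks that each start with a name.
for my $chunk ( '', "bob\natcg\nacac\n", "fred\nactg\n" ) {
    my ( $name, $rest ) = split_name($chunk);
    if ( defined $name ) {
        my @letters = $rest =~ /[a-z]/ig;
        print "name='$name' letters=", join( '', @letters ), "\n";
    }
    else {
        print "skipped (no name found)\n";
    }
}
```

Only the empty chunk is skipped. In the other two, \S+ stops at the newline after the name, so the sequence letters are left behind in the rest of the chunk.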
-
Michael S. Robeson II #6
Re: formatting the loop
On Feb 11, 2004, at 2:55 PM, James Edward Gray II wrote:
[snip]

> my @char = ( /[a-z]/ig, ( '-' ) x $len )[ 0 .. $len - 1 ];
>
> If I may, yuck! This builds up a list of all the A-Za-z characters in
> the string, adds a boat load of extra - characters, trims the whole
> list to the length you want and stuffs all that inside @char. It's
> also receives a rank of "awful", on the James Gray Scale of
> Readability. ;)

[snip]

Ok, now I understand. I found that my problem was with how the "next"
command was operating in conjunction with the grouping of characters.
Ok, making progress. :-)
Now, about that array slice I have:
my @char = ( /[a-z]/ig, ( '-' ) x $len) [0 .. $len - 1];
I know it wastes a lot of memory and makes perl do much extra work.
However, when I try to replace that line with something like this:
my @char = ( /[a-z]/ig, ( '-' ) x ($len - length) ;
it doesn't work the way I thought it would (gee what a thought). I
would like to express the code similar to
( '-' ) x ($len - length)
because it is easy for me to read and it tells you clearly what is
going on. However, every time I try to implement something like that I
get unexpected output or I have to really rewrite the loop, which I
have been unable to troubleshoot, as you have been seeing. :-) I think
the 'length' command is also counting any '\n' characters or something,
because my output ends up with different lengths like this when I use
the ($len - length) way:
a c u g a c g a g u - - - - - - - - bob
a c u g a c u a g c u g - - - - - - - fred
with this input:
>bob
actgacgagt
>fred
actgactagctg
The reason I went with /[a-z]/ig is because some sequence data uses
other letters to denote ambiguity and other things. I guess I can only
list the letters it uses but I was just lazy and typed in the entire
range of "a to z".
I will be continuing to work on it but here is the code as it stands
now (with that awful array slice).
#!/usr/bin/perl
use warnings;
use strict;
print "Enter the path of the INFILE to be processed:\n";
# For example "rotifer.txt" or "../Desktop/Folder/rotifer.txt"
chomp (my $infile = <STDIN>);
open(INFILE, $infile)
or die "Can't open INFILE for input: $!";
print "Enter in the path of the OUTFILE:\n";
# For example "rotifer_out.txt" or "../Desktop/Folder/rotifer_out.txt"
chomp (my $outfile = <STDIN>);
open(OUTFILE, ">$outfile")
or die "Can't open OUTFILE for input: $!";
print "Enter in the LENGTH you want the sequence to be:\n";
my ( $len ) = <STDIN> =~ /(\d+)/ or die "Invalid length parameter";
print OUTFILE "R 1 $len\n\n\n\n"; # The top of the file is supposed
$/ = '>'; # Set input operator
while ( <INFILE> ) {
chomp;
next unless s/^\s*(.+)//; # delete name and place in memory
my $name = $1; # what ever in memory saved as $name
my @char = ( /[a-z]/ig, ( '-' ) x $len) [0 .. $len -1]; # take only sequence letters
# and add '-' to the end
my $sequence = join( ' ', @char); # turn into scalar
$sequence =~ tr/Tt/Uu/; # convert T's to U's
print OUTFILE " $sequence $name\n";
}
close INFILE;
close OUTFILE;
Michael S. Robeson II Guest
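
The suspicion above that length is counting "\n" characters can be tested with a sketch like the following. The two chunks are guesses at what $_ holds after the name has been removed, and the target length of 20 is assumed purely for illustration, since the post does not say what length was entered. count_chars is a hypothetical helper.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return what length() sees and how many sequence letters there are.
# (count_chars is a hypothetical helper name.)
sub count_chars {
    my ($chunk) = @_;
    my $letters = () = $chunk =~ /[a-z]/ig;    # count the letter matches
    return ( length($chunk), $letters );
}

my $len = 20;    # assumed target length, for illustration only

# Assumed leftovers after the name is stripped: the newline that followed
# the name is still there, and the last chunk may lack a final newline.
for my $chunk ( "\nactgacgagt\n", "\nactgactagctg" ) {
    my ( $total, $letters ) = count_chars($chunk);
    printf "length=%d letters=%d dashes_via_length=%d dashes_needed=%d\n",
        $total, $letters, $len - $total, $len - $letters;
}
```

Under those assumptions, length counts the newlines as well as the letters, so $len - length comes out too small by the number of newlines in each chunk. That would explain rows of unequal width.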
-
James Edward Gray II #7
Re: formatting the loop
On Feb 12, 2004, at 10:06 AM, Michael S. Robeson II wrote:
> On Feb 11, 2004, at 2:55 PM, James Edward Gray II wrote:
>
> [snip]
>
>> my @char = ( /[a-z]/ig, ( '-' ) x $len )[ 0 .. $len - 1 ];
>>
>> If I may, yuck! This builds up a list of all the A-Za-z characters
>> in the string, adds a boat load of extra - characters, trims the
>> whole list to the length you want and stuffs all that inside @char.
>> It's also receives a rank of "awful", on the James Gray Scale of
>> Readability. ;)
>
> [snip]
>
> Ok, now I understand. I found that my problem was with how the "next"
> command was operating in conjunction with the grouping of characters.
> Ok, making progress. :-)

Excellent. I knew we would get there.

> Now, about that array slice I have:
>
> my @char = ( /[a-z]/ig, ( '-' ) x $len) [0 .. $len - 1];
>
> I know it wastes a lot of memory and makes perl do much extra work.
> However, when I try to replace that line with something like this:

Ah, it's pretty small potatoes to quibble over, really. I don't think
it's in any danger of slowing your code significantly or making you buy
more RAM.

> my @char = ( /[a-z]/ig, ( '-' ) x ($len - length) ;
>
> it doesn't work the way I thought it would (gee what a thought). I
> would like to express the code similar to ( '-' ) x ($len - length)

Na, something like this won't work because you won't know the length of
those characters, until you stick them somewhere. Length by default
works on $_, which still contains a big mess of sequence characters and
whitespace.

I think your big hang up is trying to do it all in one line. Two or
three is fine, right? <laughs> And of course, there's nothing wrong
with the current solution. It does work. You only need to replace it
if you want to. There's always more than one way. Use what you like.
I'll see if I can add a suggestion below...

> #!/usr/bin/perl
>
> use warnings;
> use strict;
>
> print "Enter the path of the INFILE to be processed:\n";
>
> # For example "rotifer.txt" or "../Desktop/Folder/rotifer.txt"
>
> chomp (my $infile = <STDIN>);
>
> open(INFILE, $infile)
> or die "Can't open INFILE for input: $!";
>
> print "Enter in the path of the OUTFILE:\n";
>
> # For example "rotifer_out.txt" or "../Desktop/Folder/rotifer_out.txt"
>
> chomp (my $outfile = <STDIN>);
>
> open(OUTFILE, ">$outfile")
> or die "Can't open OUTFILE for input: $!";
>
> print "Enter in the LENGTH you want the sequence to be:\n";
> my ( $len ) = <STDIN> =~ /(\d+)/ or die "Invalid length parameter";
>
>
> print OUTFILE "R 1 $len\n\n\n\n"; # The top of the file is supposed
>
> $/ = '>'; # Set input operator
>
> while ( <INFILE> ) {
> chomp;
> next unless s/^\s*(.+)//; # delete name and place in memory
> my $name = $1; # what ever in memory saved as $name

Right here, $_ holds our sequence, plus some junk. We can just work
with $_ then, if we want to.

> my @char = ( /[a-z]/ig, ( '-' ) x $len) [0 .. $len -1]; # take
> only sequence letters and
> # and add '-' to the end
> my $sequence = join( ' ', @char); # turn into scalar
>

Alternative to the above two lines:

tr/A-Za-z//cd; # remove junk from $_
$_ .= '-' x ($len - length) if length() < $len; # add dashes
s/\b|\B/ /g; # space out

> $sequence =~ tr/Tt/Uu/; # convert T's to U's

Then this would become:

tr/Tt/Uu/;

> print OUTFILE " $sequence $name\n";

And this:

print OUTFILE "$_ $name\n";

> }
>
>
> close INFILE;
> close OUTFILE;

Hope that helps.

James
James Edward Gray II Guest
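
Pulling the suggested replacement lines together, a self-contained sketch might look like this. It works on a named variable instead of $_ so it can be called as a function; format_sequence and the sample chunk are assumptions for illustration, and the four statements are otherwise the ones suggested above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Format one chunk whose name has already been removed.
# (format_sequence is a hypothetical name; the steps follow the post above.)
sub format_sequence {
    my ( $seq, $len ) = @_;
    $seq =~ tr/A-Za-z//cd;    # delete everything that is not a letter
    $seq .= '-' x ( $len - length($seq) )    # pad with dashes, but only
        if length($seq) < $len;              # when the sequence is short
    $seq =~ s/\b|\B/ /g;      # insert a space at every position
    $seq =~ tr/Tt/Uu/;        # convert T's to U's
    return $seq;
}

# Brackets make any leading or trailing spaces visible.
print "[", format_sequence( "\nactgacgagt\n", 14 ), "]\n";
```

Two differences from the array-slice version are worth keeping in mind. The s/\b|\B/ /g step puts a space at the very start and end of the string as well as between characters, and a sequence longer than $len is left at its full length, because nothing here trims it.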




