Michael S. Robeson II #1
formatting and syntax
Hi, I am still too new to Perl and I am having trouble
formatting my data into a new format. So here is my problem:
I have data (DNA sequence) in a file that looks like this:
####
# Infile
####
>bob
AGTGATGCCGACG
>fred
ACGCATATCGCAT
>jon
CAGTACGATTTATC
and I need it converted to:
####
# Outfile
####
R 1 20

 A G U G A U G C C G A C G - - - - - - -       bob
 A C G C A U A U C G C A U - - - - - - -       fred
 C A G U A C G A U U U A U C - - - - - -       jon
The "R 1" is static and should always appear. The "20" at the top of
the new file should be a number defined by the user, that is they
should be prompted for the length they wish the sequence to be. That is
the total length of the sequence plus the added dashes could be 20 or
3000 or whatever. So, if they type 20 and there are only 10 letters in
that row then the script should add 10 dashes to bring that total up to
the 20 chosen by the user.
Note that there should be a space between all letters and dashes -
including a space at the beginning. Then there are supposed to be 7
spaces after the sequence string followed by the name as shown in the
example output file above. Also, of note is the fact that all of the
T's are changed to U's. For those of you that know biology I am not
only switching formats of the data but also changing DNA to RNA.
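As a rough illustration of the row format described above (not code from the thread), the rules can be collected into one small subroutine. The name `format_row` and the decision to truncate sequences longer than the requested length are assumptions made for this sketch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): build one output row.
sub format_row {
    my ($seq, $name, $len) = @_;
    $seq =~ tr/T/U/;                          # DNA to RNA: every T becomes U
    $seq = substr($seq, 0, $len);             # assumption: cut sequences longer than $len
    $seq .= '-' x ($len - length $seq);       # pad with dashes up to $len characters
    my $spaced = join ' ', split //, $seq;    # one space between all characters
    return ' ' . $spaced . (' ' x 7) . $name; # leading space, 7 spaces, then the name
}

print format_row('AGTGATGCCGACG', 'bob', 20), "\n";
# prints " A G U G A U G C C G A C G - - - - - - -       bob"
```

Keeping the formatting in one subroutine means the reading loop only has to work out which line is a name and which is a sequence.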
I hope I am explaining this clearly enough, but here (see below) is as
far as I can get with the code. I just do not know how to structure the
loop/code to do this. I always have trouble manipulating data the
way I want when it comes to a loop. I would prefer code that is easy to
understand over code that is efficient. This way I can learn the
simple stuff first and learn the shortcuts later. Thanks to anyone who
can help.
- Cheers!
- Mike
######
#!/usr/bin/perl
use warnings;
use strict;
print "Enter the path of the INFILE to be processed:\n";
# For example "rotifer.txt" or "../Desktop/Folder/rotifer.txt"
chomp (my $infile = <STDIN>);
open(INFILE, $infile)
or die "Can't open INFILE for input: $!";
print "Enter in the path of the OUTFILE:\n";
# For example "rotifer_out.txt" or "../Desktop/Folder/rotifer_out.txt"
chomp (my $outfile = <STDIN>);
open(OUTFILE, ">$outfile")
  or die "Can't open OUTFILE for output: $!";
print "Enter in the LENGTH you want the sequence to be:\n";
my ( $len ) = <STDIN> =~ /(\d+)/ or die "Invalid length parameter";
print OUTFILE "R 1 $len\n\n\n\n"; # The top of the file is supposed
# type of loop or structure to follow ?????
#############
Michael S. Robeson II Guest
-
Rob Dixon #2
Re: formatting and syntax
Michael S. Robeson II wrote:
>
> Hi, I am still too new to Perl and I am having trouble
> formatting my data into a new format. So here is my problem:
>
> I have data (DNA sequence) in a file that looks like this:
[snip]
Please don't talk about interesting stuff like DNA sequences on a Perl
group. We need less distraction.
Rob
Rob Dixon Guest
-
James Edward Gray II #3
Re: formatting and syntax
On Feb 4, 2004, at 11:35 AM, Michael S. Robeson II wrote:
> Hi, I am still too new to Perl and I am having trouble
> formatting my data into a new format. So here is my problem:
>
> I have data (DNA sequence) in a file that looks like this:
>
> ####
> # Infile
> ####
> >bob
> AGTGATGCCGACG
> >fred
> ACGCATATCGCAT
> >jon
> CAGTACGATTTATC
>
> and I need it converted to:
>
> ####
> # Outfile
> ####
> R 1 20
>
>  A G U G A U G C C G A C G - - - - - - -       bob
>  A C G C A U A U C G C A U - - - - - - -       fred
>  C A G U A C G A U U U A U C - - - - - -       jon
>
>
> The "R 1" is static and should always appear. The "20" at the top of
> the new file should be a number defined by the user, that is they
> should be prompted for the length they wish the sequence to be. That
> is the total length of the sequence plus the added dashes could be 20
> or 3000 or whatever. So, if they type 20 and there are only 10 letters
> in that row then the script should add 10 dashes to bring that total
> up to the 20 chosen by the user.
>
> Note that there should be a space between all letters and dashes -
> including a space at the beginning. Then there are supposed to be 7
> spaces after the sequence string followed by the name as shown in the
> example output file above. Also, of note is the fact that all of the
> T's are changed to U's. For those of you that know biology I am not
> only switching formats of the data but also changing DNA to RNA.
>
> I hope I am explaining this clearly enough, but here (see below) is as
> far as I can get with the code. I just do not know how to structure
> the loop/code to do this. I always have trouble manipulating data
> the way I want when it comes to a loop. I would prefer code that is
> easy to understand over code that is efficient. This way I can learn
> the simple stuff first and learn the shortcuts later. Thanks to
> anyone who can help.
>
> - Cheers!
> - Mike
>
> ######
> #!/usr/bin/perl
> use warnings;
> use strict;
>
> print "Enter the path of the INFILE to be processed:\n";
>
> # For example "rotifer.txt" or "../Desktop/Folder/rotifer.txt"
>
> chomp (my $infile = <STDIN>);
>
> open(INFILE, $infile)
>   or die "Can't open INFILE for input: $!";
>
> print "Enter in the path of the OUTFILE:\n";
>
> # For example "rotifer_out.txt" or "../Desktop/Folder/rotifer_out.txt"
>
> chomp (my $outfile = <STDIN>);
>
> open(OUTFILE, ">$outfile")
>   or die "Can't open OUTFILE for output: $!";
>
> print "Enter in the LENGTH you want the sequence to be:\n";
> my ( $len ) = <STDIN> =~ /(\d+)/ or die "Invalid length parameter";
>
>
> print OUTFILE "R 1 $len\n\n\n\n"; # The top of the file is supposed
my $name;
while (<INFILE>) {
    chomp;
    if (/^>(\w+)/) { $name = $1; }
    else {
        tr/T/U/;    # convert Ts to Us
        substr($_, $len) = '' if length($_) > $len;    # shorten, if needed
        $_ .= '-' x ($len - length($_)) if length($_) < $len;    # lengthen, if needed
        s/\b|\B/ /g;    # add spaces
        print OUTFILE "$_ $name\n";    # print
    }
}
Hope that helps.
James
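A side note on the `s/\b|\B/ /g` line in the reply above, offered as a sketch rather than as part of James's answer: every position in a string is either a word boundary (`\b`) or not one (`\B`), so the substitution puts a space at every position, including before the first character and after the last. The two helper names below are made up for the comparison.

```perl
use strict;
use warnings;

# Hypothetical helper: the substitution used in the reply above.
sub space_out_regex {
    my ($s) = @_;
    $s =~ s/\b|\B/ /g;    # matches (with zero width) at every position
    return $s;
}

# Hypothetical helper: the same result built with split and join.
sub space_out_join {
    my ($s) = @_;
    return ' ' . join(' ', split //, $s) . ' ';
}

print '[', space_out_regex('AGU--'), "]\n";   # [ A G U - - ]
print '[', space_out_join('AGU--'),  "]\n";   # [ A G U - - ]
```

The split/join form is longer but may be easier to read for someone new to regular expressions.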
James Edward Gray II Guest
-
David #4
Re: formatting and syntax
Michael S. Robeson II wrote:
> I have data (DNA sequence) in a file that looks like this:
>
> ####
> # Infile
> ####
> >bob
> AGTGATGCCGACG
> >fred
> ACGCATATCGCAT
> >jon
> CAGTACGATTTATC
>
> and I need it converted to:
>
> ####
> # Outfile
> ####
> R 1 20
>
>  A G U G A U G C C G A C G - - - - - - -       bob
>  A C G C A U A U C G C A U - - - - - - -       fred
>  C A G U A C G A U U U A U C - - - - - -       jon
there are many ways of doing that. here is one:
#!/usr/bin/perl -w
use strict;
#--
#-- discard the first 3 header lines
#--
<DATA> for 1..3;
#--
#-- read each ' >'
#--
$/ = ' >';
while(<DATA>){
    next unless(my($n,$s) = /(.+)\n(.+)/);
    #--
    #-- pad dna sequence to 20 bytes and translate T to U
    #-- here, you will prompt the user to enter a number instead
    #--
    ($s .= '-'x(20-length($s))) =~ y/T/U/;
    #--
    #-- put space after each character
    #--
    $s =~ s/./$& /g;
    print "$s\t$n\n";
}
__DATA__
####
# Infile
####
 >bob
AGTGATGCCGACG
 >fred
ACGCATATCGCAT
 >jon
CAGTACGATTTATC
__END__
prints:
A G U G A U G C C G A C G - - - - - - - bob
A C G C A U A U C G C A U - - - - - - - fred
C A G U A C G A U U U A U C - - - - - - jon
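For readers unfamiliar with `$/`, here is a small sketch of the record-separator trick used above, run against an in-memory string so it can be tried without a data file. The subroutine name `read_records` and the sample text are assumptions for illustration; the leading space before each `>` mirrors the `' >'` separator in the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: split text into [name, sequence] records on ' >'.
sub read_records {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Can't open in-memory file: $!";
    local $/ = ' >';              # records now end at ' >' instead of at a newline
    my @records;
    while (my $chunk = <$fh>) {
        chomp $chunk;             # chomp removes the current $/ (' >'), if present
        next unless $chunk =~ /(.+)\n(.+)/;   # skips the empty first record
        push @records, [$1, $2];  # name line, then sequence line
    }
    return @records;
}

my $sample = " >bob\nAGTGATGCCGACG\n >fred\nACGCATATCGCAT\n";   # made-up input
printf "%s => %s\n", @$_ for read_records($sample);
```

Using `local` confines the changed separator to the subroutine, so other reads in the program still work line by line.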
david
--
sub'_{print"@_ ";* \ = * __ ,\ & \}
sub'__{print"@_ ";* \ = * ___ ,\ & \}
sub'___{print"@_ ";* \ = * ____ ,\ & \}
sub'____{print"@_,\n"}&{_+Just}(another)->(Perl)->(Hacker)
David Guest
-
Jeff 'Japhy' Pinyan #5
Re: formatting and syntax
On Feb 4, Michael S. Robeson II said:
> >bob
>AGTGATGCCGACG
> >fred
>ACGCATATCGCAT
> >jon
>CAGTACGATTTATC
>
>R 1 20
>
> A G U G A U G C C G A C G - - - - - - -       bob
> A C G C A U A U C G C A U - - - - - - -       fred
> C A G U A C G A U U U A U C - - - - - -       jon
>
>
>The "R 1" is static and should always appear. The "20" at the top of
>the new file should be a number defined by the user, that is they
>should be prompted for the length they wish the sequence to be. That is
>the total length of the sequence plus the added dashes could be 20 or
>3000 or whatever. So, if they type 20 and there are only 10 letters in
>that row then the script should add 10 dashes to bring that total up to
>the 20 chosen by the user.
I'll provide one way to do this:
  # assuming $size has the number entered by the user
  while (<FILE>) {
    my ($name) = / >(.+)/;     # get the line name
    chomp(my $DNA = <FILE>);   # get the next line (the DNA)
    # add $size - length() dashes to the end of $DNA
    $DNA .= "-" x ($size - length $DNA);
    # print the DNA with spaces, then a tab, then the name
    print join(" ", split //, $DNA), "\t$name\n";
  }
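The loop above reads the file two lines at a time. As a hedged variation (not Jeff's code), the sketch below does the same pairing but skips lines that are not name lines and stops cleanly if the file ends after a name. The subroutine name `read_pairs`, the optional leading whitespace in the pattern, and the sample text are assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: return [name, sequence] pairs from a filehandle.
sub read_pairs {
    my ($fh) = @_;
    my @pairs;
    while (defined(my $header = <$fh>)) {
        my ($name) = $header =~ /^\s*>(.+)/ or next;   # skip anything that is not a name line
        my $dna = <$fh>;
        last unless defined $dna;                      # name line with no sequence after it
        chomp $dna;
        push @pairs, [$name, $dna];
    }
    return @pairs;
}

my $sample = " >bob\nAGTGATGCCGACG\n >fred\nACGCATATCGCAT\n";   # made-up input
open my $in, '<', \$sample or die "Can't open in-memory file: $!";
print "$_->[0]: $_->[1]\n" for read_pairs($in);
```

Returning the pairs instead of printing inside the loop keeps the parsing separate from the output format.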
--
Jeff "japhy" Pinyan      japhy@pobox.com      http://www.pobox.com/~japhy/
RPI Acacia brother #734   http://www.perlmonks.org/   http://www.cpan.org/
<stu> what does y/// stand for? <tenderpuss> why, yansliterate of course.
[ I'm looking for programming work. If you like my work, let me know. ]
Jeff 'Japhy' Pinyan Guest
-
R. Joseph Newton #6
Re: formatting and syntax
"Michael S. Robeson II" wrote:
> Hi, I am still too new to Perl and I am having trouble
> formatting my data into a new format. So here is my problem:
>
> I have data (DNA sequence) in a file that looks like this:
>
> ####
> # Infile
> ####
> >bob
> AGTGATGCCGACG
> >fred
> ACGCATATCGCAT
> >jon
> CAGTACGATTTATC
Good, we can see the input structure here. What jumps out at me is that the
input file comes in pairs of lines. You will want to structure your input
routine to read and handle the lines by the pair, then.
>
>
> and I need it converted to:
>
> ####
> # Outfile
> ####
> R 1 20
>
>  A G U G A U G C C G A C G - - - - - - -       bob
>  A C G C A U A U C G C A U - - - - - - -       fred
>  C A G U A C G A U U U A U C - - - - - -       jon
[snip - a picture is worth a thousand words, and you showed us the picture
above.]
Well we have a fairly simple problem here, I'd say:
Greetings! E:\d_drive\perlStuff\giffy>perl -w
my $sequence_length = 20;
my $line = <DATA>;
chomp $line;
while ($line) {
   my $sequence_tag = trim_line($line);
   $line = <DATA>;
   chomp $line;
   my @nucleotides = split //, $line;
   push @nucleotides, '_' for (1..($sequence_length - @nucleotides));
   print join(' ', @nucleotides), " $sequence_tag\n";
   $line = <DATA>;
   chomp $line;
}
sub trim_line {
   my $in_line = shift;
   $in_line =~ s/^ >//;
   chomp $in_line;
   return $in_line;
}
__DATA__
 >bob
AGTGATGCCGACG
A G T G A T G C C G A C G _ _ _ _ _ _ _ bob
 >fred
ACGCATATCGCAT
A C G C A T A T C G C A T _ _ _ _ _ _ _ fred
 >jon
CAGTACGATTTATC
C A G T A C G A T T T A T C _ _ _ _ _ _ jon
or, better yet...
Greetings! E:\d_drive\perlStuff\giffy>perl -w
my $sequence_length = 20;
my $line = <DATA>;
chomp $line;
while ($line) {
   my $sequence_tag = trim_line($line);
   $line = <DATA>;
   chomp $line;
   $line = print_underscore_padded($line, $sequence_length, $sequence_tag);
}
sub trim_line {
   my $in_line = shift;
   $in_line =~ s/^ >//;
   chomp $in_line;
   return $in_line;
}
sub print_underscore_padded {
   my ($line, $sequence_length, $sequence_tag) = @_;
   my @nucleotides = split //, $line;
   push @nucleotides, '_' for (1..($sequence_length - @nucleotides));
   print join(' ', @nucleotides), " $sequence_tag\n";
   $line = <DATA>;
   chomp $line;
   return $line;
}
__DATA__
 >bob
AGTGATGCCGACG
A G T G A T G C C G A C G _ _ _ _ _ _ _ bob
 >fred
ACGCATATCGCAT
A C G C A T A T C G C A T _ _ _ _ _ _ _ fred
 >jon
CAGTACGATTTATC
C A G T A C G A T T T A T C _ _ _ _ _ _ jon
Does that help?
Joseph
R. Joseph Newton Guest
-
Jeff 'Japhy' Pinyan #7
Re: formatting and syntax
On Feb 5, R. Joseph Newton said:
>my $sequence_length = 20;
>my $line = <DATA>;
>chomp $line;
>while ($line) {
>   my $sequence_tag = trim_line($line);
>   $line = <DATA>;
>   chomp $line;
>   my @nucleotides = split //, $line;
>   push @nucleotides, '_' for (1..($sequence_length - @nucleotides));
I'd be in favor of:
   push @nucleotides, ('_') x ($sequence_length - @nucleotides);
The 'x' operator on a list returns the list elements repeated the
specified number of times.
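A short sketch of the difference between list repetition and string repetition may help here. The helper name `pad_list` and the guard against over-long lists are additions for this illustration, not something from the posts.

```perl
use strict;
use warnings;

# Hypothetical helper: pad a list of characters with '_' up to $len items.
sub pad_list {
    my ($len, @items) = @_;
    # ('_') x N repeats the one-element list N times; the guard avoids
    # asking for a negative count when the list is already long enough.
    push @items, ('_') x ($len - @items) if @items < $len;
    return @items;
}

my @padded = pad_list(6, split //, 'ACGU');
print "@padded\n";                 # A C G U _ _

# Without the parentheses, x repeats a string instead of a list.
my $string = '_' x 3;
print "$string\n";                 # ___
```

The parentheses on the left of `x` are what select list repetition; `'_' x 3` without them builds the single string `___`.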
>__DATA__
> >bob
>AGTGATGCCGACG
>A G T G A T G C C G A C G _ _ _ _ _ _ _ bob
Ack. You're mixing the input with the output!
--
Jeff "japhy" Pinyan      japhy@pobox.com      http://www.pobox.com/~japhy/
RPI Acacia brother #734   http://www.perlmonks.org/   http://www.cpan.org/
<stu> what does y/// stand for? <tenderpuss> why, yansliterate of course.
[ I'm looking for programming work. If you like my work, let me know. ]
Jeff 'Japhy' Pinyan Guest
-
James Edward Gray II #8
Re: formatting and syntax
(redirected to Perl Beginners by James)
On Feb 11, 2004, at 10:34 AM, Michael S. Robeson II wrote:
> Hey, thanks again for the perl code.
You're welcome, but let's keep our discussion on the mailing list so we
can all help and learn.
> However, I forgot to take into account that the original input file
> can look one of two ways:
Ah, the old switcheroo. Gotcha. <laughs>
> >bob
> atcgactagcatcgatcg
> acacgtacgactagcac
>
> >fred
> actgactacgatcgaca
> acgcgcgatacggcat
>
> or (as I posted originally)
>
> >bob
> atcgactagcatcgatcgacacgtacgactagcac
>
> >fred
> actgactacgatcgacaacgcgcgatacggcat
>
> to be output as:
>
> R 1 42
> a t c g a c t a g c a t c g a t c g a c a c g t a c g a c t a g c a c
> - - - - - - - bob
> a c t g a c t a c g a t c g a c a a c g c g c g a t a c g g c a t - -
> - - - - - - - fred
How about this time I give you the code to parse the two types of input
and you tie it in with the parts we've already figured out to get the
right output? Just shout if you run into more problems.
James
#!/usr/bin/perl
use strict;
use warnings;
local $/ = '';    # use "paragraph mode"
while (<DATA>) {
    unless (s/^>(.+?)\s*\n//) {    # find and remove the name
        warn "Skipping unknown format: $_";
        next;
    }
    my $name = $1;    # save name
    tr/\n //d;    # join multi-line sequences
    print "Name: $name, Sequence: $_\n";    # show off our progress
}
__DATA__
>bob
atcgactagcatcgatcg
acacgtacgactagcac

>fred
actgactacgatcgaca
acgcgcgatacggcat

>bob
atcgactagcatcgatcgacacgtacgactagcac

>fred
actgactacgatcgacaacgcgcgatacggcat
James Edward Gray II Guest
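One possible way to tie the paragraph-mode parsing to the padded output is sketched below. This is only an illustration under assumptions: the subroutine names `parse_records` and `padded_row`, the tab before the name, and the short sample sequences are all made up here, and records are assumed to be separated by blank lines as paragraph mode requires.

```perl
use strict;
use warnings;

# Hypothetical helper: parse ">name" records in paragraph mode and
# return [name, sequence] pairs with multi-line sequences joined.
sub parse_records {
    my ($fh) = @_;
    local $/ = '';                                 # paragraph mode
    my @records;
    while (my $para = <$fh>) {
        next unless $para =~ s/^>(.+?)\s*\n//;     # find and remove the name line
        my $name = $1;
        $para =~ tr/\n //d;                        # join the sequence lines
        push @records, [$name, $para];
    }
    return @records;
}

# Hypothetical helper: pad with dashes to $len, then space the characters out.
sub padded_row {
    my ($name, $seq, $len) = @_;
    my $missing = $len - length $seq;
    $seq .= '-' x $missing if $missing > 0;
    return join(' ', split //, $seq) . "\t$name";
}

my $sample = ">bob\natcg\nacac\n\n>fred\nactgacgc\n";   # made-up short sequences
open my $in, '<', \$sample or die "Can't open in-memory file: $!";
print padded_row(@$_, 10), "\n" for parse_records($in);
```

Because both input layouts reduce to the same name and sequence pair after parsing, the formatting step does not need to know which layout the file used.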




