formatting and syntax


  1. #1

    formatting and syntax

    Hi, I am still too new to Perl and I am having trouble converting my
    data into a new format. So here is my problem:

    I have data (DNA sequence) in a file that looks like this:

    ####
    # Infile
    ####
    >bob
    AGTGATGCCGACG
    >fred
    ACGCATATCGCAT
    >jon
    CAGTACGATTTATC

    and I need it converted to:

    ####
    # Outfile
    ####
    R 1 20

     A G U G A U G C C G A C G - - - - - - -       bob
     A C G C A U A U C G C A U - - - - - - -       fred
     C A G U A C G A U U U A U C - - - - - -       jon


    The "R 1" is static and should always appear. The "20" at the top of
    the new file should be a number defined by the user, that is they
    should be prompted for the length they wish the sequence to be. That is
    the total length of the sequence plus the added dashes could be 20 or
    3000 or whatever. So, if they type 20 and there are only 10 letters in
    that row then the script should add 10 dashes to bring that total up to
    the 20 chosen by the user.

    Note that there should be a space between all letters and dashes -
    including a space at the beginning. Then there are supposed to be 7
    spaces after the sequence string followed by the name as shown in the
    example output file above. Also, of note is the fact that all of the
    T's are changed to U's. For those of you that know biology I am not
    only switching formats of the data but also changing DNA to RNA.

    I hope I am explaining this clearly enough, but here (see below) is as
    far as I can get with the code. I just do not know how to structure the
    loop/code to do this. I always have trouble manipulating data the
    way I want when it comes to a loop. I would prefer easier-to-understand
    code rather than efficient code. This way I can learn the
    simple stuff first and learn the short-cuts later. Thanks to anyone who
    can help.

    - Cheers!
    - Mike

    ######
    #!/usr/bin/perl
    use warnings;
    use strict;

    print "Enter the path of the INFILE to be processed:\n";

    # For example "rotifer.txt" or "../Desktop/Folder/rotifer.txt"

    chomp (my $infile = <STDIN>);

    open(INFILE, '<', $infile)
    or die "Can't open INFILE for input: $!";

    print "Enter in the path of the OUTFILE:\n";

    # For example "rotifer_out.txt" or "../Desktop/Folder/rotifer_out.txt"

    chomp (my $outfile = <STDIN>);

    open(OUTFILE, '>', $outfile)
    or die "Can't open OUTFILE for output: $!";

    print "Enter in the LENGTH you want the sequence to be:\n";
    my ( $len ) = <STDIN> =~ /(\d+)/ or die "Invalid length parameter";


    print OUTFILE "R 1 $len\n\n\n\n"; # The top of the file is supposed

    # type of loop or structure to follow ?????

    #############
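
    For reference, the row format described above can be sketched as one
    small helper. This is only an illustration of the stated rules (T to U,
    pad with dashes up to the chosen length, a space before every
    character, seven spaces, then the name). The name format_row is
    hypothetical, and leaving over-long sequences unpadded and untruncated
    is an assumption, since the post does not say what should happen to
    them.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: build one output row from a name and a sequence.
    sub format_row {
        my ($name, $seq, $len) = @_;
        $seq =~ tr/T/U/;                          # DNA -> RNA
        my $missing = $len - length $seq;
        $seq .= '-' x $missing if $missing > 0;   # pad with dashes up to $len
        # put a space in front of every character, including the first one
        my $spaced = join '', map { " $_" } split //, $seq;
        return $spaced . (' ' x 7) . $name;       # seven spaces, then the name
    }

    print format_row('bob', 'AGTGATGCCGACG', 20), "\n";
    ```

    Run as is, this prints the "bob" row of the example output: thirteen
    letters, seven dashes, seven spaces and the name.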

    Michael S. Robeson II Guest

  3. #2

    Re: formatting and syntax

    Michael S. Robeson II wrote:
    >
    > Hi I am all still to new to PERL and I am having trouble playing with
    > formatting my data into a new format. So here is my problem:
    >
    > I have data (DNA sequence) in a file that looks like this:
    [snip]

    Please don't talk about interesting stuff like DNA sequences on a Perl
    group. We need less distraction.

    Rob


    Rob Dixon Guest

  4. #3

    Re: formatting and syntax


    On Feb 4, 2004, at 11:35 AM, Michael S. Robeson II wrote:
    > Hi I am all still to new to PERL and I am having trouble playing with
    > formatting my data into a new format. So here is my problem:
    >
    > I have data (DNA sequence) in a file that looks like this:
    >
    > ####
    > # Infile
    > ####
    > >bob
    > AGTGATGCCGACG
    > >fred
    > ACGCATATCGCAT
    > >jon
    > CAGTACGATTTATC
    >
    > and I need it converted to:
    >
    > ####
    > # Outfile
    > ####
    > R 1 20
    >
    > A G U G A T G C C G A C G - - - - - - - bob
    > A C G C A U A U C G C A U - - - - - - - fred
    > C A G U A C G A U U U A U C - - - - - - jon
    >
    >
    > The "R 1" is static and should always appear. The "20" at the top of
    > the new file should be a number defined by the user, that is they
    > should be prompted for the length they wish the sequence to be. That
    > is the total length of the sequence plus the added dashes could be 20
    > or 3000 or whatever. So, if they type 20 and there is only 10 letters
    > in that row then the script should add 10 dashes to bring that total
    > up to the 20 chosen by the user.
    >
    > Note that there should be a space between all letters and dashes -
    > including a space at the beginning. Then there are supposed to be 7
    > spaces after the sequence string followed by the name as shown in the
    > example output file above. Also, of note is the fact that all of the
    > T's are changed to U's. For those of you that know biology I am not
    > only switching formats of the data but also changing DNA to RNA.
    >
    > I hope I am explaining this clear enough, but here (see below) is as
    > far as I can get with the code. I just do not know how to structure
    > the loop/code to do this. I always have trouble with manipulating data
    > the way I want when it comes to a loop. I would prefer an easier to
    > understand code rather than an efficient code. This way I can learn
    > the simple stuff first and learn the short-cuts later. Thanks to
    > anyone who can help.
    >
    > - Cheers!
    > - Mike
    >
    > ######
    > #!/usr/bin/perl
    > use warnings;
    > use strict;
    >
    > print "Enter the path of the INFILE to be processed:\n";
    >
    > # For example "rotifer.txt" or "../Desktop/Folder/rotifer.txt"
    >
    > chomp (my $infile = <STDIN>);
    >
    > open(INFILE, $infile)
    > or die "Can't open INFILE for input: $!";
    >
    > print "Enter in the path of the OUTFILE:\n";
    >
    > # For example "rotifer_out.txt" or "../Desktop/Folder/rotifer_out.txt"
    >
    > chomp (my $outfile = <STDIN>);
    >
    > open(OUTFILE, ">$outfile")
    > or die "Can't open OUTFILE for input: $!";
    >
    > print "Enter in the LENGTH you want the sequence to be:\n";
    > my ( $len ) = <STDIN> =~ /(\d+)/ or die "Invalid length parameter";
    >
    >
    > print OUTFILE "R 1 $len\n\n\n\n"; # The top of the file is supposed

    my $name;
    while (<INFILE>) {
        chomp;
        if (/^>(\w+)/) { $name = $1; }    # header line: remember the name
        else {
            tr/T/U/;                                               # convert Ts to Us
            substr($_, $len) = '' if length($_) > $len;            # shorten, if needed
            $_ .= '-' x ($len - length($_)) if length($_) < $len;  # lengthen, if needed
            s/\b|\B/ /g;                                           # add spaces
            print OUTFILE "$_ $name\n";                            # print
        }
    }
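
    A small sketch of what the s/\b|\B/ /g substitution above does,
    compared with the more common join/split idiom. Every position in a
    string is either a word boundary or a non-boundary, so the pattern
    matches (with zero width) everywhere, including the very start and the
    very end. The helper names space_everywhere and space_between are
    made up for this illustration.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: insert a space at every position, ends included.
    sub space_everywhere {
        my ($seq) = @_;
        $seq =~ s/\b|\B/ /g;
        return $seq;
    }

    # Hypothetical helper: insert a space only between characters.
    sub space_between {
        my ($seq) = @_;
        return join ' ', split //, $seq;
    }

    print '[', space_everywhere('ACU--'), "]\n";   # [ A C U - - ]
    print '[', space_between('ACU--'), "]\n";      # [A C U - -]
    ```

    The first form gives the leading space the output format asks for, at
    the cost of a trailing space before the name.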

    Hope that helps.

    James

    James Edward Gray II Guest

  5. #4

    Re: formatting and syntax

    Michael S. Robeson II wrote:
    > I have data (DNA sequence) in a file that looks like this:
    >
    > ####
    > # Infile
    > ####
    > >bob
    > AGTGATGCCGACG
    > >fred
    > ACGCATATCGCAT
    > >jon
    > CAGTACGATTTATC
    >
    > and I need it converted to:
    >
    > ####
    > # Outfile
    > ####
    > R 1 20
    >
    > A G U G A T G C C G A C G - - - - - - - bob
    > A C G C A U A U C G C A U - - - - - - - fred
    > C A G U A C G A U U U A U C - - - - - - jon
    there are many ways of doing that. here is one:

    #!/usr/bin/perl -w
    use strict;

    #--
    #-- discard the first 3 header lines
    #--
    <DATA> for 1..3;

    #--
    #-- read one record per '>'
    #--
    $/ = '>';

    while(<DATA>){

        next unless(my($n,$s) = /(.+)\n(.+)/);

        #--
        #-- pad dna sequence to 20 bytes and translate T to U
        #-- here, you will prompt the user to enter a number instead
        #--
        ($s .= '-'x(20-length($s))) =~ y/T/U/;

        #--
        #-- put space after each character
        #--
        $s =~ s/./$& /g;

        print "$s\t$n\n";
    }

    __DATA__
    ####
    # Infile
    ####
    >bob
    AGTGATGCCGACG
    >fred
    ACGCATATCGCAT
    >jon
    CAGTACGATTTATC

    __END__

    prints:

    A G U G A U G C C G A C G - - - - - - - bob
    A C G C A U A U C G C A U - - - - - - - fred
    C A G U A C G A U U U A U C - - - - - - jon
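
    The record-separator trick can also be tried without a data file. The
    following is a sketch only: it assumes records are delimited by a bare
    '>' and reads from an in-memory filehandle, and read_records is a
    hypothetical helper name.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: split FASTA-like text into [name, sequence] pairs
    # by reading with '>' as the input record separator.
    sub read_records {
        my ($text) = @_;
        open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
        local $/ = '>';                 # each read now ends at the next '>'
        my @records;
        while (my $chunk = <$fh>) {
            chomp $chunk;               # chomp strips a trailing '>' (the current $/)
            next unless $chunk =~ /\A(.+)\n(.+)/;   # name line, then sequence line
            push @records, [$1, $2];
        }
        close $fh;
        return @records;
    }

    my @records = read_records(">bob\nAGTGATGCCGACG\n>fred\nACGCATATCGCAT\n");
    print "$_->[0]: $_->[1]\n" for @records;
    ```

    The first read returns just ">" and is skipped, which is why the loop
    needs the "next unless" guard.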

    david
    --
    sub'_{print"@_ ";* \ = * __ ,\ & \}
    sub'__{print"@_ ";* \ = * ___ ,\ & \}
    sub'___{print"@_ ";* \ = * ____ ,\ & \}
    sub'____{print"@_,\n"}&{_+Just}(another)->(Perl)->(Hacker)
    David Guest

  6. #5

    Re: formatting and syntax

    On Feb 4, Michael S. Robeson II said:
    > >bob
    >AGTGATGCCGACG
    > >fred
    >ACGCATATCGCAT
    > >jon
    >CAGTACGATTTATC
    >R 1 20
    >
    > A G U G A T G C C G A C G - - - - - - - bob
    > A C G C A U A U C G C A U - - - - - - - fred
    > C A G U A C G A U U U A U C - - - - - - jon
    >
    >
    >The "R 1" is static and should always appear. The "20" at the top of
    >the new file should be a number defined by the user, that is they
    >should be prompted for the length they wish the sequence to be. That is
    >the total length of the sequence plus the added dashes could be 20 or
    >3000 or whatever. So, if they type 20 and there is only 10 letters in
    >that row then the script should add 10 dashes to bring that total up to
    >the 20 chosen by the user.
    I'll provide one way to do this:

    # assuming $size has the number entered by the user

    while (<FILE>) {
        my ($name) = /^>(.+)/; # get the line name
        chomp(my $DNA = <FILE>); # get the next line (the DNA)

        # add $size - length() dashes to the end of $DNA
        $DNA .= "-" x ($size - length $DNA);

        # print the DNA with spaces, then a tab, then the name
        print join(" ", split //, $DNA), "\t$name\n";
    }

    --
    Jeff "japhy" Pinyan japhy@pobox.com http://www.pobox.com/~japhy/
    RPI Acacia brother #734 http://www.perlmonks.org/ http://www.cpan.org/
    <stu> what does y/// stand for? <tenderpuss> why, yansliterate of course.
    [ I'm looking for programming work. If you like my work, let me know. ]

    Jeff 'Japhy' Pinyan Guest

  7. #6

    Re: formatting and syntax

    "Michael S. Robeson II" wrote:
    > Hi I am all still to new to PERL and I am having trouble playing with
    > formatting my data into a new format. So here is my problem:
    >
    > I have data (DNA sequence) in a file that looks like this:
    >
    > ####
    > # Infile
    > ####
    > >bob
    > AGTGATGCCGACG
    > >fred
    > ACGCATATCGCAT
    > >jon
    > CAGTACGATTTATC
    Good, we can see the input structure here. What jumps out at me is that the
    input file comes in pairs of lines. You will want to structure your input
    routine to read and handle the lines by the pair, then.
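
    One way to read the input a pair at a time might look like the sketch
    below. It assumes every name line starts with '>' and is followed by
    exactly one sequence line; read_pairs is a hypothetical helper name,
    and the in-memory filehandle just stands in for the real input file.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: read name/sequence line pairs from a filehandle.
    sub read_pairs {
        my ($fh) = @_;
        my @pairs;
        while (defined(my $header = <$fh>)) {
            chomp $header;
            next unless $header =~ /^>(.+)/;   # skip anything that is not a name line
            my $name = $1;
            my $seq  = <$fh>;                  # the sequence is the very next line
            last unless defined $seq;          # name with no sequence: stop quietly
            chomp $seq;
            push @pairs, [$name, $seq];
        }
        return @pairs;
    }

    my $input = ">bob\nAGTGATGCCGACG\n>fred\nACGCATATCGCAT\n";
    open my $fh, '<', \$input or die "Cannot open in-memory handle: $!";
    print "$_->[0] => $_->[1]\n" for read_pairs($fh);
    close $fh;
    ```

    Testing the header with defined() and checking the second read
    separately avoids warnings when the file ends after a name line.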
    >
    >
    > and I need it converted to:
    >
    > ####
    > # Outfile
    > ####
    > R 1 20
    >
    > A G U G A T G C C G A C G - - - - - - - bob
    > A C G C A U A U C G C A U - - - - - - - fred
    > C A G U A C G A U U U A U C - - - - - - jon
    >
    [snip - a picture is worth a thousand words, and you showed us the picture
    above.]

    Well we have a fairly simple problem here, I'd say:

    Greetings! E:\d_drive\perlStuff\giffy>perl -w
    my $sequence_length = 20;
    my $line = <DATA>;
    chomp $line;
    while ($line) {
        my $sequence_tag = trim_line($line);
        $line = <DATA>;
        chomp $line;
        my @nucleotides = split //, $line;
        push @nucleotides, '_' for (1..($sequence_length - @nucleotides));
        print join(' ', @nucleotides), " $sequence_tag\n";
        $line = <DATA>;
        chomp $line;
    }

    sub trim_line {
        my $in_line = shift;
        $in_line =~ s/^>//;   # strip the leading '>' from the name line
        chomp $in_line;
        return $in_line;
    }

    __DATA__
    >bob
    AGTGATGCCGACG
    A G T G A T G C C G A C G _ _ _ _ _ _ _ bob
    >fred
    ACGCATATCGCAT
    A C G C A T A T C G C A T _ _ _ _ _ _ _ fred
    >jon
    CAGTACGATTTATC
    C A G T A C G A T T T A T C _ _ _ _ _ _ jon

    or, better yet...

    Greetings! E:\d_drive\perlStuff\giffy>perl -w
    my $sequence_length = 20;
    my $line = <DATA>;
    chomp $line;
    while ($line) {
        my $sequence_tag = trim_line($line);
        $line = <DATA>;
        chomp $line;
        $line = print_underscore_padded($line, $sequence_length, $sequence_tag);
    }

    sub trim_line {
        my $in_line = shift;
        $in_line =~ s/^>//;   # strip the leading '>' from the name line
        chomp $in_line;
        return $in_line;
    }

    sub print_underscore_padded {
        my ($line, $sequence_length, $sequence_tag) = @_;
        my @nucleotides = split //, $line;
        push @nucleotides, '_' for (1..($sequence_length - @nucleotides));
        print join(' ', @nucleotides), " $sequence_tag\n";
        $line = <DATA>;       # fetch the next name line for the caller
        chomp $line;
        return $line;
    }

    __DATA__
    >bob
    AGTGATGCCGACG
    A G T G A T G C C G A C G _ _ _ _ _ _ _ bob
    >fred
    ACGCATATCGCAT
    A C G C A T A T C G C A T _ _ _ _ _ _ _ fred
    >jon
    CAGTACGATTTATC
    C A G T A C G A T T T A T C _ _ _ _ _ _ jon


    Does that help?

    Joseph

    R. Joseph Newton Guest

  8. #7

    Re: formatting and syntax

    On Feb 5, R. Joseph Newton said:
    >my $sequence_length = 20;
    >my $line = <DATA>;
    >chomp $line;
    >while ($line) {
    > my $sequence_tag = trim_line($line);
    > $line = <DATA>;
    > chomp $line;
    > my @nucleotides = split //, $line;
    > push @nucleotides, '_' for (1..($sequence_length - @nucleotides));
    I'd be in favor of:

    push @nucleotides, ('_') x ($sequence_length - @nucleotides);

    The 'x' operator on a list returns the list elements repeated the
    specified number of times.
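
    A quick sketch of the difference the parentheses make, using a short
    made-up sequence and length:

    ```perl
    use strict;
    use warnings;

    my @nucleotides     = split //, 'ACGU';
    my $sequence_length = 7;

    # List repetition: the parentheses make ('_') a list, so x yields 3 elements.
    push @nucleotides, ('_') x ($sequence_length - @nucleotides);
    print scalar(@nucleotides), " elements: @nucleotides\n";   # 7 elements: A C G U _ _ _

    # Scalar repetition: without the parentheses, x builds one string.
    my $padding = '_' x 3;
    print "one string: $padding\n";                            # one string: ___
    ```

    If the array is already longer than $sequence_length, the count is
    negative and x returns an empty list, so nothing is pushed.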
    >__DATA__
    > >bob
    >AGTGATGCCGACG
    >A G T G A T G C C G A C G _ _ _ _ _ _ _ bob
    Ack. You're mixing the input with the output!


    Jeff 'Japhy' Pinyan Guest

  9. #8

    Re: formatting and syntax

    (redirected to Perl Beginners by James)

    On Feb 11, 2004, at 10:34 AM, Michael S. Robeson II wrote:
    > Hey, thanks again for the perl code.
    You're welcome, but let's keep our discussion on the mailing list so we
    can all help and learn.
    > However, I forgot to take into account that the original input file
    > can look one of two ways:
    Ah, the old switcheroo. Gotcha. <laughs>
    > >bob
    > atcgactagcatcgatcg
    > acacgtacgactagcac
    >
    > >fred
    > actgactacgatcgaca
    > acgcgcgatacggcat
    >
    > or (as I posted originally)
    >
    > >bob
    > atcgactagcatcgatcgacacgtacgactagcac
    >
    > >fred
    > actgactacgatcgacaacgcgcgatacggcat
    >
    > to be out put as:
    >
    > R 1 42
    > a t c g a c t a g c a t c g a t c g a c a c g t a c g a c t a g c a c
    > - - - - - - - bob
    > a c t g a c t a c g a t c g a c a a c g c g c g a t a c g g c a t - -
    > - - - - - - - fred
    How about this time I give you the code to parse the two types of input
    and you tie it in with the parts we've already figured out to get the
    right output? Just shout if you run into more problems.

    James

    #!/usr/bin/perl

    use strict;
    use warnings;

    local $/ = ''; # use "paragraph mode"

    while (<DATA>) {
        unless (s/^>(.+?)\s*\n//) { # find and remove the name
            warn "Skipping unknown format: $_";
            next;
        }

        my $name = $1; # save name
        tr/\n //d; # join multi-line sequences

        print "Name: $name, Sequence: $_\n"; # show off our progress
    }

    __DATA__
    >bob
    atcgactagcatcgatcg
    acacgtacgactagcac

    >fred
    actgactacgatcgaca
    acgcgcgatacggcat

    >bob
    atcgactagcatcgatcgacacgtacgactagcac

    >fred
    actgactacgatcgacaacgcgcgatacggcat
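
    For anyone following along, one possible way to tie the paragraph-mode
    parsing to the earlier formatting steps is sketched below. It is not
    the only way to do it. The helper name format_records is made up, the
    records are assumed to be separated by blank lines, and converting
    lowercase t to u is an assumption (the thread only ever mentions
    uppercase T).

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: parse blank-line-separated records (single- or
    # multi-line sequences) and return the formatted rows.
    sub format_records {
        my ($text, $len) = @_;
        open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
        local $/ = '';                                  # paragraph mode
        my @rows;
        while (my $record = <$fh>) {
            next unless $record =~ s/^>(.+?)\s*\n//;    # find and remove the name
            my $name = $1;
            $record =~ tr/\n //d;                       # join multi-line sequences
            $record =~ tr/Tt/Uu/;                       # DNA -> RNA, keeping case (assumption)
            my $missing = $len - length $record;
            $record .= '-' x $missing if $missing > 0;  # pad with dashes up to $len
            my $spaced = join '', map { " $_" } split //, $record;
            push @rows, $spaced . (' ' x 7) . $name;    # seven spaces, then the name
        }
        close $fh;
        return @rows;
    }

    my $input = ">bob\natcg\nac\n\n>fred\nactg\n";
    print "$_\n" for format_records($input, 8);
    ```

    With the short made-up input above, "bob" (two sequence lines) and
    "fred" (one line) both come out as rows of eight symbols followed by
    the name.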

    James Edward Gray II Guest
