Author |
Message |
DocDodg #1 / 24
|
Scaling a DNA string
Hi, I have what is a tricky biology problem, but an easy perl problem. Unfortunately, while I'm competent biologist, I'm a complete newbie at perl. As you may know, the DNA which codes for all the instructions of life is simply a string composed of the four characters A, C, G, or T. We have experimental evidence that suggests that a motif composed of four G's or four C's in a row might have implications for how some genes are regulated. But, it would matter how close these motifs are to the genes they are suppose to control. So, I grabbed 2000 characters from nearby an important gene and stored it in a string. Here is what part of that might look like: ...ACGACGTCCAGGGGGGTTGTTACGTCCCCAATCAGTCGGGGCTATTCAGTC... Next I replaced all the unimportant characters with at dash so we can see where these motifs are: ...----------GGGGGG---------CCCC--------GGGG----------... But as you might imagine, I'm having trouble printing out these 2000 character long strings in a readable format. What I need to do is scale the string down to a reasonable size. What I've tried to do is use a for loop with and index of 10 to replace all the instances or 10 dashes with a single dash. If a G or C is found in the 10 character region, a single G or C is printed. Here is how the scaled version looks: ...-GC-GG-... The problem is that someone looking at this scaled version would think that this string of DNA characters has four motifs in this short region. But it only has three. The last GGGG gets counted twice because it is crosses over two different 10 character long chunks. And, if I scale the string by looking for GGGG or CCCC in any 10 character long chunk, I will underestimated the number of motifs to this: ...-GC----... What I really want to scale the string to is this ...-GC-G--... ( or this ...-GC--G-..., small position shifts are inconsequential as long as the number of motifs is accurate). I hope this is clear and not to long winded. I figured the problem would be more fun if the background context was included. Here is the code I have written so far to scale the string: for ($i = 0; $i < length ($sequence); $i += 10) { if (substr ($sequence, $i, 10) =~ /G/i) { print "G" } elsif (substr ($sequence, $i, 10) =~/C/i) { print "C"; } else { print "-"; } Quote: }
TIA
|
Mon, 05 Apr 2004 03:52:57 GMT |
|
|
Wyzell #2 / 24
|
Scaling a DNA string
Quote: > Hi, > I have what is a tricky biology problem, but an easy perl problem. > Unfortunately, while I'm competent biologist, I'm a complete newbie at perl. > As you may know, the DNA which codes for all the instructions of life is > simply a string composed of the four characters A, C, G, or T. We have > experimental evidence that suggests that a motif composed of four G's or > four C's in a row might have implications for how some genes are regulated. > But, it would matter how close these motifs are to the genes they are > suppose to control. > So, I grabbed 2000 characters from nearby an important gene and stored it in > a string. Here is what part of that might look like: > ...ACGACGTCCAGGGGGGTTGTTACGTCCCCAATCAGTCGGGGCTATTCAGTC...
{snip} Quote: > What I really want to scale the string to is this ...-GC-G--... ( or this > ...-GC--G-..., small position shifts are inconsequential as long as the > number of motifs is accurate). > I hope this is clear and not to long winded. I figured the problem would be > more fun if the background context was included. Here is the code I have > written so far to scale the string:
Well, I am not really too clear on exactly what the rules are from the above, but as a starting point I tried the following: $sequence = 'ACGACGTCCAGGGGGGTTGTTACGTCCCCAATCAGTCGGGGCTATTCAGTC'; $sequence =~ s/([GC]){4}/\l$1/g; # Mark 4 char sets as a motif by a single character (lowercase) of itself $sequence =~ tr/GCAT/-/s; # turn everything else into a - and squash them down $sequence = uc($sequence); # Make the Motif uppercase again (for tidiness) print "$sequence\n"; # Print the new version This converts your sample string to: -G-C-G- But since I am not really too clear the exact rules that make up a motif (is 6 G's a motif or a motif and 2 G's ?, Are the character motifs supposed to be on a 4 character boundary? (as in a full set?)) Why did you choose 10 as your substring length, when the sets are made up from groups of 4? Anyway, this might be a starting point to help you further along the way. Wyzelli -- ($a,$b,$w,$t)=(' bottle',' of beer',' on the wall','Take one down, pass it around'); for(reverse(1..100)){$s=($_!=1)?'s':'';$c.="$_$a$s$b$w\n$_$a$s$b\n$t\n"; $_--;$s=($_!=1)?'s':'';$c.="$_$a$s$b$w\n\n";}print"$c*hic*";
|
Mon, 05 Apr 2004 04:39:51 GMT |
|
|
Dave Twee #3 / 24
|
Scaling a DNA string
Quote:
> What I really want to scale the string to is this ...-GC-G--... ( or this > ...-GC--G-..., small position shifts are inconsequential as long as the > number of motifs is accurate). > I hope this is clear and not to long winded. I figured the problem would be > more fun if the background context was included.
Very clear, and an interesting problem. Here's my approach, which shows the relative lengths of the gaps between motifs using a logarithmic scale. I'm assuming that any sequence of four *or more* Cs or Gs is a motif, but you can change the first argument of the split as needed. You can fiddle with the base of the logarithm (2 in this example) to get different scalings. -- Dave Tweed $sequence = 'ACGACGTCCAGGGGGGTTGTTACGTCCCCAATCAGTCGGGGCTATTCAGTC'; # split the sequence into motifs and non-motifs
# show the relative length of the non-motif using a logarithmic scale print '-' x int((log length $non)/log 2); # show the motif as a single character print chop $motif if defined $motif; Quote: }
print "\n";
|
Mon, 05 Apr 2004 06:29:34 GMT |
|
|
Jay Tilt #4 / 24
|
Scaling a DNA string
Quote:
>a motif composed of four G's or >four C's in a row might have implications for how some genes are regulated. >So, I grabbed 2000 characters from nearby an important gene and stored it in >a string. I replaced all the unimportant characters with at dash so we can see >where these motifs are: >...----------GGGGGG---------CCCC--------GGGG----------... >What I need to do is scale the >string down to a reasonable size. What I've tried to do is use a for loop >with and index of 10 to replace all the instances or 10 dashes with a single >dash. If a G or C is found in the 10 character region, a single G or C is >printed. >The last GGGG gets counted twice because it is crosses over >two different 10 character long chunks. >And, if I scale the string by looking for GGGG or CCCC in any 10 character >long chunk, I will underestimated the number of motifs >What I really want to scale the string to is this ...-GC-G--... ( or this >...-GC--G-..., small position shifts are inconsequential as long as the >number of motifs is accurate).
Summarizing the apparent key points, 1. Sequences of four G's or four C's in the string are important. 2. Condense sequences of -'s in the string. 3. Chewing on the string in 10-character chunks is not working well because a single motif may cross the boundary between chunks. How about crunching four or more G's or C's down to one, $sequence =~ s/([GC])\1{3,}/$1/g; then crunching multiple -'s down to one, so the motifs are visually separate. $sequence =~ s/(-)+/$1/g; What should happen where there are three or fewer C's or G's in a row? Keep them, discard them, or some other notation indicating "something there, but not what I'm really looking for"?
|
Mon, 05 Apr 2004 07:01:41 GMT |
|
|
Martien Verbrugg #5 / 24
|
Scaling a DNA string
On Wed, 17 Oct 2001 22:52:57 -0400,
Quote: > Hi, > I have what is a tricky biology problem, but an easy perl problem. > Unfortunately, while I'm competent biologist, I'm a complete newbie at perl.
Does that mean that these problems are approximately equally hard to you? :) Quote: > As you may know, the DNA which codes for all the instructions of life is > simply a string composed of the four characters A, C, G, or T. We have > experimental evidence that suggests that a motif composed of four G's or > four C's in a row might have implications for how some genes are regulated. > But, it would matter how close these motifs are to the genes they are > suppose to control.
Ok, so far I'm still here... :) Quote: > So, I grabbed 2000 characters from nearby an important gene and stored it in > a string. Here is what part of that might look like: > ...ACGACGTCCAGGGGGGTTGTTACGTCCCCAATCAGTCGGGGCTATTCAGTC... > Next I replaced all the unimportant characters with at dash so we can see > where these motifs are: > ...----------GGGGGG---------CCCC--------GGGG----------... > But as you might imagine, I'm having trouble printing out these 2000 > character long strings in a readable format. What I need to do is scale the > string down to a reasonable size. What I've tried to do is use a for loop > with and index of 10 to replace all the instances or 10 dashes with a single > dash. If a G or C is found in the 10 character region, a single G or C is > printed. Here is how the scaled version looks:
But what happens if both a G and a C are found in this 10 character sequence? And what happens if more than one sequence falls in one slice? "GGGG--CCCC" => G or C or something else? "GGGG-GGGG-" => Only one G? or something else? Quote: > ...-GC-GG-... > The problem is that someone looking at this scaled version would think that > this string of DNA characters has four motifs in this short region. But it > only has three. The last GGGG gets counted twice because it is crosses over > two different 10 character long chunks.
And indeed, what do you do with motifs that straddle the boundary? Or worse, what if you have a string like this: 0123456789012345678901234567890123456789 ---CCCC-GGGGG--CCCCCCC--GGGGGGGG--CCCC-- In the above, every 10-character sequence has at least one G and at least one C. 3 out of 5 motifs straddle a boundary. There are 5 motifs, but only 4 10-character sequences. How do you want that expressed? What about coming up with a totally different encoding scheme? One that simply tells you how many repeats it's seen? Your string would be encoded as -10G6-9C4-8G4-10 or maybe -10 G6 -9 C4 -8 G4 -10 Mine would be: -3 C4 -1 G5 -2 C7 -2 G8 -2 C4 -2 The more repetition there is, the more compression you get. You could even consider not putting the minuses in. 3 C4 1 G5 2 C7 2 G8 2 C4 2 All this is assuming that the length of these things, and the gap between them, is somehow important. If it isn't, You probably would have just compressed multiple characters into one (with tr///s, for example) I'm not sure whether that's readable enough, but at least it doesn't suffer from the problems that you've outlined, and the extra problems that I noted. Assuming that you already have the minuses in the string: $_ = "----------GGGGGG---------CCCC--------GGGG----------"; s/(([-CG])\2*)/ ($2 eq "-" ? "" : $2) . length($1) . " " /ge; Of course, there are other ways to do this, even when starting with the raw string, instead of the one with minuses. Something like the following might be a good start: while (m/(.*?)(([CG])\3{3,})/g) { print length($1), " ", length($2), $3, " "; Quote: }
print "\n"; This misses out on the last sequence, if it isn't a motif. I have to go home now, so I don't have time to think of a way to fix that hurdle, without resorting to $' (if that works at all). :) Maybe pos() with substr() could be used, although I think pos() will actually be undefined... Martien -- Martien Verbruggen | | That's not a lie, it's a Trading Post Australia Pty Ltd | terminological inexactitude. |
|
Mon, 05 Apr 2004 07:52:32 GMT |
|
|
DocDodg #6 / 24
|
Scaling a DNA string
Quote:
> >a motif composed of four G's or > >four C's in a row might have implications for how some genes are regulated. > >So, I grabbed 2000 characters from nearby an important gene and stored it in > >a string. I replaced all the unimportant characters with at dash so we can see > >where these motifs are: > >...----------GGGGGG---------CCCC--------GGGG----------... > >What I need to do is scale the > >string down to a reasonable size. What I've tried to do is use a for loop > >with and index of 10 to replace all the instances or 10 dashes with a single > >dash. If a G or C is found in the 10 character region, a single G or C is > >printed. > >The last GGGG gets counted twice because it is crosses over > >two different 10 character long chunks. > >And, if I scale the string by looking for GGGG or CCCC in any 10 character > >long chunk, I will underestimated the number of motifs > >What I really want to scale the string to is this ...-GC-G--... ( or this > >...-GC--G-..., small position shifts are inconsequential as long as the > >number of motifs is accurate). > Summarizing the apparent key points, > 1. Sequences of four G's or four C's in the string are important. > 2. Condense sequences of -'s in the string. > 3. Chewing on the string in 10-character chunks is not working well because > a single motif may cross the boundary between chunks. > How about crunching four or more G's or C's down to one, > $sequence =~ s/([GC])\1{3,}/$1/g; > then crunching multiple -'s down to one, so the motifs are visually > separate. > $sequence =~ s/(-)+/$1/g; > What should happen where there are three or fewer C's or G's in a row? Keep > them, discard them, or some other notation indicating "something there, but > not what I'm really looking for"?
Hi Jay, Thanks for your quick response. I'll answer your last question first. If there are three or fewer C's or G's in a row they can be ignore (replaced with -'s). The problem with crunching four or more G's or C's down to one and then crunching multiple -'s down to one is that the string would not be evenly scaled. What I need to know is not only how many motifs are in the sequence, but were they are in relation to each other and to the ends of the string. For example if a GGGG is at the far right of the string, I would want it to look like this when scaled: -----------------------G--- But if two GGGG motifs are in the middle of the string, it should look like this: ----------G-G-------------- Your answer does give me an idea though. If I scale by 2, i.e.. replace all --'s with a single dash and all GG or CC with a single G or C I won't have the problem of a motif crossing a boundary. I could then repeat this two more times, leaving a string of length 125. This might be small enough for my purposes. The only glitch now, is how to tell GGGGGG that scaled to G or GGGGGGGG (two motifs) that also scales to G. TIA
|
Mon, 05 Apr 2004 13:37:36 GMT |
|
|
David Wal #7 / 24
|
Scaling a DNA string
Quote: > On Wed, 17 Oct 2001 22:52:57 -0400,
>> As you may know, the DNA which codes for all the instructions of life >> is simply a string composed of the four characters A, C, G, or T. We >> have experimental evidence that suggests that a motif composed of >> four G's or four C's in a row might have implications for how some >> genes are regulated. But, it would matter how close these motifs are >> to the genes they are suppose to control. [snip] >> So, I grabbed 2000 characters from nearby an important gene and >> stored it in a string. Here is what part of that might look like: >> ...ACGACGTCCAGGGGGGTTGTTACGTCCCCAATCAGTCGGGGCTATTCAGTC... [snip] > What about coming up with a totally different encoding scheme? One > that simply tells you how many repeats it's seen? Your string would be > encoded as > -10G6-9C4-8G4-10 > or maybe > -10 G6 -9 C4 -8 G4 -10
This works, I guess, but it's visually cluttered and hard to tell what's happening at a glance. Why not just replace everything but the motifs of interest with spaces, and then print them with a ruler over them so you can get their distance from the beginning of the sequence? What I have in mind is the following. $seq contains the gene sequence of interest. my $bases_per_line = 50; my $num_rulers = $bases_per_line / 10; $seq =~ s/(G{4,})/lc $1/ge; $seq =~ s/(C{4,})/lc $1/ge; $seq =~ tr/AGCT/ /; $seq = uc $seq; my $ruler = '----+----'; my $count=0; while ( $seq =~ /(.{$bases_per_line})/g ) { print ' 'x5; my $ruler_start = $count; for (1 .. $num_rulers) { $count += 10; print $ruler, $_; } printf "\n%4d %s\n", $ruler_start, $1; Quote: }
Some sample output: ----+----1----+----2----+----3----+----4----+----5 0 GGGGG CCCCCC GGGG ----+----1----+----2----+----3----+----4----+----5 50 GGGG ----+----1----+----2----+----3----+----4----+----5 100 CCCC IMHO this is much easier to understand, although it uses a lot of screen/page space. (It's not an original idea with me; I saw something similar in some gene sequencing data when I was working as a research assistant for a biostats professor.) Oh, I almost forgot; the above requires a non-proportional font, so if for some strange reason you're reading this newsgroup in a proportional font (why?), it won't look right. -- David Wall
|
Mon, 05 Apr 2004 16:39:44 GMT |
|
|
David Wal #8 / 24
|
Scaling a DNA string
Quote: > my $bases_per_line = 50; > my $num_rulers = $bases_per_line / 10;
Oops. I guess it's obvious, but I only coded this for $bases_per_line to be multiples of 10. I'm sure it could be made more general, but 10's were easiest, and perhaps best. -- David Wall
|
Mon, 05 Apr 2004 16:56:48 GMT |
|
|
Jay Tilt #9 / 24
|
Scaling a DNA string
On Thu, 18 Oct 2001 08:37:36 -0400, "DocDodge" Quote:
>> Summarizing the apparent key points, >> 1. Sequences of four G's or four C's in the string are important. >> 2. Condense sequences of -'s in the string. >> 3. Chewing on the string in 10-character chunks is not working well >> because a single motif may cross the boundary between chunks. >> How about crunching four or more G's or C's down to one, >> $sequence =~ s/([GC])\1{3,}/$1/g; >> then crunching multiple -'s down to one, so the motifs are visually >> separate. >> $sequence =~ s/(-)+/$1/g; >> What should happen where there are three or fewer C's or G's in a row? >> Keep them, discard them, or some other notation indicating "something there, >> but not what I'm really looking for"? >Thanks for your quick response. I'll answer your last question first. If >there are three or fewer C's or G's in a row they can be ignore (replaced >with -'s). >The problem with crunching four or more G's or C's down to one and then >crunching multiple -'s down to one is that the string would not be evenly >scaled. What I need to know is not only how many motifs are in the >sequence, but were they are in relation to each other and to the ends of the >string. >Your answer does give me an idea though. If I scale by 2, i.e.. replace >all --'s with a single dash and all GG or CC with a single G or C I won't >have the problem of a motif crossing a boundary. I could then repeat this >two more times, leaving a string of length 125. This might be small enough >for my purposes. The only glitch now, is how to tell GGGGGG that scaled to >G or GGGGGGGG (two motifs) that also scales to G.
I see better now. I completely missed that the scaling factor should be constant. Adding a couple more things to the list of requirements above, 4. Sequences of three or fewer C's or G's can be ignored. 5. A single character in the output string will represent (roughly) X characters in the input. See how this grabs you. Change chains of four G/C to a single lowercase g/c, because those are what's important and to prevent pulverizing them in the next step $sequence =~ s/([GC])\1{3}/lc $1/eg; Because G/C in the remaining string only exist in chains of three or fewer, which can be ignored, change any remaining G/C into - $sequence =~ tr/GC/--/; Scale chains of -'s down by a constant factor, say 8. 1 to 8 become 1, 9 to 16 become 2, etc. my $scale = 8; # In case you want to play with it $sequence =~ s!(-+)!'-' x (.9 + length($1)/$scale)!eg; ^^^^^ Nuke the ".9 +" part if you want to scale 1 to 8 down to 0, 9 to 16 to 1, etc. And if it's important, make G/C uppercase again. $sequence = uc $sequence; It won't give an exact 1:X scaling. You can't tell exactly where things are in the original string by looking at the result, but you can at least gauge approximate location.
|
Mon, 05 Apr 2004 18:08:43 GMT |
|
|
Jay Tilt #10 / 24
|
Scaling a DNA string
Quote: >Scale chains of -'s down by a constant factor, say 8. >1 to 8 become 1, 9 to 16 become 2, etc. > my $scale = 8; # In case you want to play with it > $sequence =~ s!(-+)!'-' x (.9 + length($1)/$scale)!eg;
That's real damn ugly and silly. If I'd thought just a bit longer before posting I would have said $sequence =~ s/-{1,$scale}/-/g;
|
Mon, 05 Apr 2004 18:21:45 GMT |
|
|
David Wal #11 / 24
|
Scaling a DNA string
Quote: > On Wed, 17 Oct 2001 22:52:57 -0400,
>> As you may know, the DNA which codes for all the instructions of life >> is simply a string composed of the four characters A, C, G, or T. We >> have experimental evidence that suggests that a motif composed of >> four G's or four C's in a row might have implications for how some >> genes are regulated. But, it would matter how close these motifs are >> to the genes they are suppose to control.
[snip] Quote: >> So, I grabbed 2000 characters from nearby an important gene and >> stored it in a string. Here is what part of that might look like: >> ...ACGACGTCCAGGGGGGTTGTTACGTCCCCAATCAGTCGGGGCTATTCAGTC... >> Next I replaced all the unimportant characters with at dash so we can >> see where these motifs are: >> ...----------GGGGGG---------CCCC--------GGGG----------... >> But as you might imagine, I'm having trouble printing out these 2000 >> character long strings in a readable format. What I need to do is >> scale the string down to a reasonable size.
[snip] Quote: > What about coming up with a totally different encoding scheme? One > that simply tells you how many repeats it's seen? Your string would be > encoded as > -10G6-9C4-8G4-10 > or maybe > -10 G6 -9 C4 -8 G4 -10 > Mine would be: > -3 C4 -1 G5 -2 C7 -2 G8 -2 C4 -2 > The more repetition there is, the more compression you get. You could > even consider not putting the minuses in. > 3 C4 1 G5 2 C7 2 G8 2 C4 2
Now I feel silly. I like this idea better than what I did first, but with a slight modification: $sequence =~ s/(C{4,}|G{4,})/lc $1/ge; $sequence =~ tr/AGTC/-/; $sequence =~ s/(-+)/" ".(length $1)." "/ge; $sequence =~ s/^ +//; my $count = 1; for ( split /\s+/, $sequence ) { printf "%6d ", $count; my $position = $count + 1; if ( /\d+/ ) { $count += $_ - 1; } else { $count += length($_) + 1; printf "%20s at %5d \n", $_, $position; } Quote: }
printf "%6d\n", $count; __END__ Sample output: 1 520 gggg at 521 525 549 ccccc at 550 555 817 ccccc at 818 823 1040 cccc at 1041 1045 1105 ccccc at 1106 1111 1120 ggggg at 1121 1126 1187 gggg at 1188 1192 1232 ggggggg at 1233 1240 1270 gggg at 1271 1275 1383 gggg at 1384 1388 1463 cccc at 1464 1468 1484 ccccc at 1485 1490 1541 cccc at 1542 1546 1681 gggg at 1682 1686 1738 ggggg at 1739 1744 1977 cccc at 1978 1982 2000 From position 1 to 520, no motifs. Then we have gggg at 521 to 524, nothing interesting from 525 to 549, etc... The two leftmost columns aren't really necessary, but they were useful in detecting off-by-one errors in the code. I'm sure other tweaks can be made if this suits the OP's purpose, as well as cleaning up the code a bit. -- David Wall
|
Mon, 05 Apr 2004 21:52:05 GMT |
|
|
Martien Verbrugg #12 / 24
|
Scaling a DNA string
On Thu, 18 Oct 2001 20:52:05 -0000,
Quote:
>> The more repetition there is, the more compression you get. You could >> even consider not putting the minuses in. >> 3 C4 1 G5 2 C7 2 G8 2 C4 2 > Now I feel silly. I like this idea better than what I did first, but with > a slight modification:
Looks good too. I just assumed that the original poster's main goal was to get some decent compression, so that the data woiuldn't look as large anymore. Your's is much more readable than mine, of course :) I think it's probably time for the OP to speak up and make sure that we don't go off on wild tangents that are unrelated to what is wanted. There are quite a few ideas in this thread that they could use. Witha bit of picking and choosing they should be able to get the better idea. From a clean readable reporting point of view, I think your solution wins hands down :) Martien -- Martien Verbruggen | | Trading Post Australia Pty Ltd | Curiouser and curiouser, said Alice. |
|
Tue, 06 Apr 2004 00:33:52 GMT |
|
|
DocDodg #13 / 24
|
Scaling a DNA string
Hi, I've finally arrived back were I can post to this newsgroup, and I see that people have been posting great responses all day long. The answer to my problem might lie in this thread; it will take me a while to wrap my mind around some of the more complicated perl. But her is a clarification of the biology of the problem. Martien is right; I am just looking for some decent compression so that the data is more visually manageable. What I didn't tell you is that I have hundreds of these strings, that correspond to hundreds of genes and that I'm going to need to print this out on ordinary letter size paper. Here is what that might look like when done (requires a non-proportional font to look right): --------G--------------------G---------- gene A ----------------CG--G------------------- gene B ---------------G-G-C--C----------------- gene C -----------G---------------------------- gene D ---------------------------------------- gene E --C------------------------------------- gene F ---------------GGG--C--G------------G--- gene G ---------------------------------------- gene H -----------C---------------------------- gene I In this example, each string was derived from 2000 characters, so each dash or letter represents 50 characters. A dash means that in those 50 characters, no motifs are found. A "G" or "C" means that in those 50 characters, one or more motifs were found. I know the resolution of exactly were the motifs lie is not so good, but when I look at this I can gleam a lot of information. You see, I know what these genes do in most cases. And if I know that gene B, gene C, and gene G are all involved in cancer, that cluster of motifs at about the 1000 character mark has meaning. This pattern would not be likely to occur by chance, and so I will perform some biological experiments to see if the GGGG or CCCC are found near other cancer genes, or what the motif can do to ordinary genes. Martin asked the clever question: "GGGG--CCCC" => G or C or something else? "GGGG-GGGG-" => Only one G? or something else? To the latter, initially I called a single GGGG in a region "g" while more than one motif garnered a "G", but when I started to have trouble scaling I eliminated that complications from my code. The answer to the former is the same. It would be usefully to call a region with both GGGG and CCCC in it some other letter (S perhaps) but I'm still taking baby steps He also asked what I want to do with motifs that straddle the boundary. The answer is, put the mark on one side or the other but not both. Which side does not really matter, in the total 2000 characters I won't be off by that much. Some wonder what exactly defines a motif. Is it GGGG or GGGGG and is GGGGGGGG one motif or two. The answer is flexible. Eight G's in a row is certainly two motifs; four G's is one motif. Between 4 and 8 could be one or two motifs, whatever is easier to implement. I've been calling it one motif for now. Finally, sorry for not having a stitch of perl code in this post. I'll be up much of the night working on this problem, so if the interest is still there tomorrow perhaps I can say more.
|
Tue, 06 Apr 2004 03:09:35 GMT |
|
|
Martien Verbrugg #14 / 24
|
Scaling a DNA string
On Thu, 18 Oct 2001 22:09:35 -0400,
Quote: > Hi, > I've finally arrived back were I can post to this newsgroup, and I see that > people have been posting great responses all day long. The answer to my > problem might lie in this thread; it will take me a while to wrap my mind > around some of the more complicated perl. But her is a clarification of the > biology of the problem.
Ok, now that a few of the questions have been settled, here's a small subroutine that (I think) does what you want. It seems rather longish, but most of it is comments. Before relying on its correctness, let it float around this group for a few days. I suspect that I made an error here or there, or that someone has a faster way to do what I do with my two substitutions. Also, it is possible that I still have misunderstood things. #!/usr/local/bin/perl -w use strict; # For this example, data file contents should be # CGACCACCGGCTATGGTTACCCGTGCGTCCATCC Name of gene while (<>) { my ($gene, $name) = split ' ', $_, 2; my $pgene = parse_gene($gene, 40); print "$pgene $name"; Quote: }
# parse_gene($gene, $sample_len) # # Break up a gene in $sample_len blocks. For each block, output # # - if nothing interesting was found # c or g if one C or G motif was found # C or G if more than one C(G) motif was found # * if at least one C and one G motif were found. # # A motif is defined as a succession of exactly 4 C or G. # # Boundary conditions: # # If a motif straddles a boundary between two samples, the right most # sample will get this motif counted. sub parse_gene { my $gene = shift; my $sample_len = shift || 50; my $pgene = ""; # Split up the gene in blocks of specified length # # If length($gene) is not a multiple of $sample_len, the last # sample will be shorter than the others.
{ # Count the number of C and G motifs entirely within the # string my $ng = () = $sample =~ /G{4}/g; my $nc = () = $sample =~ /C{4}/g; # Find the largest number of G or C at the end of the sample # that have not already been counted. # First remove everything which isn't a block of G or C at # the end, and reset $sample if there is no match. # Then remove all blocks that have already been counted # This could be done in one go, but the backtracking slows # things down unnecessarily. two steps is much faster. $sample =~ s/.*?(G+|C+)\z/$1/ or $sample = ""; $sample =~ s/(G{4}|C{4})+//; if ($sample) { # We have found a possible start of a motif at the end of # the current string. We simply prepend it to the next # sample so it can be counted there. We only do this, if # there is a next sample.
} $pgene .= ($ng && $nc) ? "*" : ($ng > 1 ) ? "G" : ($ng == 1 ) ? "g" : ($nc > 1 ) ? "C" : ($nc == 1 ) ? "c" : "-" ; } return $pgene; Quote: }
__END__ I used this program to generate 'test genes': #!/usr/local/bin/perl -w use strict; # gen_gene($len, $sample_len) # # Create a "gene" of length $len (default 2000) with samples of length # $sample_len (default 6). Both arguments are optional. # Repeatedly pick a random $sample_len section from a template. # sub gen_gene { # The srand is here to create reproducibility. BEGIN { srand 123456 } my $template = "ACCACCGGTCCATCAGGGTGGTTACCGTGCCCAATCCGTCTGGGCTATTCAGTC"; my $len = shift || 2000; my $sample_len = shift || 6; my $out = ""; while (length $out < $len) { my $off = int(rand length($template) - $sample_len); $out .= substr $template, $off, $sample_len; } return substr $out, 0, $len; Quote: }
print gen_gene(2000), " gene $_\n" for ("A".."F"); __END__ output of: $ ./gen_gene.pl | dna.pl --g--gc---------g-ggc------g------c---*------g---- gene A ---------*cgcg-c-----g--c--c*gg-----------g----c-- gene B -c--c------c------c-----g-----g---cggcg*---c--g-c- gene C -----g------g----c-----*-----gcgc-----g-----cgG--- gene D C--g--C------gg--gg-------g-g---G-g---cc-----*---- gene E -g-c--c-g-c----g------g--Cgc-----gg----------ggcc- gene F I suspect that my test genes don't really represent nature very well. Your example didn't contain quite as many hits as I'm seeing here :) If you're on a system with a different rand() than mine, your genes will vary :) I don't want to post the full fake genes I used, since that would be 6 x 2000 bytes or so. Bit too much. Martien -- Martien Verbruggen | | Unix is user friendly. It's just Trading Post Australia Pty Ltd | selective about its friends. |
|
Tue, 06 Apr 2004 07:09:01 GMT |
|
|
Benjamin Goldber #15 / 24
|
Scaling a DNA string
Quote:
> Hi, > I've finally arrived back were I can post to this newsgroup, and I see > that people have been posting great responses all day long. The > answer to my problem might lie in this thread; it will take me a while > to wrap my mind around some of the more complicated perl. But her is > a clarification of the biology of the problem. > Martien is right; I am just looking for some decent compression so > that the data is more visually manageable. What I didn't tell you is > that I have hundreds of these strings, that correspond to hundreds of > genes and that I'm going to need to print this out on ordinary letter > size paper. Here is what that might look like when done (requires a > non-proportional font to look right): > --------G--------------------G---------- gene A > ----------------CG--G------------------- gene B > ---------------G-G-C--C----------------- gene C > -----------G---------------------------- gene D > ---------------------------------------- gene E > --C------------------------------------- gene F > ---------------GGG--C--G------------G--- gene G > ---------------------------------------- gene H > -----------C---------------------------- gene I > In this example, each string was derived from 2000 characters, so each > dash or letter represents 50 characters. A dash means that in those > 50 characters, no motifs are found. A "G" or "C" means that in those > 50 characters, one or more motifs were found.
[snip] Quote: > Martin asked the clever question: > "GGGG--CCCC" => G or C or something else? > "GGGG-GGGG-" => Only one G? or something else? > To the latter, initially I called a single GGGG in a region "g" while > more than one motif garnered a "G", but when I started to have trouble > scaling I eliminated that complications from my code. The answer to > the former is the same. It would be usefully to call a region with > both GGGG and CCCC in it some other letter (S perhaps) > He also asked what I want to do with motifs that straddle the > boundary. The answer is, put the mark on one side or the other but > not both. Which side does not really matter, in the total 2000 > characters I won't be off by that much. > Some wonder what exactly defines a motif. Is it GGGG or GGGGG and is > GGGGGGGG one motif or two. The answer is flexible. Eight G's in a > row is certainly two motifs; four G's is one motif. Between 4 and 8
When compressing 2000 characters to, say, 40, the odds are that a pair of adjacent motifs of the same type will map to the same output slot. This sortof makes it a non-issue. Quote: > could be one or two motifs, whatever is easier to implement. I've > been calling it one motif for now.
#!/usr/local/bin/perl5.6 -w $_ = "ACGACGTCCAGGGGGGTTGTTACGTCCCCAATCAGTCGGGGCTATTCAGTC"; my $scale = 4; my $motifs = ''; vec( $motifs, $+[0]/$scale, 2 ) |= 1 while m/G{4}/g; vec( $motifs, $+[0]/$scale, 2 ) |= 2 while m/C{4}/g;
__END__ This [tested!] code prints: ---G---C--G-- For big strings, set $scale to something bigger, like 50. Or, if you want output more "cleverly" scaled, change it to: my $scale = length($_) / 72; # where 72 is the width of your screen. -- "What does stupid old man mean pidgin talk? Shampoo does not talk like a bird."
|
Tue, 06 Apr 2004 08:16:37 GMT |
|
|
Page 1 of 2
|
[ 24 post ] |
|
Go to page:
[1]
[2] |
|