Scaling a DNA string 
Author Message
 Scaling a DNA string

Hi,

I have what is a tricky biology problem, but an easy perl problem.
Unfortunately, while I'm competent biologist, I'm a complete newbie at perl.

As you may know, the DNA which codes for all the instructions of life is
simply a string composed of the four characters A, C, G, or T.  We have
experimental evidence that suggests that a motif composed of four G's or
four C's in a row might have implications for how some genes are regulated.
But, it would matter how close these motifs are to the genes they are
suppose to control.

So, I grabbed 2000 characters from nearby an important gene and stored it in
a string.  Here is what part of that might look like:

...ACGACGTCCAGGGGGGTTGTTACGTCCCCAATCAGTCGGGGCTATTCAGTC...

Next I replaced all the unimportant characters with at dash so we can see
where these motifs are:

...----------GGGGGG---------CCCC--------GGGG----------...

But as you might imagine, I'm having trouble printing out these 2000
character long strings in a readable format.  What I need to do is scale the
string down to a reasonable size.  What I've tried to do is use a for loop
with and index of 10 to replace all the instances or 10 dashes with a single
dash.  If a G or C is found in the 10 character region, a single G or C is
printed.  Here is how the scaled version looks:

...-GC-GG-...

The problem is that someone looking at this scaled version would think that
this string of DNA characters has four motifs in this short region.  But it
only has three.  The last GGGG gets counted twice because it is crosses over
two different 10 character long chunks.

And, if I scale the string by looking for GGGG or CCCC in any 10 character
long chunk, I will underestimated the number of motifs to this:

...-GC----...

What I really want to scale the string to is this ...-GC-G--... ( or this
...-GC--G-..., small position shifts are inconsequential as long as the
number of motifs is accurate).

I hope this is clear and not to long winded.  I figured the problem would be
more fun if the background context was included.  Here is the code I have
written so far to scale the string:

for ($i = 0; $i < length ($sequence); $i += 10) {
    if (substr ($sequence, $i, 10) =~ /G/i) {
        print "G"
    } elsif (substr ($sequence, $i, 10) =~/C/i) {
        print "C";
    } else {
        print "-";
    }

Quote:
}

TIA


Mon, 05 Apr 2004 03:52:57 GMT  
 Scaling a DNA string

Quote:
> Hi,

> I have what is a tricky biology problem, but an easy perl problem.
> Unfortunately, while I'm competent biologist, I'm a complete newbie at
perl.

> As you may know, the DNA which codes for all the instructions of life is
> simply a string composed of the four characters A, C, G, or T.  We have
> experimental evidence that suggests that a motif composed of four G's or
> four C's in a row might have implications for how some genes are
regulated.
> But, it would matter how close these motifs are to the genes they are
> suppose to control.

> So, I grabbed 2000 characters from nearby an important gene and stored it
in
> a string.  Here is what part of that might look like:

> ...ACGACGTCCAGGGGGGTTGTTACGTCCCCAATCAGTCGGGGCTATTCAGTC...

{snip}

Quote:
> What I really want to scale the string to is this ...-GC-G--... ( or this
> ...-GC--G-..., small position shifts are inconsequential as long as the
> number of motifs is accurate).

> I hope this is clear and not to long winded.  I figured the problem would
be
> more fun if the background context was included.  Here is the code I have
> written so far to scale the string:

Well, I am not really too clear on exactly what the rules are from the
above, but as a starting point I tried the following:

$sequence = 'ACGACGTCCAGGGGGGTTGTTACGTCCCCAATCAGTCGGGGCTATTCAGTC';
$sequence =~ s/([GC]){4}/\l$1/g; # Mark 4 char sets as a motif by a single
character (lowercase) of itself
$sequence =~ tr/GCAT/-/s; # turn everything else into a - and squash them
down
$sequence = uc($sequence); # Make the Motif uppercase again (for tidiness)
print "$sequence\n"; # Print the new version

This converts your sample string to:

-G-C-G-

But since I am not really too clear the exact rules that make up a motif (is
6 G's a motif or a motif and 2 G's ?, Are the character motifs supposed to
be on a 4 character boundary? (as in a full set?))

Why did you choose 10 as your substring length, when the sets are made up
from groups of 4?

Anyway, this might be a starting point to help you further along the way.

Wyzelli
--
($a,$b,$w,$t)=(' bottle',' of beer',' on the wall','Take one down, pass it
around');
for(reverse(1..100)){$s=($_!=1)?'s':'';$c.="$_$a$s$b$w\n$_$a$s$b\n$t\n";
$_--;$s=($_!=1)?'s':'';$c.="$_$a$s$b$w\n\n";}print"$c*hic*";



Mon, 05 Apr 2004 04:39:51 GMT  
 Scaling a DNA string

Quote:

> What I really want to scale the string to is this ...-GC-G--... ( or this
> ...-GC--G-..., small position shifts are inconsequential as long as the
> number of motifs is accurate).

> I hope this is clear and not to long winded.  I figured the problem would be
> more fun if the background context was included.

Very clear, and an interesting problem.

Here's my approach, which shows the relative lengths of the gaps between
motifs using a logarithmic scale. I'm assuming that any sequence of four
*or more* Cs or Gs is a motif, but you can change the first argument of
the split as needed.

You can fiddle with the base of the logarithm (2 in this example) to get
different scalings.

-- Dave Tweed

$sequence = 'ACGACGTCCAGGGGGGTTGTTACGTCCCCAATCAGTCGGGGCTATTCAGTC';

# split the sequence into motifs and non-motifs




    # show the relative length of the non-motif using a logarithmic scale
    print '-' x int((log length $non)/log 2);
    # show the motif as a single character
    print chop $motif if defined $motif;

Quote:
}

print "\n";


Mon, 05 Apr 2004 06:29:34 GMT  
 Scaling a DNA string

Quote:

>a motif composed of four G's or
>four C's in a row might have implications for how some genes are regulated.

>So, I grabbed 2000 characters from nearby an important gene and stored it in
>a string.  I replaced all the unimportant characters with at dash so we can see
>where these motifs are:

>...----------GGGGGG---------CCCC--------GGGG----------...

>What I need to do is scale the
>string down to a reasonable size.  What I've tried to do is use a for loop
>with and index of 10 to replace all the instances or 10 dashes with a single
>dash.  If a G or C is found in the 10 character region, a single G or C is
>printed.  

>The last GGGG gets counted twice because it is crosses over
>two different 10 character long chunks.
>And, if I scale the string by looking for GGGG or CCCC in any 10 character
>long chunk, I will underestimated the number of motifs

>What I really want to scale the string to is this ...-GC-G--... ( or this
>...-GC--G-..., small position shifts are inconsequential as long as the
>number of motifs is accurate).

Summarizing the apparent key points,
1. Sequences of four G's or four C's in the string are important.
2. Condense sequences of -'s in the string.
3. Chewing on the string in 10-character chunks is not working well because
a single motif may cross the boundary between chunks.

How about crunching four or more G's or C's down to one,
  $sequence =~ s/([GC])\1{3,}/$1/g;
then crunching multiple -'s down to one, so the motifs are visually
separate.
  $sequence =~ s/(-)+/$1/g;

What should happen where there are three or fewer C's or G's in a row?  Keep
them, discard them, or some other notation indicating "something there, but
not what I'm really looking for"?



Mon, 05 Apr 2004 07:01:41 GMT  
 Scaling a DNA string
On Wed, 17 Oct 2001 22:52:57 -0400,

Quote:
> Hi,

> I have what is a tricky biology problem, but an easy perl problem.
> Unfortunately, while I'm competent biologist, I'm a complete newbie at perl.

Does that mean that these problems are approximately equally hard to
you? :)

Quote:
> As you may know, the DNA which codes for all the instructions of life is
> simply a string composed of the four characters A, C, G, or T.  We have
> experimental evidence that suggests that a motif composed of four G's or
> four C's in a row might have implications for how some genes are regulated.
> But, it would matter how close these motifs are to the genes they are
> suppose to control.

Ok, so far I'm still here... :)

Quote:
> So, I grabbed 2000 characters from nearby an important gene and stored it in
> a string.  Here is what part of that might look like:

> ...ACGACGTCCAGGGGGGTTGTTACGTCCCCAATCAGTCGGGGCTATTCAGTC...

> Next I replaced all the unimportant characters with at dash so we can see
> where these motifs are:

> ...----------GGGGGG---------CCCC--------GGGG----------...

> But as you might imagine, I'm having trouble printing out these 2000
> character long strings in a readable format.  What I need to do is scale the
> string down to a reasonable size.  What I've tried to do is use a for loop
> with and index of 10 to replace all the instances or 10 dashes with a single
> dash.  If a G or C is found in the 10 character region, a single G or C is
> printed.  Here is how the scaled version looks:

But what happens if both a G and a C are found in this 10 character
sequence? And what happens if more than one sequence falls in one
slice?

"GGGG--CCCC" => G or C or something else?

"GGGG-GGGG-" => Only one G? or something else?

Quote:
> ...-GC-GG-...

> The problem is that someone looking at this scaled version would think that
> this string of DNA characters has four motifs in this short region.  But it
> only has three.  The last GGGG gets counted twice because it is crosses over
> two different 10 character long chunks.

And indeed, what do you do with motifs that straddle the boundary? Or
worse, what if you have a string like this:

0123456789012345678901234567890123456789
---CCCC-GGGGG--CCCCCCC--GGGGGGGG--CCCC--

In the above, every 10-character sequence has at least one G and at
least one C. 3 out of 5 motifs straddle a boundary. There are 5
motifs, but only 4 10-character sequences.  How do you want that
expressed?

What about coming up with a totally different encoding scheme? One
that simply tells you how many repeats it's seen? Your string would be
encoded as

-10G6-9C4-8G4-10

or maybe

-10 G6 -9 C4 -8 G4 -10

Mine would be:

-3 C4 -1 G5 -2 C7 -2 G8 -2 C4 -2

The more repetition there is, the more compression you get. You could
even consider not putting the minuses in.

3 C4 1 G5 2 C7 2 G8 2 C4 2

All this is assuming that the length of these things, and the gap
between them, is somehow important. If it isn't, You probably would
have just compressed multiple characters into one (with tr///s, for
example)

I'm not sure whether that's readable enough, but at least it doesn't
suffer from the problems that you've outlined, and the extra problems
that I noted.

Assuming that you already have the minuses in the string:

$_ = "----------GGGGGG---------CCCC--------GGGG----------";
s/(([-CG])\2*)/ ($2 eq "-" ? "" : $2) . length($1) . " " /ge;

Of course, there are other ways to do this, even when starting with
the raw string, instead of the one with minuses.

Something like the following might be a good start:

while (m/(.*?)(([CG])\3{3,})/g)
{
    print length($1), " ", length($2), $3, " ";

Quote:
}

print "\n";

This misses out on the last sequence, if it isn't a motif. I have to
go home now, so I don't have time to think of a way to fix that
hurdle, without resorting to $' (if that works at all).  :)

Maybe pos() with substr() could be used, although I think pos() will
actually be undefined...

Martien
--
Martien Verbruggen              |
                                | That's not a lie, it's a
Trading Post Australia Pty Ltd  | terminological inexactitude.
                                |



Mon, 05 Apr 2004 07:52:32 GMT  
 Scaling a DNA string

Quote:


> >a motif composed of four G's or
> >four C's in a row might have implications for how some genes are
regulated.

> >So, I grabbed 2000 characters from nearby an important gene and stored it
in
> >a string.  I replaced all the unimportant characters with at dash so we
can see
> >where these motifs are:

> >...----------GGGGGG---------CCCC--------GGGG----------...

> >What I need to do is scale the
> >string down to a reasonable size.  What I've tried to do is use a for
loop
> >with and index of 10 to replace all the instances or 10 dashes with a
single
> >dash.  If a G or C is found in the 10 character region, a single G or C
is
> >printed.

> >The last GGGG gets counted twice because it is crosses over
> >two different 10 character long chunks.
> >And, if I scale the string by looking for GGGG or CCCC in any 10
character
> >long chunk, I will underestimated the number of motifs

> >What I really want to scale the string to is this ...-GC-G--... ( or this
> >...-GC--G-..., small position shifts are inconsequential as long as the
> >number of motifs is accurate).

> Summarizing the apparent key points,
> 1. Sequences of four G's or four C's in the string are important.
> 2. Condense sequences of -'s in the string.
> 3. Chewing on the string in 10-character chunks is not working well
because
> a single motif may cross the boundary between chunks.

> How about crunching four or more G's or C's down to one,
>   $sequence =~ s/([GC])\1{3,}/$1/g;
> then crunching multiple -'s down to one, so the motifs are visually
> separate.
>   $sequence =~ s/(-)+/$1/g;

> What should happen where there are three or fewer C's or G's in a row?
Keep
> them, discard them, or some other notation indicating "something there,
but
> not what I'm really looking for"?

Hi Jay,

Thanks for your quick response.  I'll answer your last question first.  If
there are three or fewer C's or G's in a row they can be ignore (replaced
with -'s).

The problem with crunching four or more G's or C's down to one and then
crunching multiple -'s down to one is that the string would not be evenly
scaled.  What I need to know is not only how many motifs are in the
sequence, but were they are in relation to each other and to the ends of the
string.  For example if a GGGG is at the far right of the string, I would
want it to look like this when scaled:

-----------------------G---

But if two GGGG motifs are in the middle of the string, it should look like
this:

----------G-G--------------

Your answer does give me an idea though.  If I scale by 2, i.e.. replace
all --'s with a single dash and all GG or CC with a single G or C I won't
have the problem of a motif crossing a boundary.  I could then repeat this
two more times, leaving a string of length 125.  This might be small enough
for my purposes.  The only glitch now, is how to tell GGGGGG that scaled to
G or GGGGGGGG (two motifs) that also scales to G.

TIA



Mon, 05 Apr 2004 13:37:36 GMT  
 Scaling a DNA string

Quote:
> On Wed, 17 Oct 2001 22:52:57 -0400,

>> As you may know, the DNA which codes for all the instructions of life
>> is simply a string composed of the four characters A, C, G, or T.  We
>> have experimental evidence that suggests that a motif composed of
>> four G's or four C's in a row might have implications for how some
>> genes are regulated. But, it would matter how close these motifs are
>> to the genes they are suppose to control.
[snip]
>> So, I grabbed 2000 characters from nearby an important gene and
>> stored it in a string.  Here is what part of that might look like:

>> ...ACGACGTCCAGGGGGGTTGTTACGTCCCCAATCAGTCGGGGCTATTCAGTC...

[snip]
> What about coming up with a totally different encoding scheme? One
> that simply tells you how many repeats it's seen? Your string would be
> encoded as

> -10G6-9C4-8G4-10

> or maybe

> -10 G6 -9 C4 -8 G4 -10

This works, I guess, but it's visually cluttered and hard to tell what's
happening at a glance.  Why not just replace everything but the motifs of
interest with spaces, and then print them with a ruler over them so you
can get their distance from the beginning of the sequence?  What I have in
mind is the following.  $seq contains the gene sequence of interest.

my $bases_per_line = 50;
my $num_rulers = $bases_per_line / 10;

$seq =~ s/(G{4,})/lc $1/ge;
$seq =~ s/(C{4,})/lc $1/ge;
$seq =~ tr/AGCT/ /;
$seq = uc $seq;

my $ruler = '----+----';
my $count=0;
while ( $seq =~ /(.{$bases_per_line})/g ) {
    print ' 'x5;
    my $ruler_start = $count;
    for (1 .. $num_rulers) {
        $count += 10;
        print $ruler, $_;
    }
    printf "\n%4d %s\n", $ruler_start, $1;

Quote:
}

Some sample output:

     ----+----1----+----2----+----3----+----4----+----5
   0         GGGGG            CCCCCC GGGG              
     ----+----1----+----2----+----3----+----4----+----5
  50                      GGGG                        
     ----+----1----+----2----+----3----+----4----+----5
 100        CCCC                                      

IMHO this is much easier to understand, although it uses a lot of
screen/page space.  (It's not an original idea with me;  I saw something
similar in some gene sequencing data when I was working as a research
assistant for a biostats professor.)

Oh, I almost forgot;  the above requires a non-proportional font, so if
for some strange reason you're reading this newsgroup in a proportional
font (why?), it won't look right.

--
David Wall



Mon, 05 Apr 2004 16:39:44 GMT  
 Scaling a DNA string

Quote:

> my $bases_per_line = 50;
> my $num_rulers = $bases_per_line / 10;

Oops.  I guess it's obvious, but I only coded this for $bases_per_line to
be multiples of 10.  I'm sure it could be made more general, but 10's were
easiest, and perhaps best.

--
David Wall



Mon, 05 Apr 2004 16:56:48 GMT  
 Scaling a DNA string
On Thu, 18 Oct 2001 08:37:36 -0400, "DocDodge"

Quote:



>> Summarizing the apparent key points,
>> 1. Sequences of four G's or four C's in the string are important.
>> 2. Condense sequences of -'s in the string.
>> 3. Chewing on the string in 10-character chunks is not working well
>> because a single motif may cross the boundary between chunks.

>> How about crunching four or more G's or C's down to one,
>>   $sequence =~ s/([GC])\1{3,}/$1/g;
>> then crunching multiple -'s down to one, so the motifs are visually
>> separate.
>>   $sequence =~ s/(-)+/$1/g;

>> What should happen where there are three or fewer C's or G's in a row?
>> Keep them, discard them, or some other notation indicating "something there,
>> but not what I'm really looking for"?

>Thanks for your quick response.  I'll answer your last question first.  If
>there are three or fewer C's or G's in a row they can be ignore (replaced
>with -'s).

>The problem with crunching four or more G's or C's down to one and then
>crunching multiple -'s down to one is that the string would not be evenly
>scaled.  What I need to know is not only how many motifs are in the
>sequence, but were they are in relation to each other and to the ends of the
>string.  
>Your answer does give me an idea though.  If I scale by 2, i.e.. replace
>all --'s with a single dash and all GG or CC with a single G or C I won't
>have the problem of a motif crossing a boundary.  I could then repeat this
>two more times, leaving a string of length 125.  This might be small enough
>for my purposes.  The only glitch now, is how to tell GGGGGG that scaled to
>G or GGGGGGGG (two motifs) that also scales to G.

I see better now.  I completely missed that the scaling factor
should be constant.  Adding a couple more things to the list of
requirements above,
4. Sequences of three or fewer C's or G's can be ignored.
5. A single character in the output string will represent
(roughly) X characters in the input.

See how this grabs you.

Change chains of four G/C to a single lowercase g/c, because
those are what's important and to prevent pulverizing them in the
next step
  $sequence =~ s/([GC])\1{3}/lc $1/eg;

Because G/C in the remaining string only exist in chains of three
or fewer, which can be ignored, change any remaining G/C into -
  $sequence =~ tr/GC/--/;

Scale chains of -'s down by a constant factor, say 8.
1 to 8 become 1, 9 to 16 become 2, etc.
  my $scale = 8; # In case you want to play with it
  $sequence =~ s!(-+)!'-' x (.9 + length($1)/$scale)!eg;
                             ^^^^^
Nuke the ".9 +" part if you want to scale 1 to 8 down to 0, 9 to
16 to 1, etc.

And if it's important, make G/C uppercase again.
  $sequence = uc $sequence;

It won't give an exact 1:X scaling.  You can't tell exactly where
things are in the original string by looking at the result, but
you can at least gauge approximate location.



Mon, 05 Apr 2004 18:08:43 GMT  
 Scaling a DNA string


Quote:
>Scale chains of -'s down by a constant factor, say 8.
>1 to 8 become 1, 9 to 16 become 2, etc.
>  my $scale = 8; # In case you want to play with it
>  $sequence =~ s!(-+)!'-' x (.9 + length($1)/$scale)!eg;

That's real damn ugly and silly.  If I'd thought just a bit
longer before posting I would have said

   $sequence =~ s/-{1,$scale}/-/g;



Mon, 05 Apr 2004 18:21:45 GMT  
 Scaling a DNA string

Quote:
> On Wed, 17 Oct 2001 22:52:57 -0400,

>> As you may know, the DNA which codes for all the instructions of life
>> is simply a string composed of the four characters A, C, G, or T.  We
>> have experimental evidence that suggests that a motif composed of
>> four G's or four C's in a row might have implications for how some
>> genes are regulated. But, it would matter how close these motifs are
>> to the genes they are suppose to control.

[snip]

Quote:
>> So, I grabbed 2000 characters from nearby an important gene and
>> stored it in a string.  Here is what part of that might look like:

>> ...ACGACGTCCAGGGGGGTTGTTACGTCCCCAATCAGTCGGGGCTATTCAGTC...

>> Next I replaced all the unimportant characters with at dash so we can
>> see where these motifs are:

>> ...----------GGGGGG---------CCCC--------GGGG----------...

>> But as you might imagine, I'm having trouble printing out these 2000
>> character long strings in a readable format.  What I need to do is
>> scale the string down to a reasonable size.

[snip]

Quote:

> What about coming up with a totally different encoding scheme? One
> that simply tells you how many repeats it's seen? Your string would be
> encoded as

> -10G6-9C4-8G4-10

> or maybe

> -10 G6 -9 C4 -8 G4 -10

> Mine would be:

> -3 C4 -1 G5 -2 C7 -2 G8 -2 C4 -2

> The more repetition there is, the more compression you get. You could
> even consider not putting the minuses in.

> 3 C4 1 G5 2 C7 2 G8 2 C4 2

Now I feel silly.  I like this idea better than what I did first, but with
a slight modification:

$sequence =~ s/(C{4,}|G{4,})/lc $1/ge;
$sequence =~ tr/AGTC/-/;
$sequence =~ s/(-+)/" ".(length $1)." "/ge;
$sequence =~ s/^ +//;

my $count = 1;
for ( split /\s+/, $sequence ) {
    printf "%6d ", $count;
    my $position = $count + 1;
    if ( /\d+/ ) {
        $count += $_ - 1;
    }
    else         {
        $count += length($_) + 1;
        printf "%20s at %5d \n", $_, $position;
    }

Quote:
}

printf "%6d\n", $count;
__END__

Sample output:

     1    520                 gggg at   521
   525    549                ccccc at   550
   555    817                ccccc at   818
   823   1040                 cccc at  1041
  1045   1105                ccccc at  1106
  1111   1120                ggggg at  1121
  1126   1187                 gggg at  1188
  1192   1232              ggggggg at  1233
  1240   1270                 gggg at  1271
  1275   1383                 gggg at  1384
  1388   1463                 cccc at  1464
  1468   1484                ccccc at  1485
  1490   1541                 cccc at  1542
  1546   1681                 gggg at  1682
  1686   1738                ggggg at  1739
  1744   1977                 cccc at  1978
  1982   2000

From position 1 to 520, no motifs.  Then we have gggg at 521 to 524,
nothing interesting from 525 to 549, etc...  The two leftmost columns
aren't really necessary, but they were useful in detecting off-by-one
errors in the code.  I'm sure other tweaks can be made if this suits the
OP's purpose, as well as cleaning up the code a bit.

--
David Wall



Mon, 05 Apr 2004 21:52:05 GMT  
 Scaling a DNA string
On Thu, 18 Oct 2001 20:52:05 -0000,

Quote:

>> The more repetition there is, the more compression you get. You could
>> even consider not putting the minuses in.

>> 3 C4 1 G5 2 C7 2 G8 2 C4 2

> Now I feel silly.  I like this idea better than what I did first, but with
> a slight modification:

Looks good too.

I just assumed that the original poster's main goal was to get some
decent compression, so that the data woiuldn't look as large anymore.
Your's is much more readable than mine, of course :)

I think it's probably time for the OP to speak up and make sure that
we don't go off on wild tangents that are unrelated to what is wanted.
There are quite a few ideas in this thread that they could use. Witha
bit of picking and choosing they should be able to get the better
idea.

From a clean readable reporting point of view, I think your solution
wins hands down :)

Martien
--
Martien Verbruggen              |
                                |
Trading Post Australia Pty Ltd  | Curiouser and curiouser, said Alice.
                                |



Tue, 06 Apr 2004 00:33:52 GMT  
 Scaling a DNA string
Hi,

I've finally arrived back were I can post to this newsgroup, and I see that
people have been posting great responses all day long.  The answer to my
problem might lie in this thread; it will take me a while to wrap my mind
around some of the more complicated perl.  But her is a clarification of the
biology of the problem.

Martien is right; I am just looking for some decent compression so that the
data is more visually manageable.  What I didn't tell you is that I have
hundreds of these strings, that correspond to hundreds of genes and that I'm
going to need to print this out on ordinary letter size paper.  Here is what
that might look like when done (requires a non-proportional font to look
right):

--------G--------------------G---------- gene A
----------------CG--G------------------- gene B
---------------G-G-C--C----------------- gene C
-----------G---------------------------- gene D
---------------------------------------- gene E
--C------------------------------------- gene F
---------------GGG--C--G------------G--- gene G
---------------------------------------- gene H
-----------C---------------------------- gene I

In this example, each string was derived from 2000 characters, so each dash
or letter represents 50 characters.  A dash means that in those 50
characters, no motifs are found.  A "G" or "C" means that in those 50
characters, one or more motifs were found.  I know the resolution of exactly
were the motifs lie is not so good, but when I look at this I can gleam a
lot of information.  You see, I know what these genes do in most cases.  And
if I know that gene B, gene C, and gene G are all involved in cancer, that
cluster of motifs at about the 1000 character mark has meaning.  This
pattern would not be likely to occur by chance, and so I will perform some
biological experiments to see if the GGGG or CCCC are found near other
cancer genes, or what the motif can do to ordinary genes.

Martin asked the clever question:

"GGGG--CCCC" => G or C or something else?

"GGGG-GGGG-" => Only one G? or something else?

To the latter, initially I called a single GGGG in a region "g" while more
than one motif garnered a "G", but when I started to have trouble scaling I
eliminated that complications from my code.  The answer to the former is the
same.  It would be usefully to call a region with both GGGG and CCCC in it
some other letter (S perhaps) but I'm still taking baby steps He also asked
what I want to do with motifs that straddle the boundary.  The answer is,
put the mark on one side or the other but not both.  Which side does not
really matter, in the total 2000 characters I won't be off by that much.

Some wonder what exactly defines a motif.  Is it GGGG or GGGGG  and is
GGGGGGGG one motif or two.  The answer is flexible.  Eight G's in a row is
certainly two motifs; four G's is one motif.  Between 4 and 8 could be one
or two motifs, whatever is easier to implement.  I've been calling it one
motif for now.

Finally, sorry for not having a stitch of perl code in this post.  I'll be
up much of the night working on this problem, so if the interest is still
there tomorrow perhaps I can say more.



Tue, 06 Apr 2004 03:09:35 GMT  
 Scaling a DNA string
On Thu, 18 Oct 2001 22:09:35 -0400,

Quote:
> Hi,

> I've finally arrived back were I can post to this newsgroup, and I see that
> people have been posting great responses all day long.  The answer to my
> problem might lie in this thread; it will take me a while to wrap my mind
> around some of the more complicated perl.  But her is a clarification of the
> biology of the problem.

Ok, now that a few of the questions have been settled, here's a small
subroutine that (I think) does what you want. It seems rather longish,
but most of it is comments.

Before relying on its correctness, let it float around this group for
a few days. I suspect that I made an error here or there, or that
someone has a faster way to do what I do with my two substitutions.
Also, it is possible that I still have misunderstood things.

#!/usr/local/bin/perl -w
use strict;

# For this example, data file contents should be
#  CGACCACCGGCTATGGTTACCCGTGCGTCCATCC Name of gene

while (<>)
{
    my ($gene, $name) = split ' ', $_, 2;
    my $pgene = parse_gene($gene, 40);
    print "$pgene $name";

Quote:
}

# parse_gene($gene, $sample_len)
#
# Break up a gene in $sample_len blocks. For each block, output
#
#  -        if nothing interesting was found
#  c or g   if one C or G motif was found
#  C or G   if more than one C(G) motif was found
#  *        if at least one C and one G motif were found.
#
# A motif is defined as a succession of exactly 4 C or G.
#
# Boundary conditions:
#
# If a motif straddles a boundary between two samples, the right most
# sample will get this motif counted.

sub parse_gene
{
    my $gene       = shift;
    my $sample_len = shift || 50;
    my $pgene      = "";

    # Split up the gene in blocks of specified length
    #
    # If length($gene) is not a multiple of $sample_len, the last
    # sample will be shorter than the others.


    {
        # Count the number of C and G motifs entirely within the
        # string
        my $ng = () = $sample =~ /G{4}/g;
        my $nc = () = $sample =~ /C{4}/g;

        # Find the largest number of G or C at the end of the sample
        # that have not already been counted.
        # First remove everything which isn't a block of G or C at
        # the end, and reset $sample if there is no match.
        # Then remove all blocks that have already been counted
        # This could be done in one go, but the backtracking slows
        # things down unnecessarily. two steps is much faster.
        $sample =~ s/.*?(G+|C+)\z/$1/ or $sample = "";
        $sample =~ s/(G{4}|C{4})+//;

        if ($sample)
        {
            # We have found a possible start of a motif at the end of
            # the current string. We simply prepend it to the next
            # sample so it can be counted there. We only do this, if
            # there is a next sample.

        }

        $pgene .= ($ng && $nc) ? "*"
                : ($ng  > 1  ) ? "G"
                : ($ng == 1  ) ? "g"
                : ($nc  > 1  ) ? "C"
                : ($nc == 1  ) ? "c"
                :                "-" ;
    }

    return $pgene;

Quote:
}

__END__

I used this program to generate 'test genes':

#!/usr/local/bin/perl -w
use strict;

# gen_gene($len, $sample_len)
#
# Create a "gene" of length $len (default 2000) with samples of length
# $sample_len (default 6). Both arguments are optional.
# Repeatedly pick a random $sample_len section from a template.
#
sub gen_gene
{
    # The srand is here to create reproducibility.
    BEGIN { srand 123456 }

    my $template   = "ACCACCGGTCCATCAGGGTGGTTACCGTGCCCAATCCGTCTGGGCTATTCAGTC";
    my $len        = shift || 2000;
    my $sample_len = shift || 6;
    my $out        = "";

    while (length $out < $len)
    {
        my $off = int(rand length($template) - $sample_len);
        $out .= substr $template, $off, $sample_len;
    }
    return substr $out, 0, $len;

Quote:
}

print gen_gene(2000), " gene $_\n" for ("A".."F");
__END__

output of:

$ ./gen_gene.pl | dna.pl
--g--gc---------g-ggc------g------c---*------g---- gene A
---------*cgcg-c-----g--c--c*gg-----------g----c-- gene B
-c--c------c------c-----g-----g---cggcg*---c--g-c- gene C
-----g------g----c-----*-----gcgc-----g-----cgG--- gene D
C--g--C------gg--gg-------g-g---G-g---cc-----*---- gene E
-g-c--c-g-c----g------g--Cgc-----gg----------ggcc- gene F

I suspect that my test genes don't really represent nature very well.
Your example didn't contain quite as many hits as I'm seeing here :)

If you're on a system with a different rand() than mine, your genes
will vary :) I don't want to post the full fake genes I used, since
that would be 6 x 2000 bytes or so. Bit too much.

Martien
--
Martien Verbruggen              |
                                | Unix is user friendly. It's just
Trading Post Australia Pty Ltd  | selective about its friends.
                                |



Tue, 06 Apr 2004 07:09:01 GMT  
 Scaling a DNA string

Quote:

> Hi,

> I've finally arrived back were I can post to this newsgroup, and I see
> that people have been posting great responses all day long.  The
> answer to my problem might lie in this thread; it will take me a while
> to wrap my mind around some of the more complicated perl.  But her is
> a clarification of the biology of the problem.

> Martien is right; I am just looking for some decent compression so
> that the data is more visually manageable.  What I didn't tell you is
> that I have hundreds of these strings, that correspond to hundreds of
> genes and that I'm going to need to print this out on ordinary letter
> size paper.  Here is what that might look like when done (requires a
> non-proportional font to look right):

> --------G--------------------G---------- gene A
> ----------------CG--G------------------- gene B
> ---------------G-G-C--C----------------- gene C
> -----------G---------------------------- gene D
> ---------------------------------------- gene E
> --C------------------------------------- gene F
> ---------------GGG--C--G------------G--- gene G
> ---------------------------------------- gene H
> -----------C---------------------------- gene I

> In this example, each string was derived from 2000 characters, so each
> dash or letter represents 50 characters.  A dash means that in those
> 50 characters, no motifs are found.  A "G" or "C" means that in those
> 50 characters, one or more motifs were found.

[snip]

- Show quoted text -

Quote:
> Martin asked the clever question:

> "GGGG--CCCC" => G or C or something else?

> "GGGG-GGGG-" => Only one G? or something else?

> To the latter, initially I called a single GGGG in a region "g" while
> more than one motif garnered a "G", but when I started to have trouble
> scaling I eliminated that complications from my code.  The answer to
> the former is the same.  It would be usefully to call a region with
> both GGGG and CCCC in it some other letter (S perhaps)
> He also asked what I want to do with motifs that straddle the
> boundary.  The answer is, put the mark on one side or the other but
> not both.  Which side does not really matter, in the total 2000
> characters I won't be off by that much.

> Some wonder what exactly defines a motif.  Is it GGGG or GGGGG  and is
> GGGGGGGG one motif or two.  The answer is flexible.  Eight G's in a
> row is certainly two motifs; four G's is one motif.  Between 4 and 8

When compressing 2000 characters to, say, 40, the odds are that a pair
of adjacent motifs of the same type will map to the same output slot.

This sortof makes it a non-issue.

Quote:
> could be one or two motifs, whatever is easier to implement.  I've
> been calling it one motif for now.

#!/usr/local/bin/perl5.6 -w
$_ = "ACGACGTCCAGGGGGGTTGTTACGTCCCCAATCAGTCGGGGCTATTCAGTC";
my $scale = 4;
my $motifs = '';
vec( $motifs, $+[0]/$scale, 2 ) |= 1 while m/G{4}/g;
vec( $motifs, $+[0]/$scale, 2 ) |= 2 while m/C{4}/g;


__END__

This [tested!] code prints:
---G---C--G--

For big strings, set $scale to something bigger, like 50.

Or, if you want output more "cleverly" scaled, change it to:
my $scale = length($_) / 72; # where 72 is the width of your screen.

--
"What does stupid old man mean pidgin talk?
Shampoo does not talk like a bird."



Tue, 06 Apr 2004 08:16:37 GMT  
 
 [ 24 post ]  Go to page: [1] [2]

 Relevant Pages 

1. Canvas->Scale and Tk::Scale conflict?

2. out of memory with long DNA sequence

3. parsing large DNA files into smaller files

4. DNA.pm 0.01

5. DNA.pm 0.01

6. script for parsing DNA sequence files

7. perl program to create fonts.scale for X11R5 Type1 font renderer

8. looking for guttman scaling algorithm

9. module naming proposal: Number::Scale

10. Tied hash not scaling - advice?

11. Scaling up sample data in an array of hashes

12. Scale/Mode calculator

 

 
Powered by phpBB® Forum Software