RFI: Counting repeated substrings within a string 
Author Message
 RFI: Counting repeated substrings within a string

Hi,

 I'm working on a cryptology problem.  I would like to create
a perl script that will read in a cyphertext file, and find all
repeated substrings within that file.

 I have some experience with perl, but so far I have gotten very far.
If anyone could give me some hints about how to go about this,
I'd appreciate it.

Ex: An input of 'werasdwerasdwer' would return all of the times
that 'wer', 'era', 'ras', 'asd' occured.  (For the assignment, I'm just
going to be looking at repeated patterns with 4 or more characters, then
do some frequency study on the results to try and crack the cyphertext.)

Thanks for any help,
Alfred Landrum



Thu, 12 Oct 2000 03:00:00 GMT  
 RFI: Counting repeated substrings within a string

[posted and mailed]

Quote:

> Hi,

>  I'm working on a cryptology problem.  I would like to create
> a perl script that will read in a cyphertext file, and find all
> repeated substrings within that file.

> Ex: An input of 'werasdwerasdwer' would return all of the times
> that 'wer', 'era', 'ras', 'asd' occured.  (For the assignment, I'm just
> going to be looking at repeated patterns with 4 or more characters, then
> do some frequency study on the results to try and crack the cyphertext.)

How many more characters?

This one does up to the length of the entire string.

for ($pos=0; $pos <= length($_) - 4; ++$pos) {
  for ($len=4; $len <= length($_) - $pos; ++$len) {
    $substr = substr($_, $pos, $len);
    next if exists $freq{$substr};
    $freq{$substr} = () = /\Q$substr/g;
  }

Quote:
}

--


    /                                   http://www.ziplink.net/~rjk/
        "It's funny 'cause it's true ... and vice versa."


Thu, 12 Oct 2000 03:00:00 GMT  
 RFI: Counting repeated substrings within a string


Quote:

> Hi,

>  I'm working on a cryptology problem.  I would like to create
> a perl script that will read in a cyphertext file, and find all
> repeated substrings within that file.

>  I have some experience with perl, but so far I have gotten very far.
> If anyone could give me some hints about how to go about this,
> I'd appreciate it.

> Ex: An input of 'werasdwerasdwer' would return all of the times
> that 'wer', 'era', 'ras', 'asd' occured.  (For the assignment, I'm just
> going to be looking at repeated patterns with 4 or more characters, then
> do some frequency study on the results to try and crack the cyphertext.)

======================

I don't know why I'm doing a student's homework for him, but I couldn't resist.

I'm not claiming this to be the fastest algorithm possible, but it'll get
you started.

-----------------------------

#!/usr/local/bin/perl -w
# Another fine example of Newbieware(TM)
use strict;


my $result   = '';

my $s = 'werasdwerasdwerwerasdwerasdwerwerasdwerasdwer';
my $minlen   =  2;
my $maxlen   = 16;

for (my $patlen = $minlen; $patlen <= $maxlen; $patlen++) {

   for (my $i = 0; $i <= length($s)-$patlen; $i++) {

      my $ss = substr($s, $i, $patlen);

   }

Quote:
}


   $result .= $_.' : '.$s =~ s/$_/$_/g."\n";

Quote:
}

print $result;

# This hash could be used to analyze the results
# (sort by frequency, etc.):

__END__

Use this for legal purposes only, eh? :-)

A.L.,
who obviously has no life



Thu, 12 Oct 2000 03:00:00 GMT  
 RFI: Counting repeated substrings within a string


Quote:
>  I'm working on a cryptology problem.  I would like to create
> a perl script that will read in a cyphertext file, and find all
> repeated substrings within that file.

Well, you should probably put a limit on the minimum length of such a
substring - substrings of length one are too common to be generally
useful! :-)  

But here's some code like what I've done for some cryptology I've done in
Perl; it's fairly fast. Note that this won't report 'alf' from 'alfalfa',
since it finds 'alfa' in there. If you want those shorter strings, though,
they're easy to generate from the ones given.

Hope this helps!

--
Tom Phoenix       Perl Training and Hacking       Esperanto
Randal Schwartz Case:     http://www.rahul.net/jeffrey/ovs/

#!/usr/bin/perl -w

use strict;

sub repstrs ($) {
    # returns a list of repeated substrings in the given string
    # (Given string must not contain any null characters.)

    my($copy, $mask, %hash);
    for (
            $copy = substr($string,1) . "\0";
            $copy =~ /[^\0]/;
            $copy = substr($copy,1) . "\0") {
        $mask = $string ^ $copy;
        while ($mask =~ /(\0{2,})/g) {
            my $len = length $1;
            $hash{ substr($string, pos($mask)-$len, $len) }++;
        }
    }
    keys %hash;

Quote:
}

for ('werasdwerasdwer', 'this is a test with some repeated substrings',
        'fred is a man who is a fanatic fan of manfred mann', 'alfalfa'
        ) {
    print "$_\n", map("  '$_'\n", repstrs($_)), "\n";
Quote:
}

__END__


Thu, 12 Oct 2000 03:00:00 GMT  
 RFI: Counting repeated substrings within a string

Quote:

> Ex: An input of 'werasdwerasdwer' would return all of the times
> that 'wer', 'era', 'ras', 'asd' occured.  (For the assignment, I'm just
> going to be looking at repeated patterns with 4 or more characters, then
> do some frequency study on the results to try and crack the cyphertext.)

Not that I like doing other people's homework, but this is a somewhat
interesting problem....

#!/usr/local/bin/perl

$i='werasdwerasdwer';
$chars = 3; # look at patters with 3 or more chars...

map {$count{substr $i, $_, $chars}++} ( 0 .. (length($i)-$chars));
print map {"$_ : $count{$_}\n"} keys %count;

zone:~ >perl /tmp/ct.pl
dwe : 2
asd : 2
era : 2
sdw : 2
ras : 2
wer : 3

You missed a few that were repeated.  This can probably be optimized
a bit...

Reza



Thu, 12 Oct 2000 03:00:00 GMT  
 RFI: Counting repeated substrings within a string


MCMXCIII in <URL: :">
++ [posted and mailed]
++
++ >
++ > Hi,
++ >
++ >  I'm working on a cryptology problem.  I would like to create
++ > a perl script that will read in a cyphertext file, and find all
++ > repeated substrings within that file.
++ >
++ > Ex: An input of 'werasdwerasdwer' would return all of the times
++ > that 'wer', 'era', 'ras', 'asd' occured.  (For the assignment, I'm just
++ > going to be looking at repeated patterns with 4 or more characters, then
++ > do some frequency study on the results to try and crack the cyphertext.)
++
++ How many more characters?
++
++ This one does up to the length of the entire string.
++
++ for ($pos=0; $pos <= length($_) - 4; ++$pos) {
++   for ($len=4; $len <= length($_) - $pos; ++$len) {
++     $substr = substr($_, $pos, $len);
++     next if exists $freq{$substr};
++     $freq{$substr} = () = /\Q$substr/g;
++   }
++ }

That would require a pretty big machine for any non trivial text.

If the length of the text is a few kilobytes, the number of subpatterns
will be counted in millions. Perl isn't the most suitable language for
the above algorithm.

Abigail
--
perl -wle '$, = " "; sub AUTOLOAD {($AUTOLOAD =~ /::(.*)/) [0];}
           print+Just (), another (), Perl (), Hacker ();'



Fri, 13 Oct 2000 03:00:00 GMT  
 RFI: Counting repeated substrings within a string

Quote:


> MCMXCIII in <URL: :">
> ++
> ++ for ($pos=0; $pos <= length($_) - 4; ++$pos) {
> ++   for ($len=4; $len <= length($_) - $pos; ++$len) {
> ++     $substr = substr($_, $pos, $len);
> ++     next if exists $freq{$substr};
> ++     $freq{$substr} = () = /\Q$substr/g;
> ++   }
> ++ }

> That would require a pretty big machine for any non trivial text.

> If the length of the text is a few kilobytes, the number of subpatterns
> will be counted in millions. Perl isn't the most suitable language for
> the above algorithm.

Well, I did ask the poster how many characters long he wanted the substrings
to be.  :-)


hash values makes a lot more sense.  Guess I was trying to be too clever.

$minlen = 4;
$maxlen = 6;
for ($pos=0; $pos <= length($_) - $minlen; ++$pos) {
  for ($len=$minlen;
       ($len <= $maxlen) and ($pos+$len <= length $_);
       ++$len) {
    $substr = substr($_, $pos, $len);
    $freq{$substr}++;
  }

Quote:
}

--


    /                                   http://www.ziplink.net/~rjk/
        "It's funny 'cause it's true ... and vice versa."


Fri, 13 Oct 2000 03:00:00 GMT  
 
 [ 8 post ] 

 Relevant Pages 

1. counting substrings in strings

2. Substrings within strings

3. Count the number of occurance within a string

4. fast ways to count substrings

5. counting substrings

6. Extracting substrings from strings

7. how can I cut a string into specific length substrings

8. simplifying count of a string in another string?

9. Repeat string matching/substituting ?

10. repeated string

11. Finding a string within a string

12. Counting values within a file

 

 
Powered by phpBB® Forum Software