regexp: matching at least n chars out of a string of length m

Quote:

>I have a collection of strings which contain many occurances of the

>character ':'. I'd like to know, for each string, all the substrings of

>length $window than contain $want colons.

i think i understand this problem, so let me explain it back to you and

you tell me if i got it wrong: for a substring of length N, find all

substrings that have *at least* M :'s, but perhaps more. the : can be

anywhere in the substring. the substring has to be of length N - it

can't be shorter or longer.

(i hope that is right)

given that, and knowing that using matching operators will be horribly

inefficient, i propose a solution which i think is close to linear in

length of the substring (my suspicion is that it is a little less than

linear for most cases).

the philosophy of the algorithm is to start at the beginning of the

string and check the substring of length N for the right number of

occurances of :. if the right number are there, remember the offset

for that substring, then move the window along and check again, and

so on.

i haven't tested this extensively, so there might be quirks, but i

need to get back to the work i'm getting paid to do :)

#!/usr/bin/perl

#forgive me, i'm not a computer scientist...

my $string = 'asdF:asdf:safg:sg:saf:fsad::sfg:sdgfsdfg:::sdfg:sdfg:sfd';

my $window_size = 15;

my $number_needed = 5;

my $offset = 0;

my $length = length $string;

while( ($length - $offset) >= $window_size )

{

$current_window = substr($string, $offset, $window_size);

$number_found = ($current_window =~ tr/:/:/);

#if we found less than we needed, we can move the window

#ahead by the number of extra finds needed since the intervening

#substrings won't match. e.g. if we found M-2, moving

#the window one place will match *at most* M-1, so go ahead

#and skip that one too.

if( ($difference = ($number_needed - $number_found) ) > 0 )

{

$offset += $difference;

}

#if we found the exact number that is needed, remember this

#starting position and advance the window by one place

elsif( $difference == 0 )

{

}

#if we found more than that needed, skip ahead to the

#next place where there might be a possible match

# e.g. if we found M+2, then we can move the window

#two places ahead since moving

#the window by one place will find *at least* M+1, and

#so on. we just need to record the offsets

elsif( $difference < 0 )

{

while( $difference < 0 )

{

$difference++;

}

}

}

{

print substr($string, $_, $window_size), "\n";

}

__END__

There were 5 matches:

g:sg:saf:fsad::

:sg:saf:fsad::s

:saf:fsad::sfg:

fg:::sdfg:sdfg:

g:::sdfg:sdfg:s

:::sdfg:sdfg:sf

--

NY.pm - New York Perl M((o|u)ngers|aniacs)* <URL:http://ny.pm.org/>

CGI Meta FAQ <URL:http://computerdog.com/CGI_MetaFAQ.html>