fast ways to count substrings 
Author Message
 fast ways to count substrings

Previously, File::Tail would only handle single line entries delmited
with a hardcoded "\n".  The program counted lines using tr/\n//. Since
i wanted to handle multicharacter delimiters, i needed to count using
a different technique. File::Tail is often used on rapidly growing
logs, so the developer will not include my patch unless I can verify
that it is close to the speed of tr/\n//. I attempted a variety of
techniques with pure perl and then resorted to Inline.pm in order to
make use of theorectically faster C. The benchmarks from my 200mhz
Pentium are enclosed as well as the benchmarking code including the C
routines. My question to the group - "did I miss any faaast techniques
which can be used to count either single character or multi character
delimiters in a string of data?? the delimiter needs to be variable at
runtime. Actually File::Tail is object oriented, so the delimiter is
encapsulated in the File::Tail object".

Benchmark: timing 300 iterations of c_strncmp, c_strncmp_M, c_strstr, c_strstr_M, m_array,
m_array_M, m_for, m_for_M, m_while, m_while_M, tr...
 c_strncmp:  7 wallclock secs ( 7.44 usr +  0.00 sys =  7.44 CPU)
c_strncmp_M:  8 wallclock secs ( 7.31 usr +  0.00 sys =  7.31 CPU)
  c_strstr:  1 wallclock secs ( 1.36 usr +  0.00 sys =  1.36 CPU)
c_strstr_M:  1 wallclock secs ( 1.18 usr +  0.00 sys =  1.18 CPU)
   m_array: 14 wallclock secs (12.25 usr +  0.00 sys = 12.25 CPU)
 m_array_M:  5 wallclock secs ( 4.91 usr +  0.00 sys =  4.91 CPU)
     m_for: 12 wallclock secs (11.60 usr +  0.00 sys = 11.60 CPU)
   m_for_M:  5 wallclock secs ( 4.91 usr +  0.00 sys =  4.91 CPU)
   m_while:  8 wallclock secs ( 8.41 usr +  0.00 sys =  8.41 CPU)
 m_while_M:  4 wallclock secs ( 3.94 usr +  0.00 sys =  3.94 CPU)
        tr:  2 wallclock secs ( 1.31 usr +  0.00 sys =  1.31 CPU)

=== check counts ===
        _m_while_M  2000
        _c_strncmp  6000
            _m_for  6000
          _m_array  6000
         _c_strstr  6000
          _m_while  6000
      _c_strncmp_M  2000
               _tr  6000
          _m_for_M  2000
        _m_array_M  2000
       _c_strstr_M  2000

----- snip -----
#!/usr/bin/perl -w
use Benchmark('timethese');
use Inline C;

my $tststr="abcdefghijklmnopqrstuvwxyz01234567890\n";
my $dta=($tststr x 2000).( ($tststr."\n") x 1900).("\n" x 200);
my %cnt=();

my $_tr = sub {
   my $cnt=$dta=~tr/\n//;
   $cnt{_tr}=$cnt;

Quote:
};

my $_m_while = sub {
   my $cnt=0;
   my $rcdsep="\n";
   while ($dta=~m/$rcdsep/g) {$cnt++};
   $cnt{_m_while}=$cnt;

Quote:
};

my $_m_for = sub {
   my $cnt=0;
   my $rcdsep="\n";
   $cnt++ for($dta=~m/$rcdsep/g);
   $cnt{_m_for}=$cnt;

Quote:
};

my $_m_array = sub {
   my $cnt=0;
   my $rcdsep="\n";

   $cnt{_m_array}=$cnt;

Quote:
};

my $_m_while_M = sub {
   my $cnt=0;
   my $rcdsep="\n\n";
   while ($dta=~m/$rcdsep/g) {$cnt++};
   $cnt{_m_while_M}=$cnt;

Quote:
};

my $_m_for_M = sub {
   my $cnt=0;
   my $rcdsep="\n\n";
   $cnt++ for($dta=~m/$rcdsep/g);
   $cnt{_m_for_M}=$cnt;

Quote:
};

my $_m_array_M = sub {
   my $cnt=0;
   my $rcdsep="\n\n";

   $cnt{_m_array_M}=$cnt;

Quote:
};

my $_c_strncmp = sub {
   my $rcdsep="\n";
   my $cnt=c_strncmp($dta,$rcdsep);
   $cnt{_c_strncmp}=$cnt;

Quote:
};

my $_c_strncmp_M = sub {
   my $rcdsep="\n\n";
   my $cnt=c_strncmp($dta,$rcdsep);
   $cnt{_c_strncmp_M}=$cnt;

Quote:
};

my $_c_strstr = sub {
   my $rcdsep="\n";
   my $cnt=c_strstr($dta,$rcdsep);
   $cnt{_c_strstr}=$cnt;

Quote:
};

my $_c_strstr_M = sub {
   my $rcdsep="\n\n";
   my $cnt=c_strstr($dta,$rcdsep);
   $cnt{_c_strstr_M}=$cnt;

Quote:
};

$_tr->(); # seed it??

timethese(300,
               { tr =>         $_tr,
                 m_while =>    $_m_while,   m_for =>       $_m_for,   m_array =>   $_m_array,
                 m_while_M =>  $_m_while_M, m_for_M =>     $_m_for_M, m_array_M => $_m_array_M,
                 c_strncmp =>  $_c_strncmp, c_strncmp_M => $_c_strncmp_M,
                 c_strstr =>   $_c_strstr,  c_strstr_M =>  $_c_strstr_M
               }
         );

print "\n=== check counts ===\n";
printf("%18s %5d\n",$_,$cnt{$_}) for (keys %cnt);

__END__
__C__

#include <string.h>

int c_strncmp(char * buf, char * sep){
   int cnt = 0;
   int seplen = strlen(sep);
   for(; *buf ; buf++)
      if(!strncmp(buf,sep,seplen)){
         cnt++;
         buf+=seplen-1;
      }

   return cnt;

Quote:
}

int c_strstr(char * buf, char * sep){
   int cnt = 0;
   int seplen = strlen(sep);
   for(; *buf; buf+=seplen, cnt++)
      buf=strstr(buf,sep);
   return cnt;

Quote:
}

---- snip ----

thanks in advance!!

___cliff rayman___cliff_AT_rayman.com___



Fri, 22 Aug 2003 09:13:34 GMT  
 fast ways to count substrings

Quote:
>Previously, File::Tail would only handle single line entries delmited
>with a hardcoded "\n".  The program counted lines using tr/\n//. Since

[snip]

Quote:
>which can be used to count either single character or multi character
>delimiters in a string of data?? the delimiter needs to be variable at
>runtime. Actually File::Tail is object oriented, so the delimiter is
>encapsulated in the File::Tail object".

So you've just read this data in from a log? If so, how about
counting as you read? Set the input record delimiter $/ to your
varying $rcdsep value:

{
   local $/ = $rcdsep;


Quote:
}

# or, if all you want is the count (non-blank here):

{
   local $/ = $rcdsep;
   while (<LOG>) { $cnt++ unless /^$/ }

Quote:
}

NB: if you're storing a flag about the record separator within the
file itself, requiring you to read a few "normal" (\n) lines first,
you can change $/ after that for what's left in the file, no prob.

HTH

1;

--

   - Bruce

__bruce_van_allen__santa_cruz_ca__



Fri, 22 Aug 2003 11:08:19 GMT  
 fast ways to count substrings

Quote:


> >Previously, File::Tail would only handle single line entries delmited
> >with a hardcoded "\n".  The program counted lines using tr/\n//. Since

> [snip]

> >which can be used to count either single character or multi character
> >delimiters in a string of data?? the delimiter needs to be variable at
> >runtime. Actually File::Tail is object oriented, so the delimiter is
> >encapsulated in the File::Tail object".

> So you've just read this data in from a log? If so, how about
> counting as you read? Set the input record delimiter $/ to your
> varying $rcdsep value:

nope. no can do.  i wouldn't have a problem if we were slurping
up a plain old file.  check out what File::Tail does and how the developer
was careful not to use up endless clock cycles waiting for the file to be
written.  this should give u a little more background on what we are trying
to do.

--



Sat, 23 Aug 2003 14:47:18 GMT  
 
 [ 3 post ] 

 Relevant Pages 

1. ODBC and Delphi

2. The weekly FAQ item about the Pascal newsgroups' reorganization

3. counting substrings

4. counting substrings in strings

5. RFI: Counting repeated substrings within a string

6. Ways of making this subroutine faster

7. Faster yet way to count lines in a file (or at least cooler)

8. wish fast method of counting of characters

9. Gosh, s// faster than m// (was Re: Counting characters with tr/// (in perl))

10. Fastest way to count characters?

11. fastest way to count hash

12. Please advise. Fastest way to line-count files

 

 
Powered by phpBB® Forum Software