Benchmarking Regular Expressions in J3.05 
 Benchmarking Regular Expressions in J3.05

I am trying to evaluate J's new regular expression technology for a client. My preliminary results
are not encouraging, but it is almost certainly due to the fact that I am not a good J programmer. I
would be happy for any suggestions that might improve the quality of my comparison.

Comparison of: Perl 5.000M5 (free public domain) (c) 1994 by Larry Wall
with:          J3.05a (c) 1994-1997 by Iverson Software Inc. Evaluation
               copy (I don't know if the `artificial delays' affect
               timing or not, but the Evaluation Edition seems to
               run J code somewhat faster than J3.03b, so I don't
               think they do)
Also:          GREP in 'KSh for Windows-NT' (Freeware from Bell Labs)
               A GREP I implemented in 1985 before any other free ones
               were available.

Environment: Dell 200mhz 64mb Pentium Pro running NT4.0

Task:  A simple loop that reads a large text file and counts each
       line on which the RE `a.*b.*c.*d.*e.*f.*g' occurs. Data is
       from several years worth of random memos, all concatenated
       into a text data file.  

Tests: Data file 14.8mb 393,408 lines. 452 lines `match' the RE
       Perl: (a) Just pass data (no RE) (b) Pass data and count.
       J: (a) Just pass data (b) Pass and count, RE interpreted
          (c) Pass and count, RE compiled.
       Bell GREP: Pass and write match
       My GREP: Pass and write match

Results: Perl No RE: 9 sec, Perl RE: 15 sec
         J No RE: 4.44038 min, J RE Interp: 8.4575 min, J RE Comp: 6.90593 min
         Bell Labs KSh GREP 11 sec
         My ancient GREP 1.71883 min

So the comparison appears *very* unfavorable to J. If I read the results correctly, PERL spends 6
seconds evaluating REs that take J 2.5 mins (compiled) or 4 mins (interpreted). The time for the
NT/Unix based GREP is nearly as impressive as PERL's. My GREP is very slow by comparison, but still
substantially faster than J.

I received a very helpful note from Mr MacKenzie that appears to allow me to cut the `J No RE' time
from 4.4 mins down to about 3.2 mins, but that is only a small step.
A major portion of the results must surely be due to my reading the file very inefficiently as far
as J is concerned. Even given this it does seem that J takes a lot of time for RE calculations, at
least in comparison with PERL (where I would have thought my loop would mean that the RE is actually
re-interpreted with each line).

I would be happy to receive suggestions about how to make the comparison more fair. Of course, I'd
be happy to circulate the code as well...
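[The Perl side of the test was not posted; presumably it was something close to this sketch. The file name is a placeholder, not the poster's actual data file.]

```perl
#!/usr/bin/perl
# Hypothetical reconstruction of the benchmark loop described above:
# read the file line by line, count lines matching the RE.
# 'memos.txt' is a placeholder file name.
use strict;

my $ct = 0;
open(IF, "< memos.txt") or die "open: $!";
while (<IF>) {
    ++$ct if /a.*b.*c.*d.*e.*f.*g/;   # count lines on which the RE occurs
}
close(IF);
print "$ct\n";
```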



Tue, 28 Mar 2000 03:00:00 GMT  
 Benchmarking Regular Expressions in J3.05


David Ness writes on Friday, October 10:

Quote:
> I am trying to evaluate J's new regular expression technology
> for a client. My preliminary results are not encouraging, but it is
> almost certainly due to the fact that I am not a good J programmer. I
> would be happy for any suggestions that might improve the quality of
> my comparison.

> Comparison of: Perl 5.000M5 (free public domain) (c) 1994 by Larry Wall
> with:          J3.05a (c) 1994-1997 by Iverson Software Inc. Evaluation
>                copy (I don't know if the `artificial delays' affect
>                timing or not, but the Evaluation Edition seems to
>                run J code somewhat faster than J3.03b, so I don't
>                think they do)
> Also:          GREP in 'KSh for Windows-NT' (Freeware from Bell Labs)
>                A GREP I implemented in 1985 before any other free ones
>                were available.

> Environment: Dell 200mhz 64mb Pentium Pro running NT4.0

> Task:  A simple loop that reads a large text file and counts each
>        line on which the RE `a.*b.*c.*d.*e.*f.*g' occurs. Data is
>        from several years worth of random memos, all concatenated
>        into a text data file.

> Tests: Data file 14.8mb 393,408 lines. 452 lines `match' the RE
>        Perl: (a) Just pass data (no RE) (b) Pass data and count.
>        J: (a) Just pass data (b) Pass and count, RE interpreted
>           (c) Pass and count, RE compiled.
>        Bell GREP: Pass and write match
>        My GREP: Pass and write match

> Results: Perl No RE: 9 sec, Perl RE: 15 sec
>          J No RE: 4.44038 min, J RE Interp: 8.4575 min, J RE Comp: 6.90593 min
>          Bell Labs KSh GREP 11 sec
>          My ancient GREP 1.71883 min
> ...

All timings below obtained on Pentium 166 mhz, 64 mb,
running NT 4.

I will address mainly the question of file reads.
First, I created a random file with roughly the
specified characteristics:

   Lf=: 10{a.
   t=: 14.8e6$(32+i.95){a.
   t=: Lf (+/\ ?. 4e5$74)}t
   t 1!:2 <'temp\foo.x'

Reading this file in a subsequent session,

   timer 't=.1!:1 <''temp\foo.x'''
3.935

This sets a lower bound on the time required, with more
time required if space limitations force reading the
file a block at a time.

The following code reads the file about fBlock
bytes at a time, returning a string terminated by Lf.
Any slop-over (fSlop) is tacked onto the front of the
next read, and on the last read there would be no
slop-over.  Three functions are provided:

FInit takes a file name argument, and initializes
various global variables.

FRead reads the next block of data, and returns an
Lf-terminated string.

Process is an example of the "No RE" case:  just reading
the file a block at a time, with a trivial amount of
processing.

FInit=: 3 : 0
 fName =: y.
 fIndex=: 0
 fSize =: 1!:4 y.
 fSlop =: ''
 i.0 0
)

FRead=: 3 : 0
 n=. fBlock <. fSize-fIndex
 i=. fIndex
 fIndex=: n+fIndex
 t=. fSlop,1!:11 fName,<i,n
 i=. (n=fBlock) * - (|.t) i. Lf  NB. index of last Lf
 fSlop=: i{.t
 i}.t
)

Process=: 3 : 0
 k=. 0
 FInit y.
 for. i. 0 ,~ >.fSize % fBlock do.
  k=. k+#FRead ''
 end.
)

Using a block size of 1 mb:

   fBlock=: 1e6
   timer 'k=: Process <''temp\foo.x'''
7.341

   k
14800000
   $fSlop
0

And a block size of 100 kb, in a new session:

   fBlock=: 1e5
   timer 'k=. Process <''temp\foo.x'''
7.591

That is, reading the file a block at a time, with
trivial processing, takes about 7.5 seconds on a P166,
compared to the time you obtained of 4.44 minutes
on a Pentium Pro 200 mhz.

Regarding the regular expression evaluation:  I suspect
that the computation would be much faster, if you give it
a block at a time, Lf embedded and all.  (You would have
to modify the RE to take the Lf into account.)  It is
generally true that giving the interpreter big chunks
of data at a time leads to faster computation than
giving it a small chunk at a time.
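[In Perl terms, the suggestion amounts to rewriting the pattern so that `.' cannot cross a line boundary, then handing the RE a whole block at once. A sketch, with invented block contents:]

```perl
# Sketch: run the per-line pattern over a whole multi-line block.
# Replacing '.' with '[^\n]' keeps each match inside one line, so the
# embedded Lf characters are taken into account; anchoring at '^' with
# /m yields at most one match per line, i.e. a count of matching lines.
use strict;

my $block = "abcdefg\nnothing here\nxaxbxcxdxexfxg\n";
my $re = join '[^\n]*', 'a' .. 'g';        # a[^\n]*b[^\n]*...g
my $ct = () = $block =~ /^[^\n]*?$re/mg;   # count matching lines
print "$ct\n";                             # prints 2 for this block
```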



Wed, 29 Mar 2000 03:00:00 GMT  
 Benchmarking Regular Expressions in J3.05

Quote:
> > Task:  A simple loop that reads a large text file and counts each
> >        line on which the RE `a.*b.*c.*d.*e.*f.*g' occurs. Data is
> >        from several years worth of random memos, all concatenated
> >        into a text data file.
 ...
> Regarding the regular expression evaluation:  I suspect
> that the computation would be much faster, if you give it
> a block at a time, Lf embedded and all.  (You would have
> to modify the RE to take the Lf into account.)  It is
> generally true that giving the interpreter big chunks
> of data at a time leads to faster computation than
> giving it a small chunk at a time.

The problem with this approach is that the logic of the program gets
correspondingly more complex and counter-intuitive, leading to increased
development time and higher maintenance costs.

It's interesting to see that the "natural" solution to the problem in
Perl is reasonably efficient.  Out of curiosity, I tried several variations
in Perl for comparison.  My platform is a low end SPARC (SPARC 4, 110MHz
with 64M running Solaris 2.5 and Perl 5.004) which cost the University about
$4.6K (Cdn) to put on my desk.

The most convenient file I could find was a mail folder of Sun announcements,
about half the size of the one in the original message (7,211,522 bytes with
201,446 lines and 78 matches.)  The file in my case was on an NFS mount point,
as are the Perl libraries.  The resolution of the CPU time is .01 sec, while
the elapsed time computation is truncated to a second.

Using what I consider the most "natural" Perl constructs:

     while(<IF>) {}

        consumed typically 3.31 seconds CPU in 3 seconds of elapsed time.

     while(<IF>) { /a.*b.*c.*d.*e.*f.*g/ && ++$ct }

        consumed between 10.4 and 13.5 seconds CPU over about 24 sec elapsed

Perl's driving philosophy is TIMTOWTDI (There is more than one way to do it.)
so I tried some other Perl constructs:

     $ct= grep(/a.*b.*c.*d.*e.*f.*g/,<IF>);

        made little difference in CPU but the elapsed was typically double

Perl also allows you to read blocks of data, so I tried the following logic
to read and parse the file:

         my $buf= '';
         for($r=1;;++$r) {
             my $rl= sysread(IF,$buf,$chunk_size,length $buf);
             last unless $rl;
             $buf=~ s{a.*b.*c.*d.*e.*f.*g.*$}{++$ct;''}megx;
             $buf=($buf=~/\n([^\n]+)$/s) ? $1 : '';
         }

Omitting the two statements that parse, so that I'm just reading the file,
reduced the CPU time by an order of magnitude: typically .3 sec CPU in an
elapsed time of a second or less.  Note that the expression that now picks
the pattern out of a bunch of lines is more complex, and I need a second
expression to pick up the possible fragment of a line in the buffer before
looping.  With the parsing included, the CPU time increased to about 17
seconds with an elapsed time of 28.  Making the $chunk_size 4K and 1M in
two separate runs only dropped the CPU time by about a second (reading
1762 chunks vs 8 chunks.)

An expert in Perl could probably optimize this somewhat.
--



Wed, 29 Mar 2000 03:00:00 GMT  
 Benchmarking Regular Expressions in J3.05

Thanks for a very prompt, interesting and illuminating reply. I'll
rework my code along the lines you suggest...

Quote:
> <full quote of the preceding message snipped>



Wed, 29 Mar 2000 03:00:00 GMT  
 Benchmarking Regular Expressions in J3.05

Quote:


> David Ness writes on Friday, October 10:

><snip of my earlier>

> <snip of block-reading code and timings>

> It is generally true that giving the interpreter big chunks
> of data at a time leads to faster computation than
> giving it a small chunk at a time.

I think I have (successfully) used your code to access my file in blocks. As your timing figures
suggest, I am able to acquire the blocks quite quickly, but the process still falls apart when I try
to use the individual lines.

I have implemented FInit and FRead pretty much as you suggest. They seem to work well, and if I just
FRead the blocks of my data file, I do get them in a few seconds. However, what I need, before even
getting to the Regular Expression problem, is to access the lines individually. So I have tried to
create `Line' that will let me get to the data one line at a time as I do in PERL.

I have tried one approach where I looked for all of the LFs in the block that was loaded, and then
tried to pick off the lines one by one. This gave me:

 LFIdx =: ''
 Line =: 3 : 0
   if. 3 > #LFIdx do.
 NB.   LFIdx is EOL+1 = BOL positions, 1st is already read or dummy

       Sel =: }. {.&a
       if. 0 = #a do. i. 0 0 return. end.
       end.
   Sel/2{.LFIdx =: }.LFIdx
 )

This seemed to take forever, so I quit it.

Then I tried a different approach where I would use `Cut' to chop the block into lines...

 Lines =: ''
 Line =: 3 : 0
   if. 0 = #Lines do.
      Lines =: < ;. _2 FRead ''
      if. 0 = #Lines do. i. 0 0 return. end.
      end.
 t =. {. Lines
 Lines =: }. Lines
 >t
 )

Although the cut itself seemed blazingly fast, this ended up taking forever too. I quit it, as well.

I'd be happy to try other approaches. However, I obviously am having considerable trouble developing
a sense of how much time it takes to perform functions. So far, accessing the individual lines seems to
cause my processes to take (many) minutes, and I need an approach that will get the time within
shooting range of the 9 secs. it takes PERL to access the file line by line, or the 15 secs. it takes to
handle the whole problem, RE calculations included. So I'm still far, far from where I'd like to be.



Thu, 30 Mar 2000 03:00:00 GMT  
 Benchmarking Regular Expressions in J3.05

Quote:

> Then I tried a different approach where I would use `Cut' to chop the block into lines...

>  Lines =: ''
>  Line =: 3 : 0
>    if. 0 = #Lines do.
>       Lines =: < ;. _2 FRead ''
>       if. 0 = #Lines do. i. 0 0 return. end.
>       end.
>  t =. {. Lines
>  Lines =: }. Lines
>  >t
>  )

> Although the cut itself seemed blazingly fast, this ended up taking forever too. I quit it, as well.

Languages have their own idioms.  For J, perhaps you want to try replacing <
with whatever you want done on each line.
eythan


Fri, 31 Mar 2000 03:00:00 GMT  
 Benchmarking Regular Expressions in J3.05

Quote:

>Then I tried a different approach where I would use `Cut' to chop the block into lines...
> Lines =: ''
> Line =: 3 : 0
>   if. 0 = #Lines do.
>      Lines =: < ;. _2 FRead ''
>      if. 0 = #Lines do. i. 0 0 return. end.
>      end.
> t =. {. Lines
> Lines =: }. Lines
> >t
> )
>Although the cut itself seemed blazingly fast, this ended up taking forever too. I quit it, as well.

The problem is

 Lines =: }. Lines

When you behead a big array, you get an almost-as-big array, and J
copies the whole new array.  Done once per line, this automatically makes
the function O(*: n), i.e. quadratic in the number of lines.

The natural way to do this sort of thing in J is to let the function
return an ARRAY of lines, not just one line at a time; then
you can do

  function L:0

on the resulting array.  If the file is too big to fit, make sure that
you keep processing big subsets at all times.
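[A Perl analogue of the split-once, one-pass style described above; a sketch with invented sample data, not the poster's code:]

```perl
# Whole-array style: cut the block into lines once, then apply the
# per-line test to every element in a single pass. No per-line
# beheading, so no repeated copying of an almost-as-big array.
use strict;

my $block = "abcdefg\nplain line\naabbccddeeffgg\n";
my @lines = split /\n/, $block;                   # cut once
my $ct = grep { /a.*b.*c.*d.*e.*f.*g/ } @lines;   # one pass
print "$ct\n";                                    # prints 2
```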

Henry Rich



Sat, 01 Apr 2000 03:00:00 GMT  
 Benchmarking Regular Expressions in J3.05

Quote:

> How do you compute the sum of a vector in PERL?


--
Raul



Mon, 03 Apr 2000 03:00:00 GMT  
 Benchmarking Regular Expressions in J3.05

Quote:

>    + /     sum
>    >./     max

>    + /\    sum prefix (partial sums)
>    >./\    max prefix (partial maxima)

>    + /\.   sum suffix
>    >./\.   max suffix

> How do you effect the above three families of
> computations in PERL or the Perl Data Language?

It's true that these families of computation are not provided natively
in perl.  However, it's not unreasonable to define them:

######################################################################
# provide verbs and adverbs needed for families

sub add {(shift)+shift}




######################################################################
# define families

$sum= \&{insert(\&add)};
$max= \&{insert(\&big)};

$partial_sums= \&{prefix($sum)};
$partial_maxima= \&{prefix($max)};

$trailing_sums= \&{suffix($sum)};
$trailing_maxima= \&{suffix($max)};

######################################################################
# example use



p;


p;


######################################################################
# sum: 21
# max: 6
#
# sum prefix: 1 3 6 10 15 21
# max prefix: 1 2 3 4 5 6
#
# sum suffix: 21 20 18 15 11 6
# max suffix: 6 6 6 6 6 6
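[Several of the definitions above (big, insert, prefix, suffix) were lost from
the archive. The following is a hypothetical reconstruction -- my guess at the
missing pieces, not the original code -- that reproduces the printed results:]

```perl
use strict;

sub add { (shift) + shift }                        # survives in the post
sub big { my ($x, $y) = @_; $x > $y ? $x : $y }    # assumed: dyadic max, J's >.

# insert -- J's f/ : fold a dyad over a list
sub insert {
    my $f = shift;
    return sub { my $r = shift; $r = $f->($r, $_) for @_; $r };
}

# prefix -- J's f/\ : apply the folded verb to each leading subsequence
sub prefix {
    my $v = shift;
    return sub { my @a = @_; map { $v->(@a[0 .. $_]) } 0 .. $#a };
}

# suffix -- J's f/\. : apply it to each trailing subsequence
sub suffix {
    my $v = shift;
    return sub { my @a = @_; map { $v->(@a[$_ .. $#a]) } 0 .. $#a };
}

my $sum = insert(\&add);
my $max = insert(\&big);

print "sum: ", $sum->(1 .. 6), "\n";                            # 21
print "sum prefix: ", join(' ', prefix($sum)->(1 .. 6)), "\n";  # 1 3 6 10 15 21
print "max suffix: ", join(' ', suffix($max)->(1 .. 6)), "\n";  # 6 6 6 6 6 6
```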

--
Raul



Mon, 03 Apr 2000 03:00:00 GMT  
 Benchmarking Regular Expressions in J3.05

Raul Miller writes on Wednesday, October 15:

Quote:

> >    + /     sum
> >    >./     max

> >    + /\    sum prefix (partial sums)
> >    >./\    max prefix (partial maxima)

> >    + /\.   sum suffix
> >    >./\.   max suffix

> > How do you effect the above three families of
> > computations in PERL or the Perl Data Language?

> It's true that these families of computation are not provided natively
> in perl.  However, it's not unreasonable to define them:

> ######################################################################
> # provide verbs and adverbs needed for families

> sub add {(shift)+shift}




> ######################################################################
> # define families

> $sum= \&{insert(\&add)};
> $max= \&{insert(\&big)};

> $partial_sums= \&{prefix($sum)};
> $partial_maxima= \&{prefix($max)};

> $trailing_sums= \&{suffix($sum)};
> $trailing_maxima= \&{suffix($max)};

This sub-thread started with the assertion that, when working on large pieces
of data, "the logic of the program gets correspondingly more complex and
counter-intuitive leading to increased development time and higher maintenance
costs".  I believe the above examples refute that assertion.


Tue, 04 Apr 2000 03:00:00 GMT  
 