FAQ: How do I efficiently match many regular expressions at once? 
Author Message
 FAQ: How do I efficiently match many regular expressions at once?

This message is one of several periodic postings to comp.lang.perl.misc
intended to make it easier for perl programmers to find answers to
common questions. The core of this message represents an excerpt
from the documentation provided with every Standard Distribution of
Perl.

+
  How do I efficiently match many regular expressions at once?

    The following is extremely inefficient:

        # slow but obvious way

        while (defined($line = <>)) {

                if ($line =~ /\b$state\b/i) {  
                    print $line;
                    last;
                }
            }
        }                                        

    That's because Perl has to recompile all those patterns for each of the
    lines of the file. As of the 5.005 release, there's a much better
    approach, one which makes use of the new "qr//" operator:

        # use spiffy new qr// operator, with /i flag even
        use 5.005;


        while (defined($line = <>)) {

                print $line if $line =~ /$patobj/;
            }
        }

-

Documents such as this have been called "Answers to Frequently
Asked Questions" or FAQ for short.  They represent an important
part of the Usenet tradition.  They serve to reduce the volume of
redundant traffic on a news group by providing quality answers to
questions that keep coming up.

If you are some how irritated by seeing these postings you are free
to ignore them or add the sender to your killfile.  If you find
errors or other problems with these postings please send corrections
or comments to the posting email address or to the maintainers as
directed in the perlfaq manual page.

Answers to questions about LOTS of stuff, mostly not related to
Perl, can be found by pointing your news client to


or to the many thousands of other useful Usenet news groups.

Note that the FAQ text posted by this server may have been modified
from that distributed in the stable Perl release.  It may have been
edited to reflect the additions, changes and corrections provided
by respondents, reviewers, and critics to previous postings of
these FAQ. Complete text of these FAQ are available on request.

The perlfaq manual page contains the following copyright notice.

  AUTHOR AND COPYRIGHT

    Copyright (c) 1997-1999 Tom Christiansen and Nathan
    Torkington.  All rights reserved.

This posting is provided in the hope that it will be useful but
does not represent a commitment or contract of any kind on the part
of the contributers, authors or their agents.

                                                           06.16
--
    This space intentionally left blank



Wed, 17 Nov 2004 19:17:03 GMT  
 FAQ: How do I efficiently match many regular expressions at once?


:    The following is extremely inefficient:

:    That's because Perl has to recompile all those patterns for each of the
:    lines of the file.

:As of the 5.005 release, there's a much better
:    approach, one which makes use of the new "qr//" operator:

Could someone compare these two approaches with a third?



        while (defined($line = <>)) {
           print $line if $line =~ /$poppat/o;
        }

The above would not work if popstates could change, but this
hybrid variation should work for that:



        $poppatqr = qr/$poppat/;
        while (defined($line = <>)) {
           print $line if $line =~ $poppatqr;
        }

If perl re used DFA's, then in theory |'ing a series together would
be at least as efficient as testing each individually (and would be
more efficient where there were common prefixes). I've read, though,
that perl's re matcher is significantly different internally: could |'ing
a series end up worse than testing each in turn? I assume for this
purpose that the re's do not contain {}'s that could modify the
meaning of the other re's.



Fri, 19 Nov 2004 21:53:09 GMT  
 FAQ: How do I efficiently match many regular expressions at once?

Quote:

> If perl re used DFA's, then in theory |'ing a series together would
> be at least as efficient as testing each individually (and would be
> more efficient where there were common prefixes). I've read, though,
> that perl's re matcher is significantly different internally: could |'ing
> a series end up worse than testing each in turn? I assume for this
> purpose that the re's do not contain {}'s that could modify the
> meaning of the other re's.

It would depend on the relative sizes of the patterns and targets.  If the
targets were significantly longer than the sum of the lengths of the
patterns, than the |'d method would surely win.  If the targets were of
similar size to the patterns, then it would be a tossup, but the |'ing
solution might still win because the rescanning would be happening inside
the regex engine rather than in perl's dispatch loop.  If the targets were
on average shorter than the patterns, then the testing in turn would
probably be faster, although in that (rather pathological) case it would
probably be worth it to test the target's length and skip the matching
altogether if it were shorter than the shortest pattern.


Sat, 20 Nov 2004 10:16:45 GMT  
 FAQ: How do I efficiently match many regular expressions at once?

:> If perl re used DFA's, then in theory |'ing a series together would
:> be at least as efficient as testing each individually (and would be
:> more efficient where there were common prefixes). I've read, though,
:> that perl's re matcher is significantly different internally: could |'ing
:> a series end up worse than testing each in turn?

:It would depend on the relative sizes of the patterns and targets.  If the
:targets were significantly longer than the sum of the lengths of the
:patterns, than the |'d method would surely win.  If the targets were of
:similar size to the patterns, then it would be a tossup, but the |'ing
:solution might still win because the rescanning would be happening inside
:the regex engine rather than in perl's dispatch loop.  If the targets were
:on average shorter than the patterns, then the testing in turn would
:probably be faster, although in that (rather pathological) case it would
:probably be worth it to test the target's length and skip the matching
:altogether if it were shorter than the shortest pattern.

Thanks for replying.

Unfortunately I haven't been able to figure yet how the relative
length of targets and patterns would have that effect.

I don't immediately see why, for example, /a{9}/ would be
more efficient than /aaaaaaaaa/ in matching a string of a's --
which is an effect that would be implied by discussions of
the relative sizes of the patterns the the targets. There are
short patterns that are much more expensive to match than
some substantially longer ones.



Sun, 21 Nov 2004 03:26:20 GMT  
 
 [ 4 post ] 

 Relevant Pages 

1. FAQ 6.17 How do I efficiently match many regular expressions at once?

2. FAQ 6.17 How do I efficiently match many regular expressions at once?

3. FAQ 6.17 How do I efficiently match many regular expressions at once?

4. FAQ 6.17 How do I efficiently match many regular expressions at once?

5. FAQ 6.17 How do I efficiently match many regular expressions at once?

6. FAQ 6.17 How do I efficiently match many regular expressions at once?

7. FAQ 6.17 How do I efficiently match many regular expressions at once?

8. FAQ: How do I efficiently match many regular expressions at once?

9. FAQ: How do I efficiently match many regular expressions at once?

10. FAQ: How do I efficiently match many regular expressions at once?

11. FAQ: How do I efficiently match many regular expressions at once?

12. FAQ 6.16: How do I efficiently match many regular expressions at once?

 

 
Powered by phpBB® Forum Software