
Efficiency of some regexp patterns
Quote:
>I don't know how class negation and the "?" modififier work behind the
>scenes. Is it quicker to match against this kind of pattern
> <foo[^>]+>
>or against this version of almost the same thing
> <foo.+?>
I have problems answering this question. Even by looking at it
carefully and reading the Camel Book page 63, it's hard to decide.
There's a comparison made for every characters until it matches a > in
both patterns, but the difference between minimal and greedy matching
makes it tough to decide. Will the engine try many matches in the
string with the first pattern? That would make the difference and I
would tend to say that the second pattern is faster, at least for
longer strings with many tags. Any Perl expert here to give us an
insight about greedy matching?
Quote:
>Why negation of classes in Unicode is supposed to be slow in general?
>Isn't [^c] tested somehow as $current_char ne "c"? Why does a broader
>character set matter?
[^c] is the original character set minus "c". If our original set is
Unicode, our new set is big: the cardinality of the Unicode set - 1.
wrong here, but I think that the test will look more like:
if ($current_char eq $char) { ... }
Quote:
}
Is the pattern matching algorithm optimized to look at the new set and
decide if it would be faster to use eq with the new set or ne with the
negation of the new set? I don't know. But I'm sure that if your
character set is composed of half the characters of the Unicode set,
then it's going to be slower.
Sbastien Nadeau
Technicien en informatique
Bibliothque de l'Universit Laval