On Sun, 17 Oct 1999 12:26:49 -0400,
Quote:
> I am looking for a high performance way to substitute many
> strings in one pass over the input. Suppose I want to encode
What if it is more efficient to do it in multiple passes? Would you
believe that it's faster to remove trailing _and_ leading spaces from
a string in two passes than in one pass? At least if you do it with a
regexp.
Just to make you aware of the fact that one pass is not necessarily
faster than two or more passes. If your single pass involves something
that's of a higher order than O(N), but multiple passes would all be
of O(N), then you lose out when you use the single pass approach for
large N.
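As a rough illustration of the trimming case, here is a sketch of both variants. The sub names are made up for this example. Both give the same result; which one is faster on your perl and your data is something to measure (the standard Benchmark module can do that) rather than assume.

```perl
use strict;
use warnings;

# Two passes: each substitution is anchored, so each is a simple scan.
sub trim_two_pass {
    my ($s) = @_;
    $s =~ s/^\s+//;    # strip leading whitespace
    $s =~ s/\s+$//;    # strip trailing whitespace
    return $s;
}

# One pass: a single substitution with an alternation and /g.
sub trim_one_pass {
    my ($s) = @_;
    $s =~ s/^\s+|\s+$//g;
    return $s;
}

print "[", trim_two_pass("   some text  "), "]\n";
print "[", trim_one_pass("   some text  "), "]\n";
```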
Quote:
> a string for use as a value in HTML or XML, e.g.
> s/\</\&lt/g;
> s/\>/\&gt/g; // useful even when not strictly necessary
> s/\"/&quot/g;
> Currently this requires three passes over the data.
> If tr/// allowed strings on the right hand side
> (by somehow extending its notation) this could be done in one pass,
> which would be about 3 times faster for long input strings.
Those things you're replacing with are probably not what you want.
(See below) Apart from their syntax being wrong, you might consider
using the numeric character references. They're slightly more
portable. (www.w3.org)
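use strict;
use warnings;
# Map each special character to its entity.
my %repl = (
    '<' => '&lt;',
    '>' => '&gt;',
    '"' => '&quot;',
);
# The keys, joined, form the body of a character class.
my $repl_keys = join "", keys %repl;
my $str = 'Just a string, <with> some "extra" stuff in it';
# One pass: replace every character in the class with its entity.
$str =~ s/([$repl_keys])/$repl{$1}/g;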
Be careful when the keys contain one of the characters that becomes
special in a character class.
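One way to guard against that, sketched here, is to run each key through quotemeta before joining. The extra keys (^, ] and -) and their replacement strings are invented purely to show characters that would otherwise break the class; they are not real entities.

```perl
use strict;
use warnings;

my %repl = (
    '<' => '&lt;',
    '>' => '&gt;',
    '"' => '&quot;',
    '^' => '[caret]',      # hypothetical replacement, for illustration
    ']' => '[rbracket]',   # hypothetical replacement, for illustration
    '-' => '[dash]',       # hypothetical replacement, for illustration
);

# quotemeta backslashes every non-word character, so ^, ] and - lose
# their special meaning inside the character class.
my $class = join '', map { quotemeta } keys %repl;

my $str = 'a<b ^ c]-"d"';
(my $out = $str) =~ s/([$class])/$repl{$1}/g;

print "$out\n";
```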
Quote:
> Similarly by allowing strings on the left hand side of tr/// the
> inverse translation would be faster.
> We must be careful when two strings overlap, e.g. $lt vs. $lte.
You mean &lt; versus &lte;? You keep forgetting that these things are
supposed to be terminated. And then of course we have regexp boundary
assertions like \b.
Quote:
> The usual solution is the "maximal munch" rule --longest match wins.
Not necessarily. The normal way is to bind it to some common
terminating or delimiting character, like the mandatory ';', or a \b,
or a \s or something. In _most_ cases there will be something like
that, because otherwise parsers would have a damn hard time
distinguishing between the various possibilities.
Another way that's often used is to 'or' all the possibilities
together.
# perldoc perlfaq6
How do I efficiently match many regular expressions at once?
You could reverse the hash of the above example (swap keys and values)
to get all of the possible matches.
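A minimal sketch of that idea, assuming the named entities &lt;, &gt; and &quot; as the encoded forms. The names %decode and $alt are made up for this example. Sorting the alternatives longest first only matters once one alternative is a prefix of another.

```perl
use strict;
use warnings;

my %repl = (
    '<' => '&lt;',
    '>' => '&gt;',
    '"' => '&quot;',
);

# Swap keys and values: entity => character.
my %decode = reverse %repl;

# 'or' all the entities together, longest first, each one quoted.
my $alt = join '|',
          map  { quotemeta }
          sort { length($b) <=> length($a) }
          keys %decode;

my $str = 'Just a string, &lt;with&gt; some &quot;extra&quot; stuff';
$str =~ s/($alt)/$decode{$1}/g;

print "$str\n";
```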
Quote:
> But now comes the hard part -- we also needed
> s/\&/&amp/g;
There is a falw^Wflaw in your logic. The stuff that goes in is
supposed to be plain text, right? The stuff that comes out should be
the XML or HTML representation of that text, right? That means that if
there is _any_ literal '&' in your input, it needs to be translated
into &amp;, even if it is part of something like &lt;. Multiple
levels of interpretation require multiple levels of escapement.
How else are you going to be able to translate something in one
direction four times, and then get back to the original?
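To make the round trip concrete, here is a sketch with '&' included in the table. The sub names encode_text and decode_text are invented for this example. Because each substitution is a single pass and never rescans its own output, every level of encoding is undone by exactly one level of decoding.

```perl
use strict;
use warnings;

my %encode = (
    '&' => '&amp;',
    '<' => '&lt;',
    '>' => '&gt;',
    '"' => '&quot;',
);
my %decode     = reverse %encode;
my $decode_alt = join '|', map { quotemeta } keys %decode;

# Plain text in, markup-safe text out. A literal '&' is always escaped.
sub encode_text {
    my ($s) = @_;
    $s =~ s/([&<>"])/$encode{$1}/g;
    return $s;
}

# Undo exactly one level of encoding.
sub decode_text {
    my ($s) = @_;
    $s =~ s/($decode_alt)/$decode{$1}/g;
    return $s;
}

my $plain = 'Write &lt; to get <';
my $once  = encode_text($plain);   # 'Write &amp;lt; to get &lt;'
my $twice = encode_text($once);    # 'Write &amp;amp;lt; to get &amp;lt;'

print "$once\n$twice\n";
print decode_text(decode_text($twice)), "\n";   # back to $plain
```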
Quote:
> Q1. Has there been any thought to extending tr/// to allow strings on
> either or both sides, rather than just simple characters?
Nope. That's not what tr/// is for. I'm sure it comes up now and
again, but I'm also sure that people are quite unwilling to change it.
Regular expressions can do what you want them to do. You just don't
seem to know enough about them yet.
Quote:
> Q2. Is there a good way to accomplish multiple substitutions in one
> pass in native Perl?
Err... native Perl? Regular expressions do the job.
See above, see perlre, see perlfaq6.
Quote:
> Q3. Is there any XS implementations that do this in one pass? I could
> live with the hard coded constants, known for HTML, XML, and XHTML.
> But I would like general support for this as the need occurs in many
> other contexts.
Why do you think XS would be faster than the builtin regular
expressions?
Most of this is a bit of a moot point anyway. As long as you're only
replacing characters with escaped versions, and back, this may work
(that is, if you do it _correctly_), but any other way of working with
HTML, XML or SGML cannot be done with regular expressions. You will
have to use a parser. HTML::Parser is a reasonable place to start,
although Abigail will tell you that it doesn't parse HTML. There are
many other HTML::* modules on CPAN. I am pretty sure that there is one
that does what you need it to do.
Martien
--
Martien Verbruggen |
Interactive Media Division | Unix is user friendly. It's just selective
Commercial Dynamics Pty. Ltd. | about its friends.
NSW, Australia |