Substituting many strings in one pass over the input 
Author Message
 Substituting many strings in one pass over the input

Quote:

> I am looking for a high performance way to substitute many
> strings in one pass over the input.  Suppose I want to encode
> a string for use as a value in HTML or XML, e.g.
> s/\</\&lt/g;
> s/\>/\&gt/g;  // useful even when not strictly necessary
> s/\"/&quot/g;

Too many backwhacks! But surely you want something other than
HTML::Entities, which you can get from CPAN?

    http://www.*-*-*.com/

But the more general case could be done something like this.
(HTML::Entities does something like this, too.)

    s/([<>"&])/&$table{$1}/g;

Quote:
> Q1.  Has there been any thought to extending tr/// to allow strings on
> either or both sides, rather than just simple characters?

I don't think this is something which should be an extension to tr///. But
maybe I just don't understand what you want.

Cheers!

--
Tom Phoenix       Perl Training and Hacking       Esperanto
Randal Schwartz Case:     http://www.*-*-*.com/



Thu, 04 Apr 2002 03:00:00 GMT  
 Substituting many strings in one pass over the input
Mailed & posted.


Quote:

>s/\</\&lt/g;
>s/\>/\&gt/g;  // useful even when not strictly necessary
>s/\"/&quot/g;
>Currently this requires three passes over the data.
[snip]
>Q2.  Is there a good way to accomplish multiple substitutions in one
>pass in native Perl?

In one script I use the following:

BEGIN {
    %subst = ('<' => '&lt;', '>' => '&gt;', '&' => '&amp;',
              "\x0a" => '',  "\x0c" => '',  "\x0d" => '');

Quote:
}

# Put in some entities and get rid of CRs and FFs.

s:([\x0a\x0c\x0d<>&]):$subst{$1}:g;

Of course, this assumes there aren't any entities already in the text.
If there are, you'll have to do some lookahead stuff.



Thu, 04 Apr 2002 03:00:00 GMT  
 Substituting many strings in one pass over the input
On Sun, 17 Oct 1999 12:26:49 -0400,

Quote:
> I am looking for a high performance way to substitute many
> strings in one pass over the input.  Suppose I want to encode

What if it is more efficient to do it in multiple passes? Would you
believe that it's faster to remove trailing _and_ leading spaces from
a string in two passes than in one pass? At least if you do it with a
regexp.

Just to make you aware of the fact that one pass is not necessarily
faster than two or more passes. If your single pass involves something
that's of a higher order than O(N), but multiple passes would all be
of O(N), then you lose out when you use the single pass approach for
large N.

Quote:
> a string for use as a value in HTML or XML, e.g.
> s/\</\&lt/g;
> s/\>/\&gt/g;  // useful even when not strictly necessary
> s/\"/&quot/g;
> Currently this requires three passes over the data.
> If tr/// allowed strings on the right hand side
> (by somehow extending its notation) this could be done in one pass,
> which would be about 3 time faster for long input strings.

Those things you're replacing with are probably not what you want.
(See below) Apart from their syntax being wrong, you might consider
using the numeric character references. They're slightly more
portable. (www.w3.org)

use strict;

my %repl = (
        '<'          => '&lt;',
        '>'          => '&gt;',
        '"'                => '&quot;',
);
my $repl_keys = join "", keys %repl;

my $str = 'Just a string, <with> some "extra" stuff in it';

$str =~ s/([$repl_keys])/$repl{$1}/g;

Be careful when the keys contain one of the characters that becomes
special in a character class.

Quote:
> Similarly by allowing strings on the left hand side of tr/// the
> inverse translation would be faster.  
> We must be careful when two string overlaps, e.g $lt vs. $lte.  

You mean &lt; versus &lte;? You keep forgetting that these things are
supposed to be terminated. And then of course we have re boundary
things like \b.

Quote:
> The usual solution is the "maximal munch" rule --longest match wins.

Not necessarily. The normal way is to bind it to some common
terminating or delimiting character, like the mandatory ';', or a \b,
or a \s or something. In _most_ cases there will be something like
that, because otherwise parsers would have a damn hard way
distinguishing between various possibilities.

Another way that's often used is to 'or' all the possibilities
together.

# perldoc perlfaq6
     How do I efficiently match many regular expressions at once?

You could revert the hash of the above example to get to all of the
possible matches.

Quote:
> But now comes the hard part -- we also needed
> s/\&/&amp/g;

There is a falw^Wflaw in your logic. The stuff that goes in is
supposed to be plain text, right? The stuff that comes out should be
the XML or HTML representation of that text, right? That means that if
there is _any_ literal '&' in your input, it needs to be translated
into &amp;, even if it is part of something like &amp;. Multiple
levels of interpretation require multiple levels of escapement.

How else are you going to be able to translate something in one
direction four times, and then get back to the original?

Quote:
> Q1.  Has there been any thought to extending tr/// to allow strings on
> either or both sides, rather than just simple characters?

Nope. That's not what tr// is for. I'm sure it comes up now and
again, but I'm also sure that people are quite unwilling to change it.
Regular expressions can do what you want them to do. You just don't
seem to know enough about them yet.

Quote:
> Q2.  Is there a good way to accomplish multiple substitutions in one
> pass in native Perl?

Err.. native Perl? regular Expressions do the job..

see above, see perlre, see perlfaq6

Quote:
> Q3.  Is there any XS implementations that do this in one pass?  I could
> live with the hard coded constants, known for HTML, XML, and XHTML.  
> But I would like general support for this as the need occurs in many
> other contexts.

Why do you think XS would be faster than the builtin regular
expressions?

Most of this is a bit of a moot point anyway. As long as you're only
replacing characters with escaped versions, and back, this may work
(that is, if you do it _correctly_), but any other way of working with
HTML, XML or SGML cannot be done with regular expressions. You will
have to use a parser. HTML::Parser is a possible reasonable go at it,
although Abigail will tell you that that doesn't parse HTML. There are
many other HTML::* modules on CPAN. I am pretty sure that there is one
that does what you need it to do.

Martien
--
Martien Verbruggen              |
Interactive Media Division      | Unix is user friendly. It's just selective
Commercial Dynamics Pty. Ltd.   | about its friends.
NSW, Australia                  |



Thu, 04 Apr 2002 03:00:00 GMT  
 Substituting many strings in one pass over the input
: I am looking for a high performance way to substitute many
: strings in one pass over the input.  

   Use a hash to define the replacement pairs, and build
   a regex of the hash's keys.

: Suppose I want to encode
: a string for use as a value in HTML or XML, e.g.
: s/\</\&lt/g;
: s/\>/\&gt/g;  // useful even when not strictly necessary
: s/\"/&quot/g;

my %char2entity = qw(
   <  &lt;
   >  &gt;
   "  &quot;
   &  &amp;
);
my $charRE = join '|', reverse sort keys %char2entity;

s/($charRE)/$char2entity{$1}/go;

: Currently this requires three passes over the data.
: If tr/// allowed strings on the right hand side

   s///g allows strings on the RHS  :-)

--
    Tad McClellan                          SGML Consulting

    Fort Worth, Texas



Thu, 04 Apr 2002 03:00:00 GMT  
 Substituting many strings in one pass over the input


Quote:

> : I am looking for a high performance way to substitute many
> : strings in one pass over the input.  

>    Use a hash to define the replacement pairs, and build
>    a regex of the hash's keys.

> : Suppose I want to encode
> : a string for use as a value in HTML or XML, e.g.
> : s/\</\&lt/g;
> : s/\>/\&gt/g;  // useful even when not strictly necessary
> : s/\"/&quot/g;

> my %char2entity = qw(
>    <  &lt;
>    >  &gt;
>    "  &quot;
>    &  &amp;
> );
> my $charRE = join '|', reverse sort keys %char2entity;

There must be some reason why you reverse-sort the keys, or why you sort
them at all.  Is there lexicographic significance to the ordering of
those symbols?

Quote:
> s/($charRE)/$char2entity{$1}/go;

If he's 'looking for a high performance way',  your's is OK, but the
character class is better.

Benchmark: timing 65536 iterations of Class, Or...
     Class:  9 wallclock secs ( 8.60 usr +  0.00 sys =  8.60 CPU)
        Or: 15 wallclock secs (14.22 usr +  0.00 sys = 14.22 CPU)

#!/usr/local/bin/perl -w
use strict;
use Benchmark;

my $x = 'This is an HTML <!-- comment --> & <img src="foo.bar">';
my %char2entity = qw(
   <  &lt;
   >  &gt;
   "  &quot;
   &  &amp;
);
my $charRE = join '|', reverse sort keys %char2entity;
my $charCL = '[' . join("", keys %char2entity) . ']';

timethese(1 << (shift || 0), {
  Class  => sub { (my $y = $x) =~ s/($charCL)/$char2entity{$1}/go },
  Or     => sub { (my $y = $x) =~ s/($charRE)/$char2entity{$1}/go },

Quote:
});

--
(Just Another Larry) Rosler
Hewlett-Packard Laboratories
http://www.hpl.hp.com/personal/Larry_Rosler/



Fri, 05 Apr 2002 03:00:00 GMT  
 Substituting many strings in one pass over the input


: > my %char2entity = qw(
: >    <  &lt;
: >    >  &gt;
: >    "  &quot;
: >    &  &amp;
: > );
: > my $charRE = join '|', reverse sort keys %char2entity;

: There must be some reason why you reverse-sort the keys, or why you sort
: them at all.  Is there lexicographic significance to the ordering of
: those symbols?

   Uhhh, that just springs naturally to my fingers when I
   build up an RE like that. I didn't notice that I don't
   need my standard idiom there.

   With more "normal" (i.e. real strings) hash keys, it makes the
   "left-most greedy" find the longer one (because it is leftmost
   in the sorted order).

   That is, with 'cat' and 'catalina', I want

      /(catalina|cat)/

   rather than

      /(cat|catalina)/

--
    Tad McClellan                          SGML Consulting

    Fort Worth, Texas



Fri, 05 Apr 2002 03:00:00 GMT  
 Substituting many strings in one pass over the input

Quote:

>> But now comes the hard part -- we also needed
>> s/\&/&amp/g;

>There is a falw^Wflaw in your logic. The stuff that goes in is
>supposed to be plain text, right? The stuff that comes out should be
>the XML or HTML representation of that text, right? That means that if
>there is _any_ literal '&' in your input, it needs to be translated
>into &amp;, even if it is part of something like &amp;. Multiple
>levels of interpretation require multiple levels of escapement.

All he is saying is he doesn't want to touch &s introduced by
previous passes. The solution is simple, though--do s/&/&amp;/g
first. (And when decoding, do it last.)


Fri, 05 Apr 2002 03:00:00 GMT  
 Substituting many strings in one pass over the input
On 18 Oct 1999 16:38:51 GMT,

Quote:

> >> But now comes the hard part -- we also needed
> >> s/\&/&amp/g;

> >There is a falw^Wflaw in your logic. The stuff that goes in is
> >supposed to be plain text, right? The stuff that comes out should be
> >the XML or HTML representation of that text, right? That means that if
> >there is _any_ literal '&' in your input, it needs to be translated
> >into &amp;, even if it is part of something like &amp;. Multiple
> >levels of interpretation require multiple levels of escapement.

> All he is saying is he doesn't want to touch &s introduced by
> previous passes. The solution is simple, though--do s/&/&amp;/g
> first. (And when decoding, do it last.)

I know what he is saying. I am just remarking that what he is saying
is not necessarily what he wants.

Suppose you have a document that talks about the entities in HTML. It
contains a paragraph like: "To get an ampersand (&), you would insert
&amp; in your HTML." Code that blindly assumes that all escapements
are not to be escaped would translate that to "<P>To get an ampersand
(&amp;) you would insert &amp; in your HTML.</P>", and translating
that back, it would become: "To get an ampersand (&), you would insert
& in your HTML.", which of course is wrong.

HTML normally only gets interpreted once, but in the XML world it's
not uncommon to have multiple levels of interpretation, which is why
the spec actually talks about that. Software that tries to do this
sort of thing should do this correctly.

The best way of avoiding unwanted translations is to not feed the
stuff to your trabslator in the first place. Assuming that people who
type '&amp;' in a string actually meant '&' is odd behaviour. Or are
you going to tell people that if they really want &amp; to appear that
they should type &amp;amp;?

If he indeed wants that sort of behaviour, he's free to do that. It's
trivial not to replace already escaped sequences. Your solution,
however, doesn't cut it.

Martien
--
Martien Verbruggen                      |
Interactive Media Division              | "In a world without fences,
Commercial Dynamics Pty. Ltd.           |  who needs Gates?"
NSW, Australia                          |



Fri, 05 Apr 2002 03:00:00 GMT  
 
 [ 8 post ] 

 Relevant Pages 

1. Default Buttons

2. signal processing literature/algorithms

3. split and substitute, substitute, substitute

4. Substituting n occurences of a char with one

5. Substitute across lines in "one liner"

6. Regex problem: substituting 3+ identical characters with just one

7. question: How does one substitute into something other $_?

8. Simple Sprites

9. midas

10. Packing multiple strings into one string

11. Substituting many sparsely distrbuted strings in many files

12. substituting strings in source trees?

 

 
Powered by phpBB® Forum Software