Need help substituting text except when in an HTML anchor

I want to be able to substitute for text in HTML that's matched only
when it doesn't appear in an anchor. For example, given

blah blah foo blah <em>foo</em> blah <a href="foo.html">foo</a> blah blah

I'd want the first and second instances of the word "foo" to match, and
be substituted for, but not the third or fourth (because they're in the
anchor). Notice that foo can be in another element, as with EM in the
example above.

My reason for this requirement is that the substitution I want to do is
to make an occurrence of the sought-after string into an anchor,
replacing, for example, 'foo' with '<a href="foo.html">foo</a>'. The
problem is I want to do this in a situation where the program can be run
again and again with the same input text, so I don't want text that's
already been made an anchor to be modified again. (I.e., I don't want

<a href="<a href="foo.html">foo</a>.html"><a href="foo.html">foo</a></a>

the second time through.)
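
To make the failure concrete, here is a minimal sketch of the naive approach in core Perl. The name naive_link is hypothetical and only for illustration; it is not part of any module.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: blindly wrap every whole-word "foo" in an anchor.
# It knows nothing about HTML structure, so it also rewrites "foo"
# inside attributes and inside existing anchors.
sub naive_link {
    my ($html) = @_;
    $html =~ s{\bfoo\b}{<a href="foo.html">foo</a>}g;
    return $html;
}

my $once  = naive_link('foo');
my $twice = naive_link($once);

print "$once\n";     # first pass: a single, well-formed anchor
print "$twice\n";    # second pass: anchors nested inside the href and the link text
```

The second pass produces exactly the nested markup shown above, which is why a plain s/// is not safe to run repeatedly.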

I hope I've made it clear what I'm after. Any help would be appreciated.

--
Cole Robison
Software Training Specialist
Academic Computing Services, The University of Kansas

Mon, 21 Apr 2003 07:11:42 GMT

Hey, here is a copy of a Perl script that replaces words in the text of an
HTML file but leaves the tags themselves (such as <a href=...>) alone.
It's probably close to what you want, but of course you need to change what
it does with each word (i.e., not replace but extract, print, or something
else).

changewords.pl
--------------------
#!/usr/bin/perl -w
# changewords.pl - make substitutions in normal text of HTML files

sub usage { die "Usage: $0 <from> <to> <file> ...\n" }

my $from = shift or usage;
my $to = shift or usage;

# Build the HTML::Filter subclass to do the substituting.

package MyFilter;

use base qw(HTML::Filter);
use HTML::Entities qw(decode_entities encode_entities);

# Called for every run of text between tags. Note that this includes
# text inside <a>...</a>; only the tag markup itself is left untouched.
sub text {
    my $self = shift;
    my $text = decode_entities($_[0]);
    $text =~ s/\Q$from/$to/go;    # most important line
    $self->SUPER::text(encode_entities($text));
}

# now use the class

package main;

# Filter each file named on the command line; output goes to STDOUT.
foreach (@ARGV) {
    MyFilter->new->parse_file($_);
}

Sun, 20 Apr 2003 18:32:02 GMT

You probably need to use the HTML::Parser (or, more likely,
HTML::Filter) module for this.  It really is too complicated to
roll your own solution.

I've needed to practice this, so here you go (needs some error
checking):

#!/usr/local/bin/perl
use strict;
use warnings;

package FilterTextNoAnchors;
use base qw/HTML::Filter/;

sub new {
    my $invocant = shift;
    my $filter   = shift;    # sub ref that edits $_ in place
    my $class    = ref $invocant || $invocant;

    my $self = $class->SUPER::new(@_);
    $self->{filter} = $filter;
    return $self;
}

# Run the filter on text only when we are not inside an anchor.
sub text {
    my $self = shift;
    local $_ = shift;
    $self->{filter}->() unless $self->{anchor};
    $self->output($_);
}

# Count anchor nesting, then let HTML::Filter echo the tag unchanged.
sub start {
    my $self = shift;
    $self->{anchor}++ if $_[0] eq "a";
    $self->SUPER::start(@_);
}

sub end {
    my $self = shift;
    $self->{anchor}-- if $_[0] eq "a";
    $self->SUPER::end(@_);
}

package main;
my $p = FilterTextNoAnchors->new(sub { s/\bfoo\b/bar/ig } );
$p->parse( join "", <DATA> );
$p->eof;

__DATA__
blah blah foo blah <em>foo</em> blah <a href="foo.html">foo</a> blah blah

Hope that helps you as much as it helped me to write it!
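
For the linking case described in the original question, the same idea can be sketched with the handler-style (version 3) interface of HTML::Parser instead of subclassing HTML::Filter. This is only a sketch under stated assumptions: link_word is a hypothetical helper, the word is assumed to be safe to use as a file name without escaping, and the result is collected in a string rather than printed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: wrap each whole-word $word in a link to "$word.html",
# leaving alone any text that is already inside an <a> element.
sub link_word {
    my ($html, $word) = @_;
    my $out   = '';
    my $depth = 0;    # number of <a> elements currently open

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $text) = @_;
            $depth++ if $tag eq 'a';
            $out .= $text;                 # echo the tag as written
        }, 'tagname, text' ],
        end_h => [ sub {
            my ($tag, $text) = @_;
            $depth-- if $tag eq 'a' && $depth > 0;
            $out .= $text;
        }, 'tagname, text' ],
        text_h => [ sub {
            my ($text) = @_;
            # Only substitute when we are outside every anchor.
            $text =~ s{\b(\Q$word\E)\b}{<a href="$word.html">$1</a>}g
                unless $depth;
            $out .= $text;
        }, 'text' ],
        # Comments, declarations and the like pass through untouched.
        default_h => [ sub { $out .= $_[0] }, 'text' ],
    );
    $p->parse($html);
    $p->eof;
    return $out;
}

my $in = 'blah blah foo blah <em>foo</em> blah <a href="foo.html">foo</a> blah blah';
my $once = link_word($in, 'foo');
print "$once\n";
print link_word($once, 'foo'), "\n";    # a second pass should change nothing
```

Because text inside an existing anchor is skipped and attribute values never reach the text handler, running the helper on its own output should leave it unchanged, which is the repeatability the original question asks for.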

--

Mon, 21 Apr 2003 07:49:29 GMT
