RegExp Char Class 

Hi,

I am a little puzzled by what I would expect to be a simple and
successful match. I have a string containing two html anchor tags. The
text of the two tags differs and I am trying to match the second anchor
tag in this particular case with the word NEXT.

Below is some test code. I am trying to collect the href attribute
value from the second anchor tag knowing only the text of the anchor
tag is the word "NEXT", and the order of the attributes (href first,
title second).

I would expect all three of these to match and return 456 as $1.
But only the last one matches successfully. The first one does not
match at all and the second one matches the wrong text.

Am I doing something wrong with the negated character class in
REGEXP1?
Is the non-greedy match in REGEXP2 wrong?

I don't like having to use REGEXP3, but that is the only pattern
working at this time.

I would appreciate any comments you might have.

#!/usr/local/bin/perl -w
use strict;
#use re 'debug';

my $s = '<a href="123" title="abc">LAST</a><a href="456" title="abc">NEXT</a>';

if ($s =~ /<a\s+href="([^"])"\s+title="[^"]">NEXT<\/a>/) {
        print "REGEXP 1: ", defined $1 ? "matched [$1]" : "failed", "\n";
}

if ($s =~ /<a\s+href="(\S+?)"\s+title=".+?">NEXT<\/a>/) {
        print "REGEXP 2: ", defined $1 ? "matched [$1]" : "failed", "\n";
}

if ($s =~ /<a\s+href="([A-Za-z0-9]+)"\s+title="[A-Za-z0-9\s]+">NEXT<\/a>/) {
        print "REGEXP 3: ", defined $1 ? "matched [$1]" : "failed", "\n";
}

# output:
REGEXP 2: matched [123]
REGEXP 3: matched [456]

Thanks!
Frank



Sun, 26 Dec 2004 21:47:37 GMT  

Quote:

> I am a little puzzled by what I would expect to be a simple and
> successful match.
> Below is some test code. I am trying to collect the href attribute
> value from the second anchor tag knowing only the text of the anchor
                                           ^^^^
> tag is the word "NEXT", and the order of the attributes (href first,
> title second).

Your code embodies more assumptions than just that though.

This is legal HTML, for example:

   <a href = "456" title="abc">NEXT</a>

   <a href='456' title="abc">NEXT</a>

   <a
   href
   =
   "456"
   title
   =
   "abc"
   >NEXT</a
   >

If you do not control the format of the HTML you are processing,
then you should use a module that understands HTML instead of
pattern matching.
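
To see how narrow those assumptions are, here is a quick sketch that runs REGEXP 3 from the original post, unchanged, against the three legal variants listed above. The @variants list and its labels are made up for illustration only.

```perl
use strict;
use warnings;

# REGEXP 3 from the original post, unchanged.
my $re = qr/<a\s+href="([A-Za-z0-9]+)"\s+title="[A-Za-z0-9\s]+">NEXT<\/a>/;

# The three legal HTML variants from this reply (labels are illustrative).
my @variants = (
    [ 'spaces around ='    => '<a href = "456" title="abc">NEXT</a>' ],
    [ 'single quotes'      => q{<a href='456' title="abc">NEXT</a>} ],
    [ 'one token per line' => qq{<a\nhref\n=\n"456"\ntitle\n=\n"abc"\n>NEXT</a\n>} ],
);

my @results;    # 1 if the variant matched, 0 otherwise
for my $v (@variants) {
    my ($label, $html) = @$v;
    my $matched = $html =~ $re ? 1 : 0;
    push @results, $matched;
    printf "%-20s %s\n", $label, $matched ? "matched [$1]" : 'no match';
}
```

None of the three variants matches, even though each is the same link as far as a browser is concerned.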

Quote:
> I would expect all three of these to match and return 456 as $1.
> But only the last one matches successfully. The first one does not
> match at all

You have written the pattern incorrectly.

Quote:
> and the second one matches the wrong text.

"non greedy" really means: prefer shortest possible that still
                           *allows the overall match to succeed*.

It has to steamroller over some angle brackets in order to succeed.
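
A small sketch that makes the steamrolling visible. It is REGEXP 2 with one change that is not in the original post: the .+? in the title is wrapped in parentheses so that it is captured as $2.

```perl
use strict;
use warnings;

# The test string from the original post, on a single line.
my $s = '<a href="123" title="abc">LAST</a><a href="456" title="abc">NEXT</a>';

# REGEXP 2, except that the title's .+? is captured so we can inspect it.
my ($href, $title);
if ($s =~ /<a\s+href="(\S+?)"\s+title="(.+?)">NEXT<\/a>/) {
    ($href, $title) = ($1, $2);
    print "href:  [$href]\n";
    print "title: [$title]\n";
}
```

The href comes from the first anchor, and the "title" runs across the end of the first tag and most of the second one, because that is the shortest stretch after which ">NEXT</a> can follow.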

Quote:
> Am I doing something wrong with the negated character class in
> REGEXP1?

No, but you do not allow the char class to repeat.

Quote:
> Is the non-greedy match in REGEXP2 wrong?

Nope, that is how it is supposed to work. Use [^"]* instead.
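
A minimal sketch of that change, assuming the same single-line test string as the original post and [^"]* used for both attribute values:

```perl
use strict;
use warnings;

my $s = '<a href="123" title="abc">LAST</a><a href="456" title="abc">NEXT</a>';

# [^"]* cannot run past the closing quote, so the first anchor can never
# reach ">NEXT</a> and the engine moves on to the second anchor.
my $href;
if ($s =~ /<a\s+href="([^"]*)"\s+title="[^"]*">NEXT<\/a>/) {
    $href = $1;
    print "matched [$href]\n";
}
```

This still only covers the exact layout in the test string (double quotes, no spaces around the equals sign, href before title).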

Quote:
> if ($s =~ /<a\s+href="([^"])"\s+title="[^"]">NEXT<\/a>/) {
                         ^^^^            ^^^^

Requires exactly one character between the quotes, should
be zero or more instead.
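
To illustrate, here is REGEXP 1 unchanged, run against a made-up anchor whose attribute values are one character each and against the second anchor from the original post. The one-character string is hypothetical.

```perl
use strict;
use warnings;

# REGEXP 1 from the original post, unchanged.
my $re1 = qr/<a\s+href="([^"])"\s+title="[^"]">NEXT<\/a>/;

my $one_char   = '<a href="4" title="x">NEXT</a>';        # hypothetical input
my $multi_char = '<a href="456" title="abc">NEXT</a>';    # from the post

my @got = map { $_ =~ $re1 ? "matched [$1]" : 'no match' }
          ($one_char, $multi_char);

print "one char:   $got[0]\n";
print "multi char: $got[1]\n";
```

Only the one-character version matches, which is why the pattern prints nothing for the real data.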

Quote:
>         print "REGEXP 1: ", defined $1 ? "matched [$1]" :
> "failed","\n";

If control gets inside the if block, then $1 will always be defined
for the pattern you've written. No need for that test.
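
If a "failed" message is wanted, it belongs in an else branch. A sketch, using the [^"]* form suggested earlier in this reply; the $status variable is just for illustration:

```perl
use strict;
use warnings;

my $s = '<a href="123" title="abc">LAST</a><a href="456" title="abc">NEXT</a>';

my $status;
if ($s =~ /<a\s+href="([^"]*)"\s+title="[^"]*">NEXT<\/a>/) {
    # The match succeeded and group 1 took part in it, so $1 is defined
    # here (it may be the empty string, which is still defined).
    $status = "matched [$1]";
}
else {
    # Reached only when the whole pattern failed to match.
    $status = 'failed';
}
print "REGEXP 1: $status\n";
```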

--
    Tad McClellan                          SGML consulting

    Fort Worth, Texas



Mon, 27 Dec 2004 00:17:52 GMT  


When I execute your script with my copy of ActivePerl 5.6.1, REGEXP 2 and
REGEXP 3 both match 456. When I insert "+" after both negated classes in
REGEXP 1, it also matches 456.
I have no idea why you have a problem with REGEXP 2.
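
One possible explanation, which is a guess and not something confirmed anywhere in this thread: if the script is pasted from a line-wrapped copy of the post, the string ends up with a newline between "456" and title=. Without the /s modifier, "." does not match a newline, so the .+? in REGEXP 2 can no longer run from the first anchor into the second. The sketch below compares the two cases; the variable names are illustrative.

```perl
use strict;
use warnings;

# REGEXP 2 from the original post, unchanged.
my $re2 = qr/<a\s+href="(\S+?)"\s+title=".+?">NEXT<\/a>/;

# The same data on one line, and with a newline where the post was wrapped.
my $one_line = '<a href="123" title="abc">LAST</a><a href="456" title="abc">NEXT</a>';
my $wrapped  = qq{<a href="123" title="abc">LAST</a><a href="456"\ntitle="abc">NEXT</a>};

my %got;
for my $case ([ one_line => $one_line ], [ wrapped => $wrapped ]) {
    my ($name, $str) = @$case;
    $got{$name} = $str =~ $re2 ? $1 : undef;
    printf "%-8s => %s\n", $name,
        defined $got{$name} ? "matched [$got{$name}]" : 'no match';
}
```

On the one-line string REGEXP 2 captures 123; on the wrapped string it captures 456, which would account for the different results.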

Bill



Mon, 27 Dec 2004 00:53:44 GMT  

Quote:

> Hi,

> I am a little puzzled by what I would expect to be a simple and
> successful match. I have a string containing two html anchor tags. The
> text of the two tags differ and I am trying to match the second anchor
> tag in this particular case with the word NEXT.

> Below is some test code. I am trying to collect the href attribute
> value from the second anchor tag knowing only the text of the anchor
> tag is the word "NEXT", and the order of the attributes (href first,
> title second).

> I would expect all three of these to match and return 456 as $1.
> But only the last one matches successfully. The first one does not
> match at all and the second one matches the wrong text.

> Am I doing something wrong with the negated character class in
> REGEXP1?

You need * or + after the class to match more than one char.

Quote:
> Is the non-greedy match in REGEXP2 wrong?

Hm, non-greedy means try the shorter match first, not try the last match first.
Perhaps it should be considered a bug? I have not thought about it
before. The .+? in the title matches everything from abc">LAST...
to the second title. This is because a shorter interpretation (stopping
at the first ") fails at LAST, which is not NEXT. You must force
a failure somewhere, e.g., use \S+? instead; that will make the
pattern fail when it sees the space after <a. Then it will backtrack
and try the next match for the initial part of the pattern.
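
A sketch of that suggestion, assuming the \S+? is meant to replace the .+? inside the title attribute (my reading of the advice; the rest of REGEXP 2 is unchanged):

```perl
use strict;
use warnings;

my $s = '<a href="123" title="abc">LAST</a><a href="456" title="abc">NEXT</a>';

# \S+? cannot cross the space in "<a href", so an attempt that starts at the
# first anchor fails and the engine retries from later positions.
my $href;
if ($s =~ /<a\s+href="(\S+?)"\s+title="\S+?">NEXT<\/a>/) {
    $href = $1;
    print "matched [$href]\n";
}
```

Note that this variant stops working as soon as a title contains a space, so the negated class [^"]* is the more robust way to force the failure.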

/Enrique




Mon, 27 Dec 2004 23:29:27 GMT  

Quote:


>> Is the non-greedy match in REGEXP2 wrong?

> Hm, non-greedy means try shorter match first, not try last match first.
> Perhaps it should be considered a bug?

It is a feature, not a bug.

Quote:
> I have not thought about it
> before. The .+? in the title matches everything from abc">LAST...
> to the second title. This is because a shorter interpretation (stopping
> at the first ") fails at LAST, which is not NEXT.

   perldoc perlre

(my emphasis)
----------------------------------------------------
By default, a quantified subpattern is "greedy", that is, it will match as
many times as possible (given a particular starting location) while still
                                                                    ^^^^^
allowing the rest of the pattern to match.  If you want it to match the
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
minimum number of times possible, follow the quantifier with a "?".  Note
that the meanings don't change, just the "greediness":
     ^^^^^^^^^^^^^^^^^^^^^^^^^
----------------------------------------------------

Quote:
>> if ($s =~ /<a\s+href="(\S+?)"\s+title=".+?">NEXT<\/a>/) {
>>         print "REGEXP 2: ", defined $1 ? "matched [$1]" :
>> "failed","\n";

The match operator's prime directive is to try to match.

Only if the match succeeds does "greediness" come into play.

Greediness never changes whether a match will pass or fail,
it only changes which characters will be matched.
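
A tiny sketch of that last point. The sample strings are made up; the two patterns differ only in the "?" after the quantifier.

```perl
use strict;
use warnings;

my $text = 'say "a" then "b"';    # hypothetical sample input

# Both patterns succeed on $text; only the captured characters differ.
my ($greedy) = $text =~ /"(.+)"/;     # longest stretch that still matches
my ($lazy)   = $text =~ /"(.+?)"/;    # shortest stretch that still matches

print "greedy: [$greedy]\n";
print "lazy:   [$lazy]\n";

# On input with no quotes at all, both patterns fail alike.
my $none        = 'no quotes here';
my $greedy_hits = $none =~ /"(.+)"/  ? 1 : 0;
my $lazy_hits   = $none =~ /"(.+?)"/ ? 1 : 0;
print "no quotes: greedy=$greedy_hits lazy=$lazy_hits\n";
```

The greedy form captures a" then "b and the non-greedy form captures a, but whenever one of them can match, so can the other.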

--
    Tad McClellan                          SGML consulting

    Fort Worth, Texas



Tue, 28 Dec 2004 00:07:16 GMT  
 