Author |
Message |
Frank Scon #1 / 5
|
 RegExp Char Class
Hi,

I am a little puzzled by what I would expect to be a simple and successful match. I have a string containing two HTML anchor tags. The text of the two tags differs, and I am trying to match the second anchor tag, in this particular case the one with the word NEXT.

Below is some test code. I am trying to collect the href attribute value from the second anchor tag knowing only that the text of the anchor tag is the word "NEXT", and the order of the attributes (href first, title second).

I would expect all three of these to match and return 456 as $1. But only the last one matches successfully. The first one does not match at all and the second one matches the wrong text.

Am I doing something wrong with the negated character class in REGEXP1?
Is the non-greedy match in REGEXP2 wrong?
I don't like having to use REGEXP3, but that is the only pattern working at this time.

I would appreciate any comments you might have.

#!/usr/local/bin/perl -w
use strict;
#use re 'debug';

my $s = '<a href="123" title="abc">LAST</a><a href="456" title="abc">NEXT</a>';

if ($s =~ /<a\s+href="([^"])"\s+title="[^"]">NEXT<\/a>/) {
    print "REGEXP 1: ", defined $1 ? "matched [$1]" : "failed","\n";
}

if ($s =~ /<a\s+href="(\S+?)"\s+title=".+?">NEXT<\/a>/) {
    print "REGEXP 2: ", defined $1 ? "matched [$1]" : "failed","\n";
}

if ($s =~ /<a\s+href="([A-Za-z0-9]+)"\s+title="[A-Za-z0-9\s]+">NEXT<\/a>/) {
    print "REGEXP 3: ", defined $1 ? "matched [$1]" : "failed","\n";
}

# output:
REGEXP 2: matched [123]
REGEXP 3: matched [456]

Thanks!
Frank
|
Sun, 26 Dec 2004 21:47:37 GMT |
|
 |
Tad McClell #2 / 5
|
 RegExp Char Class
Quote:
> I am a little puzzled by what I would expect to be a simple and
> successful match.

> Below is some test code. I am trying to collect the href attribute
> value from the second anchor tag knowing only the text of the anchor
                                           ^^^^
> tag is the word "NEXT", and the order of the attributes (href first,
> title second).

Your code embodies more assumptions than just that though.

This is legal HTML, for example:

    <a href = "456" title="abc">NEXT</a>
    <a href='456' title="abc">NEXT</a>
    <a href = "456" title = "abc" >NEXT</a >

If you do not control the format of the HTML you are processing, then you should use a module that understands HTML instead of pattern matching.

Quote:
> I would expect all three of these to match and return 456 as $1.
> But only the last one matches successfully. The first one does not
> match at all

You have written the pattern incorrectly.

Quote:
> and the second one matches the wrong text.

"non greedy" really means: prefer shortest possible that still *allows the overall match to succeed*.

It has to steamroller over some angle brackets in order to succeed.

Quote:
> Am I doing something wrong with the negated character class in
> REGEXP1?

No, but you do not allow the char class to repeat.

Quote:
> Is the non-greedy match in REGEXP2 wrong?

Nope, that is how it is supposed to work.

Use [^"]* instead.

Quote:
> if ($s =~ /<a\s+href="([^"])"\s+title="[^"]">NEXT<\/a>/) {
                         ^^^^            ^^^^

Requires exactly one character between the quotes, should be zero or more instead.

Quote:
> print "REGEXP 1: ", defined $1 ? "matched [$1]" :
> "failed","\n";

If control gets inside the if block, then $1 will always be defined for the pattern you've written. No need for that test.

--
Tad McClellan SGML consulting
Fort Worth, Texas
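
For reference, a minimal sketch of the REGEXP 1 fix suggested in this reply: the negated class gets a quantifier so it can match more than one character. It reuses the test string from the original post. The variable names $one_char and $href are illustrative and do not come from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $s = '<a href="123" title="abc">LAST</a><a href="456" title="abc">NEXT</a>';

# [^"] with no quantifier matches exactly ONE non-quote character,
# so href="123" (three characters between the quotes) can never match.
my $one_char = $s =~ /<a\s+href="([^"])"\s+title="[^"]">NEXT<\/a>/ ? 1 : 0;

# [^"]* matches zero or more non-quote characters and cannot cross a
# closing quote. The attempt at the first anchor therefore fails at LAST,
# and the engine retries starting from the second anchor.
my ($href) = $s =~ /<a\s+href="([^"]*)"\s+title="[^"]*">NEXT<\/a>/;

print "one-char class: ", ($one_char ? "matched" : "failed"), "\n";
print "REGEXP 1 fixed: ", (defined $href ? "matched [$href]" : "failed"), "\n";
```

Because [^"]* cannot run past a closing quote, this form also avoids the REGEXP 2 problem without having to list the allowed characters as REGEXP 3 does. It still assumes double-quoted attributes in a fixed order, so the caveat about real-world HTML applies.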
|
Mon, 27 Dec 2004 00:17:52 GMT |
|
 |
Bill Smit #3 / 5
|
 RegExp Char Class
When I execute your script with my copy of ActivePerl 5.6.1, REGEXP 2 and REGEXP 3 both match 456. When I insert "+" after both negated classes in REGEXP1, it also matches 456. I do not have any idea why you have a problem with REGEXP 2.

Bill
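
One possible explanation for the differing REGEXP 2 results, offered as a guess that nobody in the thread confirmed: if the long `my $s = ...` line was wrapped when the script was copied, the string would contain a literal newline. Without the /s modifier, "." does not match a newline, so ".+?" could no longer stretch from the first title to NEXT. The sketch below compares the two cases; $wrapped is a hypothetical reconstruction of such a wrapped string, and the variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# REGEXP 2 exactly as posted.
my $re = qr/<a\s+href="(\S+?)"\s+title=".+?">NEXT<\/a>/;

# The string as posted, on a single line.
my $one_line = '<a href="123" title="abc">LAST</a><a href="456" title="abc">NEXT</a>';

# Hypothetical: the same string with a newline where a line wrap fell.
my $wrapped = qq{<a href="123" title="abc">LAST</a><a href="456"\ntitle="abc">NEXT</a>};

my ($got_one_line) = $one_line =~ $re;
my ($got_wrapped)  = $wrapped  =~ $re;

print "single line: [$got_one_line]\n";
print "wrapped:     [$got_wrapped]\n";
```

On the single-line string the lazy ".+?" grows across the first anchor and $1 is 123. With the newline in the way, the attempt starting at the first anchor fails outright, so the match is found at the second anchor and $1 is 456.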
|
Mon, 27 Dec 2004 00:53:44 GMT |
|
 |
Enriqu #4 / 5
|
 RegExp Char Class
Quote:
> I would expect all three of these to match and return 456 as $1.
> But only the last one matches successfully. The first one does not
> match at all and the second one matches the wrong text.
> Am I doing something wrong with the negated character class in
> REGEXP1?

You need * or + after the class to match more than one char.

Quote:
> Is the non-greedy match in REGEXP2 wrong?

Hm, non-greedy means try the shorter match first, not try the last match first. Perhaps it should be considered a bug? I have not thought about it before.

The .+? in the title matches everything from abc">LAST... through the value of the second title. This is because the shorter interpretation (stopping at the first ") fails at LAST, which is not NEXT. You must force a failure somewhere, e.g., use \S+? instead; that will make the pattern fail when it sees the space after the second <a. Then it will backtrack and retry the initial part of the pattern at the next possible starting position.

/Enrique
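
To make the description above concrete, here is a small sketch using the test string from the original post. The first pattern captures what the lazy ".+?" in the title actually consumes; the second swaps in \S+? as suggested. The variable names are illustrative, and the \S+? variant only works here because the title value contains no whitespace.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $s = '<a href="123" title="abc">LAST</a><a href="456" title="abc">NEXT</a>';

# Capture the title part to see how far the lazy .+? has to grow
# before ">NEXT</a> can finally match.
my ($title_lazy) = $s =~ /<a\s+href="\S+?"\s+title="(.+?)">NEXT<\/a>/;

# \S+? cannot cross the space after the second "<a", so the attempt that
# starts at the first anchor fails and the second anchor is tried.
my ($href_nospace) = $s =~ /<a\s+href="(\S+?)"\s+title="\S+?">NEXT<\/a>/;

print "lazy title spans: [$title_lazy]\n";
print "href with \\S+? in title: [$href_nospace]\n";
```

The first capture runs from the first title value all the way into the second anchor, which is why $1 in REGEXP 2 comes from the first href.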
|
Mon, 27 Dec 2004 23:29:27 GMT |
|
 |
Tad McClell #5 / 5
|
 RegExp Char Class
Quote:
>> Is the non-greedy match in REGEXP2 wrong?
> Hm, non-greedy means try shorter match first, not try last match first.
> Perhaps it should be considered a bug?

It is a feature, not a bug.

Quote:
> I have not thought about it
> before. The .+? in the title matches everything from abc">LAST...
> to the second title. This is because a shorter interpretation (stopping
> at the first ") fails at LAST which is not NEXT.

perldoc perlre (my emphasis)

----------------------------------------------------
By default, a quantified subpattern is "greedy", that is, it will match as
many times as possible (given a particular starting location) while still
                                                                    ^^^^^
allowing the rest of the pattern to match. If you want it to match the
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
minimum number of times possible, follow the quantifier with a "?". Note
that the meanings don't change, just the "greediness":
     ^^^^^^^^^^^^^^^^^^^^^^^^^
----------------------------------------------------

Quote:
>> if ($s =~ /<a\s+href="(\S+?)"\s+title=".+?">NEXT<\/a>/) {
>> print "REGEXP 2: ", defined $1 ? "matched [$1]" :
>> "failed","\n";

The match operator's prime directive is to try to match.

Only if the match succeeds does "greediness" come into play.

Greediness never changes whether a match will pass or fail; it only changes which characters will be matched.

--
Tad McClellan SGML consulting
Fort Worth, Texas
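
A short sketch of that last point, using a made-up string rather than the HTML from the thread: the greedy and lazy forms succeed or fail together, and differ only in which characters they take. The strings and variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = 'a1b2b3';

# Both forms match; only the extent of the match differs.
my ($greedy) = $text =~ /(a.*b)/;     # runs to the last "b"
my ($lazy)   = $text =~ /(a.*?b)/;    # stops at the first "b"

# With no "b" in the string, both forms fail.
my $no_b         = 'a123';
my $greedy_fails = $no_b !~ /a.*b/;
my $lazy_fails   = $no_b !~ /a.*?b/;

print "greedy: [$greedy]\n";
print "lazy:   [$lazy]\n";
print "both fail without a b\n" if $greedy_fails && $lazy_fails;
```

The same reasoning explains REGEXP 2: since a match starting at the first anchor is possible, the lazy quantifier stretches as far as needed to reach it instead of giving up on that starting point.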
|
Tue, 28 Dec 2004 00:07:16 GMT |
|