Probably a simple one 

RM> I am trying to extract the href from links in HTML, however the
RM> regular expression matcher doesn't appear to stop in the correct
RM> place.

RM> I intend the regular expression to extract the href that is
RM> enclosed in quotes and return that into $1. However, it seems to
RM> 'miss' the first set of quotes and a following one and finally
RM> stop on a third set.

RM> Here is an example

RM> s="<A  href=\"l.htm\">xxx <IMG  src=\"images/la.gif\" width=4></A>"

RM> s =~ /<.*A.*href *= *"(.*)".*>/   => 0

$1   =>  "l.htm\">xxx <IMG  src=\"images/la.gif"

s = '<A  href="l.htm">xxx <IMG  src="images/la.gif" width=4></A>'
s =~ /<a\s+href\s*=\s*\"*([^>\"\s]+).*/i
p $1

The above captures everything after the href= (and an optional opening
quote) up to the first >, quote, or whitespace character.  Is that what
you were looking for?
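A quick sketch of how that pattern behaves on a few inputs. The test strings and variable names here are made up for illustration; only the pattern itself comes from the lines above. Note the assumption built into it: href has to be the first attribute after the tag name.

```ruby
pattern = /<a\s+href\s*=\s*\"*([^>\"\s]+).*/i

# Quoted and unquoted attribute values both work, because the
# opening quote is optional and the capture stops at >, ", or space.
quoted_href   = '<A  href="l.htm">xxx</A>'[pattern, 1]
unquoted_href = '<a href=l.htm>xxx</a>'[pattern, 1]

# If another attribute comes before href, the pattern does not match.
other_first   = '<a class="x" href="l.htm">xxx</a>'[pattern, 1]

p quoted_href     # "l.htm"
p unquoted_href   # "l.htm"
p other_first     # nil
```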

regards,
-joe



Sat, 17 Apr 2004 16:19:44 GMT  
Thanks Joe,

I can see how your expression works, however I am still mystified by the
behavior of my own.

If .* just keeps matching to the end of the line, then my version should
have returned nil, as the whole expression would never have matched.
But it didn't.

I have just been playing with this

z="abc"
z=~/(.*b)/   => 0
$1  => "ab"
z=~/(.*)b/ => 0
$1  => "a"

So it seems that .* is matching in this case until it finds the character
following. But in the example below it doesn't.  It seems to randomly stop.

Can anyone clarify this?

Thanks

Quote:

> RM> s="<A  href=\"l.htm\">xxx <IMG  src=\"images/la.gif\" width=4></A>"
> RM> s =~ /<.*A.*href *= *"(.*)".*>/   => 0
> $1   =>  "l.htm\">xxx <IMG  src=\"images/la.gif"

> s = '<A  href="l.htm">xxx <IMG  src="images/la.gif" width=4></A>'
> s =~ /<a\s+href\s*=\s*\"*([^>\"\s]+).*/i
> p $1

> The above captures everything after the href= (and an optional opening
> quote) up to the first >, quote, or whitespace character.  Is that what
> you were looking for?

> regards,
> -joe



Sat, 17 Apr 2004 16:55:41 GMT  
Hello --

(Disclaimer: I am aware of the fact that HTML cannot be parsed with a
single regular expression.  I just want to help the guy grasp the
syntax :-)

Quote:

> Thanks Joe,

> I can see how your expression works, however I am still mystified by the
> behavior of my own.

> If .* just keeps matching to the end of the line, then my version should
> have returned nil, as the whole expression would never have matched.
> But it didn't.

> I have just been playing with this

> z="abc"
> z=~/(.*b)/   => 0
> $1  => "ab"
> z=~/(.*)b/ => 0
> $1  => "a"

> So it seems that .* is matching in this case until it finds the character
> following. But in the example below it doesn't.  It seems to randomly stop.

No, it stops at exactly the point you've asked it to :-)

Quote:
> > RM> s="<A  href=\"l.htm\">xxx <IMG  src=\"images/la.gif\" width=4></A>"
> > RM> s =~ /<.*A.*href *= *"(.*)".*>/   => 0
> > $1   =>  "l.htm\">xxx <IMG  src=\"images/la.gif"

This part of your pattern:

    "(.*)"

means:

    a double quote, followed by zero or more of any character (except newline),
    followed by another double quote.

and it saves the "zero or more..." part to $1, because that's what the
parens are around.

Note that .* matches are "greedy".  That is, the match goes as far
along in the string as it can and still succeed in matching.  "Zero or
more" is really "as many as possible".  It has to stop consuming
characters when it gets to the last double quote, because if it
consumes that, the match will fail (because the match calls for a
double quote *after* the .* part).  But it can consume all the double
quotes prior to the last one.
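
To make that concrete, here is a small sketch. The first string is the "abc" example from the quoted post; the second string and the variable names are invented for illustration.

```ruby
z = "abc"

# .* first grabs all of "abc", then gives characters back one at a
# time until the rest of the pattern (a literal b) can match.
z =~ /(.*)b/
greedy_small = $1

# With several possible stopping points, greedy .* ends up at the
# LAST one, because that is the first place backtracking succeeds.
q = 'x"one"y"two"z'
q =~ /"(.*)"/
greedy_quotes = $1

p greedy_small    # "a"
puts greedy_quotes  # one"y"two
```

In "abc" there is only one b, so the greedy match and the "stop at the next character" reading happen to agree, which is why the small experiment looked different from the HTML case.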

You can do a "non-greedy" match using /.*?/ instead of /.*/.  For example:

s="<A  href=\"l.htm\">xxx <IMG  src=\"images/la.gif\" width=4></A>"
s =~ /<.*A.*href *= *"(.*?)".*>/

  # $1 =>  l.htm
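
Another option, offered as a sketch rather than as the one right answer, is to say explicitly that the captured text may not contain a double quote, using [^"]* in place of .*?. The two-link string and the variable names below are made up for illustration, and the usual caveat about parsing HTML with regular expressions still applies.

```ruby
s = "<A  href=\"l.htm\">xxx <IMG  src=\"images/la.gif\" width=4></A>"

# [^"]* can never run past a double quote, so it does not matter
# that it is greedy.
s =~ /<.*A.*href *= *"([^"]*)".*>/
first_href = $1

# For a line with several links, String#scan collects every match.
# The leading .* is dropped here so each match starts at its own tag.
two = '<A href="a.htm">one</A> <A href="b.htm">two</A>'
all_hrefs = two.scan(/<a\s[^>]*?href\s*=\s*"([^"]*)"/i).flatten

p first_href   # "l.htm"
p all_hrefs    # ["a.htm", "b.htm"]
```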

David

--
David Alan Black


Web:  http://pirate.shu.edu/~blackdav



Sat, 17 Apr 2004 19:37:27 GMT  
 