Help with regular expression 

[Posted to comp.lang.perl.misc and copy mailed.]



Quote:
> I'm trying to parse table rows in a pathological html document. The document
> doesn't have any closing tags for <table>, <tr>, <th>, or <td>. This is what I

I'll bet it has </table> or it wouldn't display at all.

Quote:
> tried that did not work:

>   while ( $html =~ m#(<tr>.*?)<tr>)#gs ) {
>       # do stuff
>    }

You couldn't have tried exactly that, because the parentheses don't
match.  However...

Quote:
> It picks up every other row because I have run past the beginning of the row
> tag with the second "<tr>".  Is there a way to tell perl to 'put back' the four
> characters before continuing the match? How should I be doing this?

(?=...) means look but don't touch.  So try this:

    while ( $html =~ m#(<tr>.*?)(?=<tr>)#gis ) {

Now the parentheses match :-).  I've added the /i modifier because you  
want to ignore case for HTML tags.
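A minimal sketch of the difference, assuming a made-up sample string; the sub names `rows_consuming` and `rows_lookahead` are hypothetical and only serve the illustration:

```perl
use strict;
use warnings;

# Hypothetical sample input: four rows, no closing tags at all.
my $sample = "<tr><td>a<tr><td>b<tr><td>c<tr><td>d";

# The trailing <tr> is part of the match, so the next /g iteration
# resumes after it and the row it opens is skipped.
sub rows_consuming {
    my ($text) = @_;
    my @rows;
    while ( $text =~ m#(<tr>.*?)<tr>#gis ) {
        push @rows, $1;
    }
    return @rows;
}

# The trailing <tr> is only inspected by the look-ahead, so the next
# /g iteration resumes exactly at it.
sub rows_lookahead {
    my ($text) = @_;
    my @rows;
    while ( $text =~ m#(<tr>.*?)(?=<tr>)#gis ) {
        push @rows, $1;
    }
    return @rows;
}

print join( "\n", rows_consuming($sample) ), "\n---\n";
print join( "\n", rows_lookahead($sample) ), "\n";
```

On this sample the consuming version returns rows a and c only, while the look-ahead version returns a, b, and c. Neither returns row d, because no <tr> follows it.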

How will you deal with the last row of the table (i.e., no following
<tr>)?  You might want to change that look-ahead to (?=(<tr>|</table>)).  
As I said, I'll bet there is an </table> tag.

HTH,

--
Larry Rosler
Hewlett-Packard Laboratories
http://www.*-*-*.com/



Wed, 24 Jan 2001 03:00:00 GMT  
I'm trying to parse table rows in a pathological html document. The document
doesn't have any closing tags for <table>, <tr>, <th>, or <td>. This is what I
tried that did not work:

  while ( $html =~ m#(<tr>.*?)<tr>)#gs ) {
      # do stuff
   }

It picks up every other row because I have run past the beginning of the row
tag with the second "<tr>".  Is there a way to tell perl to 'put back' the four
characters before continuing the match? How should I be doing this?

Thanks!

Andrew Robinson
---
Disclaimer: The opinions expressed are mine alone and do not represent the
views of America Online



Thu, 25 Jan 2001 03:00:00 GMT  

Quote:

}I'm trying to parse table rows in a pathological html document. The document
}doesn't have any closing tags for <table>, <tr>, <th>, or <td>. This is what I
}tried that did not work:
}  while ( $html =~ m#(<tr>.*?)<tr>)#gs ) {
}      # do stuff
}   }
}It picks up every other row because I have run past the beginning of the row
}tag with the second "<tr>".  Is there a way to tell perl to 'put back' the four
}characters before continuing the match? How should I be doing this?

Look up "look-ahead" in the perlre man page (or, for true insights, in
Friedl's Mastering Regular Expressions).  Then use it.

--



Thu, 25 Jan 2001 03:00:00 GMT  
: How will you deal with the last row of the table (i.e., no following
: <tr>)?  You might want to change that look-ahead to (?=(<tr>|</table>)).  

First, you'll want to make those non-capturing parens or you'll mess up
what gets returned on each /g iteration.  Second, there's a minor
optimization possible which only matters if you're doing this a lot --
putting the common < and > outside the alternation.  Third, though this
becomes irrelevant, the paren grouping around the alternation is
unnecessary.  Combining these:

  (?=<(?:tr|/table)>)
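
To see what the capturing parens change, here is a small sketch that runs both look-aheads in list context. It assumes the document does contain a </table> tag; the sub names and the sample string are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical sample input: two rows, terminated by </table>.
my $sample = "<table><tr><td>a<tr><td>b</table>";

# Capturing group inside the look-ahead: in list context, /g returns
# both $1 and $2 for every match, so delimiters are mixed in with rows.
sub rows_capturing {
    my ($text) = @_;
    return $text =~ m#(<tr>.*?)(?=(<tr>|</table>))#gis;
}

# Non-capturing alternation with < and > factored out: only $1 comes back.
sub rows_noncapturing {
    my ($text) = @_;
    return $text =~ m#(<tr>.*?)(?=<(?:tr|/table)>)#gis;
}

print join( " | ", rows_capturing($sample) ),    "\n";
print join( " | ", rows_noncapturing($sample) ), "\n";
```

The first call returns four items (each row followed by the tag that ended it), the second returns just the two rows. In a while loop that only reads $1 the extra group is harmless, but it does overwrite $2 on every iteration.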

---------------------------------------------------------------------

 --*--    Home Page: http://www.cinenet.net/users/cberry/home.html
   |      Member of The HTML Writers Guild: http://www.hwg.org/  
       "Every man and every woman is a star."



Thu, 25 Jan 2001 03:00:00 GMT  

Quote:
>(?=...) means look but don't touch.  So try this:

>  while ( $html =~ m#(<tr>.*?)(?=<tr>)#gis ) {

Thanks to everyone who suggested lookahead. That was what I needed and did not
know.

Quote:
>As I said, I'll bet there is an </table> tag.

There was not. The body of the document consisted entirely of tables. One table
ended with the beginning tag of the next table. The last one was terminated by
the </html> tag. Surprised me too, but Netscape had no trouble rendering it.
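
For a document shaped like that, one possible look-ahead stops a row at the next <tr>, at the next opening <table tag, at </html>, or at the end of the string. This is only a sketch under those assumptions, not the code actually used; the sub name `rows_until_boundary` and the sample string are made up:

```perl
use strict;
use warnings;

# Hypothetical sample input: two tables, no </table>, closed by </html>.
my $sample = "<table><tr><td>a<tr><td>b<table><tr><td>c</html>";

# A row runs until the next <tr, the next <table, </html>, or the end
# of the string (\z), whichever comes first. \b keeps "<tr" from
# matching a longer tag name that merely starts with "tr".
sub rows_until_boundary {
    my ($text) = @_;
    return $text =~ m#(<tr>.*?)(?=<(?:tr|table|/html)\b|\z)#gis;
}

print "$_\n" for rows_until_boundary($sample);
```

The \z alternative is there so that the last row is still returned if the </html> tag is missing as well.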

Thanks everyone for the help.

Andrew Robinson
---
Disclaimer: The opinions expressed are mine alone and do not represent the
views of America Online



Thu, 25 Jan 2001 03:00:00 GMT  
 