
Help with regular expression
[Posted to comp.lang.perl.misc and copy mailed.]
Quote:
> I'm trying to parse table rows in a pathological html document. The document
> doesn't have any closing tags for <table>, <tr>, <th>, or <td>. This is what I
I'll bet it has </table> or it wouldn't display at all.
Quote:
> tried that did not work:
> while ( $html =~ m#(<tr>.*?)<tr>)#gs ) {
> # do stuff
> }
You couldn't have tried exactly that, because the parentheses don't
match. However...
Quote:
> It picks up every other row because I have run past the beginning of the row
> tag with the second "<tr>". Is there a way to tell perl to 'put back' the four
> characters before continuing the match? How should I be doing this?
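(?=...) is a zero-width look-ahead: look but don't touch. The match
position stays in front of the second <tr>, so the next iteration
starts there. So try this: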
while ( $html =~ m#(<tr>.*?)(?=<tr>)#gis ) {
Now the parentheses match :-). I've added the /i modifier because you
want to ignore case for HTML tags.
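To see the difference side by side, here is a minimal sketch. The sample string in $html and the array names are made up for illustration; only the two patterns come from the discussion above (the first with its parentheses balanced).

```perl
use strict;
use warnings;

# Hypothetical sample input: rows with no closing tags, mixed case.
my $html = '<table><tr><td>a<TR><td>b<tr><td>c<tr><td>d</table>';

# Consuming version: the trailing <tr> is eaten by each match,
# so the next search starts after it and every other row is skipped.
my @consumed = $html =~ m#(<tr>.*?)<tr>#gis;

# Look-ahead version: the trailing <tr> is checked but left in place,
# so it becomes the start of the next match.
my @peeked = $html =~ m#(<tr>.*?)(?=<tr>)#gis;

print scalar(@consumed), " rows: @consumed\n";   # rows a and c only
print scalar(@peeked),   " rows: @peeked\n";     # rows a, b and c
```

Note that even the look-ahead version misses row d here, because nothing that looks like <tr> follows it.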
How will you deal with the last row of the table (i.e., no following
<tr>)? You might want to change that look-ahead to (?=<tr>|</table>).
As I said, I'll bet there is an </table> tag.
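A sketch of the loop with the widened look-ahead, again on a made-up sample string. The alternation is written without inner capturing parentheses, since the look-ahead already groups it. It assumes the document really does contain </table>.

```perl
use strict;
use warnings;

# Hypothetical sample input; assumes a closing </table> is present.
my $html = '<table><tr><td>a<tr><td>b<tr><td>c</table>';

my @rows;
# Each row ends just before the next <tr> or before </table>.
while ( $html =~ m#(<tr>.*?)(?=<tr>|</table>)#gis ) {
    push @rows, $1;    # do stuff with the row here
}

print scalar(@rows), " rows: @rows\n";
```

If the bet is lost and there is no </table> after all, adding |\z to the alternation would let the last row run to the end of the string.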
HTH,
--
Larry Rosler
Hewlett-Packard Laboratories
http://www.*-*-*.com/