Multiline regexps with matches 

I have a loop that parses an HTML file mostly consisting of a huge
definition list.  There's various data in each definition, mostly
formatted in a precise fashion.  I'm writing a Perl script to move that
data to a tab-delimited file for import into a database, like I should have
done in the first place.  Most of the data moves across easily; however, two
of the "fields" in my data can (sometimes) cross multiple lines.
Specifically, data can look like:

Products: BBS in a Box CD, Trilogy CD, Music Madness CD,
Art Madness CD, Photo CDs
<P>

AMUG is a non-profit user group which makes CDs, provides hands on
training, and is the first user group to be an internet provider. They
have a monthly newsletter, an internet ftp and web site, and the biggest
Newton site!
<P>
<hr>

I need to find all the text between "Products: " and "<P>" and write that into
a record in my output file.  I've succeeded in this task for all previous
fields, but they all had data on one line only.  The products field data
can cross an arbitrary number of lines.  I'd like something like

   if ($record =~ /Products\: (.*)\<P\>/) {print TAB $1};

but the "." doesn't allow the matched text to cross the newline
character.  Is there a standard way to do this?  Any suggestions are much
appreciated.

--
Elliotte Rusty Harold      Dept. of Mathematics




Thu, 31 Jul 1997 05:25:51 GMT  

Quote:
> I need to find all the text between "Products: " and "<P>" and write that into
> a record in my output file.  I've succeeded in this task for all previous
> fields, but they all had data on one line only.  The products field data
> can cross an arbitrary number of lines.  I'd like something like

>    if ($record =~ /Products\: (.*)\<P\>/) {print TAB $1};

> but the "." doesn't allow the matched text to cross the newline
> character.  Is there a standard way to do this?  Any suggestions are much
> appreciated.

I _think_ I can solve this: replace the period with a negated character set,
that is, the set of all characters except one that you can _guarantee_ will
not occur in the data.  The resulting set appears, from my tests, to include
end-of-line.
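
A minimal sketch of the negated-set idea, for illustration only.  It assumes
the whole record is already in $record and that the NUL character (\0) never
occurs in the data; the choice of NUL, the sample text and the $products
variable are assumptions, not part of the original script.

```perl
use strict;
use warnings;

# Sample record; the text is copied from the example in the question.
my $record = "Products: BBS in a Box CD, Trilogy CD, Music Madness CD,\n"
           . "Art Madness CD, Photo CDs\n"
           . "<P>\n";

my $products;
# [^\0] matches any character except NUL, so unlike "." it also matches "\n".
# The non-greedy *? stops at the first <P> rather than the last one.
if ($record =~ /Products: ([^\0]*?)\s*<P>/) {
    $products = $1;
    $products =~ s/\s*\n\s*/ /g;   # fold line breaks so the field fits on one row
    print "$products\n";
}
```

Folding the embedded newlines into spaces is a design choice for the
tab-delimited output; drop that substitution if the line breaks should be kept.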

                                        Philip Taylor, RHBNC

(Using Perl5 under AXP/VMS and MS/DOS).



Wed, 06 Aug 1997 02:19:47 GMT  

Quote:
> Specifically data can look like:

> Products: BBS in a Box CD, Trilogy CD, Music Madness CD,
> Art Madness CD, Photo CDs
> <P>

> I need to find all the text between "Products: " and "<P>" and write that into
> a record in my output file.  I've succeeded in this task for all previous
> fields, but they all had data on one line only.  The products field data
> can cross an arbitrary number of lines.  I'd like something like

>    if ($record =~ /Products\: (.*)\<P\>/) {print TAB $1};

Enabling multiline matching is not enough; you should also set the input record
separator, which tells Perl up to which 'separator' to read each record from
the input file.  So setting

$/ = "<P>";

will get all the lines of a record into $_ (special handling for the first and
last records is probably needed).  The separator is matched literally, so its
case has to agree with the tags in the file.

Instead of '.' in the regexp, use (.|\n) to match newlines as well.
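
Roughly, the two suggestions above might be combined as follows.  This is only
a sketch under stated assumptions: the input is read from an in-memory string
($html, a stand-in for the real file), the tags are upper-case "<P>" as in the
sample data, and the names $html, $chunk and @products are made up for the
example.

```perl
use strict;
use warnings;

# Sample input adapted from the question; a real script would open the HTML file.
my $html = "Products: BBS in a Box CD, Trilogy CD, Music Madness CD,\n"
         . "Art Madness CD, Photo CDs\n"
         . "<P>\n\n"
         . "AMUG is a non-profit user group which makes CDs.\n"
         . "<P>\n<hr>\n";

open(my $in, '<', \$html) or die "cannot open in-memory file: $!";

my @products;
{
    local $/ = "<P>";   # each read returns everything up to and including "<P>"
    while (my $chunk = <$in>) {
        # (?:.|\n) matches any character including newline; *? keeps the match short.
        if ($chunk =~ /Products: ((?:.|\n)*?)\s*<P>/) {
            my $field = $1;
            $field =~ s/\s*\n\s*/ /g;   # join continuation lines with a single space
            push @products, $field;
        }
    }
}
close($in);

print join("\t", @products), "\n";
```

Using "local" inside a block restores the normal line-by-line separator
afterwards.  In Perl 5 the /s modifier, which lets "." match a newline, should
work in place of (?:.|\n).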

I don't work with HTML files myself, but there are so many postings about
parsing these files, all more or less alike, that maybe someone should
come up with a Perl package for this?

---
---------------------------------------------------------------------------

                                            Tel    : (972) 9 594210
---------------------------------------------------------------------------



Wed, 06 Aug 1997 17:25:54 GMT  
 