
Extracting strings from postscript in a single regexp?
|> >I would like to come up with a way to match strings in a postscript
|> >file (things like ( ...literal string.... )) and I would like to do it
|> >with a single regexp.
|> >
|> >But:
|> >i) Strings can span several lines
|> >
|> >ii) they can contain escaped characters, notably "\(" and "\)" to
|>
|> As for spanning multiple lines: In perl4 say $* = 1 to enable multiline
|> matching. In perl5 use m//s to switch on multiline matching locally.
As far as I know, perl4's $* only affects the meaning of ^ and $... one
still has to worry about getting the entire text of a string into $_ in the
first place.
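For what it's worth, here is a minimal sketch of getting the whole text into one scalar before matching. It is written in perl5 syntax, which is an assumption on my part; `slurp` and the sample data are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: read everything from a filehandle in one go.
sub slurp {
    my ($fh) = @_;
    local $/;                 # undefined record separator => read to EOF
    my $text = <$fh>;
    return defined $text ? $text : '';
}

# Demo on an in-memory "file" (invented sample, not real generator output).
my $sample = "(first\nsecond) show\n(a \\( b) show\n";
open(my $fh, '<', \$sample) or die "cannot open in-memory handle: $!";
my $text = slurp($fh);
close($fh);

# [^()\\] also matches newlines, so the two-line string is found
# without touching $* at all.
my @found;
push @found, $1 while $text =~ m/\(((\\.|[^()\\])*)\)/g;
print "[$_]\n" for @found;
```

In perl4 the same idea would be `undef $/; $_ = <>;`, at the cost of holding the whole file in memory.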
|> But you have a worse problem: Postscript allows balanced parenthesis
|> to appear unquoted in a "()"-delimited string.
Ouch.
|> to recognize balanced parenthesis which currently just can't be done
|> or so I believe.
Anno's correct, but since Sandro mentioned that parens might be escaped,
maybe he knows that in his input, they'll never be unescaped (??). I think
many programs that generate postscript will just blindly escape all such
parens [my a2ps filter does].
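To make "blindly escape" concrete, a generator could do something along these lines. This is only an illustration in perl5 syntax, not the actual a2ps code; `ps_escape` is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: backslash-escape every paren and backslash so the
# result can be wrapped in ( ) as a PostScript string literal.
sub ps_escape {
    my ($s) = @_;
    $s =~ s/([\\()])/\\$1/g;
    return $s;
}

# prints: (f\(x\) = a\\b) show
print "(", ps_escape('f(x) = a\b'), ") show\n";
```

Output produced this way never contains an unescaped paren inside a string, which is the case the simple regex can cope with.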
Someone's already given a perl5 answer, so I'll see what I can come up
with in perl4.
First of all, the simple regex to match the stuff inside parens would
be along the lines of:
    print $1 while m/\(((\\.|[^()\\])*)\)/g;
                     AA( BBB CCCCCCC  )ZZ

    AA,ZZ -- opening and closing raw parens.
    BB    -- anything escaped [ including \\, \(, and \) ]
    CC    -- anything not a paren or a backslash
This will fail miserably if there are unquoted parens within a string, or
if there are aberrations in the input stream.
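If unquoted balanced parens do turn up, one way around the regex limitation is to walk the text and keep a nesting depth. A minimal sketch in perl5 syntax follows; `ps_strings` is a hypothetical name, and it assumes the input has no parens in `%` comments or other non-string contexts.

```perl
use strict;
use warnings;

# Hypothetical helper: return the contents of every top-level ( ... )
# string in $text, allowing balanced unescaped parens inside.
sub ps_strings {
    my ($text) = @_;
    my @out;
    my $cur   = '';
    my $depth = 0;
    # Tokens: a backslash plus the char it escapes, or any single char.
    while ($text =~ m/(\\.|.)/sg) {
        my $tok = $1;
        if ($tok eq '(') {
            if ($depth == 0) { $cur = ''; }       # start a new string
            else             { $cur .= $tok; }    # nested paren is content
            $depth++;
        } elsif ($tok eq ')' && $depth > 0) {
            $depth--;
            if ($depth == 0) { push @out, $cur; } # string complete
            else             { $cur .= $tok; }
        } elsif ($depth > 0) {
            $cur .= $tok;                         # ordinary or escaped char
        }
    }
    return @out;                                  # unterminated string is dropped
}

# prints: [a (nested) b]
#         [x \) y]
print "[$_]\n" for ps_strings("(a (nested) b) show (x \\) y) show\n");
```

This is no longer a single regexp, but it handles nesting and multi-line strings in one pass.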
Now, if you're reading by lines and doing other processing, it may not be
practical to suck in the whole file to apply that to. But you've still got
to worry about multiple-line strings.
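One way to address this problem would be to read a line, and continue
reading lines for as long as the text has an unescaped open paren but no
unescaped close paren (or the other way around). Note that each match
yields only true or false, not a count, so this is a rough test: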
    m/(^\(|[^\\]\()/ != m/(^\)|[^\\]\))/;
       AAA BBBBBBB         CCC DDDDDDD

    AA - a raw unescaped open paren at the start of a line
    BB - a raw unescaped open paren within a line
    CC - a raw unescaped close paren at the start of a line
    DD - a raw unescaped close paren within a line
Using it all:
    while (<>) {
        # keep appending lines while the paren test is lopsided; stop at end of input
        $_ .= <> while !eof() && m/(^\(|[^\\]\()/ != m/(^\)|[^\\]\))/;
        print $1, "\n" while m/\(((\\.|[^()\\])*)\)/g;
    }
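The loop above only tests whether an open or a close paren is present, so a line that closes one string and opens another slips through. A counting variant might look like the sketch below. It is in perl5 syntax as an assumption; `paren_counts` and `balanced_chunks` are hypothetical names, and the lines are passed in as a list so the example does not depend on any input file.

```perl
use strict;
use warnings;

# Hypothetical helper: (open, close) counts of unescaped parens in $text.
sub paren_counts {
    my ($text) = @_;
    $text =~ s/\\.//sg;                # drop each backslash and the char it escapes
    my $open  = ($text =~ tr/(//);     # tr with an empty replacement just counts
    my $close = ($text =~ tr/)//);
    return ($open, $close);
}

# Hypothetical helper: glue lines together until unescaped parens balance.
sub balanced_chunks {
    my @lines = @_;
    my @chunks;
    while (@lines) {
        my $chunk = shift @lines;
        while (@lines) {
            my ($open, $close) = paren_counts($chunk);
            last if $open <= $close;   # nothing left open
            $chunk .= shift @lines;
        }
        push @chunks, $chunk;          # also reached if input runs out early
    }
    return @chunks;
}

my @chunks = balanced_chunks("(a) show (b\n", "c) show\n", "(d \\( e) show\n");
for my $chunk (@chunks) {
    print "$1\n" while $chunk =~ m/\(((\\.|[^()\\])*)\)/g;
}
```

Stripping the escapes before counting also gets `\\(` right (an escaped backslash followed by a real paren), which the `[^\\]\(` test misreads.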
Hope this helps,
*jeffrey*
Note: I'm on vacation from tomorrow until 9/28, so I'll most likely not see
followups.
-------------------------------------------------------------------------
See my Jap/Eng dictionary at http://www.omron.co.jp/cgi-bin/j-e
or http://www.cs.cmu.edu:8001/cgi-bin/j-e