Extracting strings from postscript in a single regexp? 

Hello, everybody.

I wonder if the regexp wizards inhabiting this group have a solution
for the following.

I would like to come up with a way to match strings in a postscript
file (things like ( ...literal string.... )) and I would like to do it
with a single regexp.

But:
i)  Strings can span several lines

ii) they can contain escaped characters, notably "\(" and "\)" to
include quoted delimiters, and "\\" to include quoted backslashes so
sequences like "\\\\\\\\\\\\\\\\\\\)" are conceivable (silly, maybe,
but conceivable).

So, all I come up with when I think about it, is code that goes
through the thing a character at a time, sets a few flags, then says:

        while (InString) {
                <get the next char>
                blah,blah...
        }

I think you get the idea.
It works, but seems silly, given the wealth of regular expressions
that perl gives me...

Regards
                Alessandro Forghieri
--
Alessandro Forghieri  
Control Data Italy                               phone: ++39 2 2174253
Palazzo Bernini Milano 2                         fax:   ++39 2 26414187



Fri, 21 Feb 1997 15:55:16 GMT  

Quote:

>Hello, everybody.

>I wonder if the regexp wizards inhabiting this group have a solution
>for the following.

>I would like to come up with a way to match strings in a postscript
>file (things like ( ...literal string.... )) and I would like to do it
>with a single regexp.

>But:
>i)  Strings can span several lines

>ii) they can contain escaped characters, notably "\(" and "\)" to
>include quoted delimiters, and "\\" to include quoted backslashes so
>sequences like "\\\\\\\\\\\\\\\\\\\)" are conceivable (silly, maybe,
>but conceivable).

>So, all I come up with when I think about it, is code that goes
>through the thing a character at a time, sets a few flags, then says:

>    while (InString) {
>            <get the next char>
>            blah,blah...
>    }

>I think you get the idea.
>It works, but seems silly, given the wealth of regular expressions
>that perl gives me...

As for spanning multiple lines:  In perl4 say $* = 1 to enable multiline
matching.  In perl5 use m//s to switch on multiline matching locally.

But you have a worse problem:  Postscript allows balanced parentheses
to appear unquoted in a "()"-delimited string.  So you'd need a regex
to recognize balanced parentheses, which currently just can't be done,
or so I believe.

Maybe you could use something like

to sort of tokenize your input.  This would save you from having to visit
each character while giving you each of "(", ")", "|" in a separate string
to deal with.
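
The snippet that belonged here did not survive in this copy of the thread. As a rough sketch of the kind of tokenizing described, and assuming a split on a capturing group (so the delimiters come back as separate fields), something like the following could walk the tokens and keep a nesting depth. The sub name ps_strings and the sample input are made up for illustration, and PostScript comments and <...> hex strings are ignored.

```perl
use strict;
use warnings;

# Hypothetical helper: return the bodies of top-level (...) strings.
sub ps_strings {
    my ($text) = @_;
    my @found;
    my $depth   = 0;
    my $current = '';

    # The capturing group makes split return the delimiters as tokens too.
    # A backslash plus the following character stays together as one token.
    for my $tok (split /(\\.|[()])/s, $text) {
        next if $tok eq '';
        if ($tok eq '(') {
            if ($depth == 0) { $current = '' }      # a new string starts
            else             { $current .= $tok }   # nested paren is content
            $depth++;
        }
        elsif ($tok eq ')') {
            next if $depth == 0;                    # stray close paren
            $depth--;
            if ($depth == 0) { push @found, $current }
            else             { $current .= $tok }
        }
        elsif ($depth > 0) {
            $current .= $tok;                       # plain text or an escape
        }
    }
    return @found;
}

print "$_\n" for ps_strings('(a (b) c) show (d\)e) show');
```

Escapes are left undecoded in the returned bodies, so \) still shows up as two characters.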

Anno



Fri, 21 Feb 1997 18:23:22 GMT  


Quote:

>Hello, everybody.

>I wonder if the regexp wizards inhabiting this group have a solution
>for the following.

>I would like to come up with a way to match strings in a postscript
>file (things like ( ...literal string.... )) and I would like to do it
>with a single regexp.

>But:
>i)  Strings can span several lines

>ii) they can contain escaped characters, notably "\(" and "\)" to
>include quoted delimiters, and "\\" to include quoted backslashes so
>sequences like "\\\\\\\\\\\\\\\\\\\)" are conceivable (silly, maybe,
>but conceivable).

> perl -0 -lne '


'

I'm quite confident the above goes in the right direction. You might
have to postprocess the array's elements, as I find

    atend
    atend
    atend
    atend
    c
    4.0
    Frame product version does not match ps_prolog!

in the output, and there are all the \050 and \336 not yet translated,
but the rest seems pretty usable.

m/(?=\(((?:\\\d\d\d|\\.|[^\\])*?)\))/g

  |  |  |  |        |   |     |  ^^ closing paren
  |  |  |  |        |   |     ^^ match only till the first of them
  |  |  |  |        |   ^^^^^ anything not being a backslash
  |  |  |  |        ^^^ anything escaped by a backslash
  |  |  |  ^^^^^^^^ octal digits escaped by a backslash
  |  |  ^^ don't count that paren
  |  | ^ here we start
  |  ^^ opening paren
  ^^ don't count that paren

My machine has a lot of work with it. Let me know the problems I've
left for you...
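
For the untranslated \050 and \336 escapes, a postprocessing step might look like the sketch below. The sub name decode_ps_escapes and the sample input are invented for illustration; the escape table covers only the common PostScript escapes (octal, \n \r \t \b \f, backslash-newline, and a backslash before any other character).

```perl
use strict;
use warnings;

my %named = (n => "\n", r => "\r", t => "\t", b => "\b", f => "\f");

# Hypothetical helper: turn PostScript backslash escapes into characters.
sub decode_ps_escapes {
    my ($raw) = @_;
    $raw =~ s{\\(?:([0-7]{1,3})|(.))}{
          defined $1        ? chr(oct($1))     # \ddd octal code
        : exists $named{$2} ? $named{$2}       # \n, \t and friends
        : $2 eq "\n"        ? ''               # backslash-newline: continuation
        :                     $2               # \( \) \\ and anything else
    }ges;
    return $raw;
}

# Made-up sample, not taken from a real PostScript file.
my $ps  = '(Hello \(world\)) show (caf\351) show';
my @raw = $ps =~ m/(?=\(((?:\\\d\d\d|\\.|[^\\])*?)\))/g;
print decode_ps_escapes($_), "\n" for @raw;
```

On this sample the look-ahead also fires at the escaped paren inside the first string, which is one source of stray entries in the result.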

--andreas



Fri, 21 Feb 1997 21:32:10 GMT  

|> >I would like to come up with a way to match strings in a postscript
|> >file (things like ( ...literal string.... )) and I would like to do it
|> >with a single regexp.
|> >
|> >But:
|> >i)  Strings can span several lines
|> >
|> >ii) they can contain escaped characters, notably "\(" and "\)" to
|>
|> As for spanning multiple lines:  In perl4 say $* = 1 to enable multiline
|> matching.  In perl5 use m//s to switch on multiline matching locally.

As far as I know, perl4's $* only affects the meaning of ^ and $... one
still has to worry about getting the entire text of a string into $_ in the
first place.
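
In perl5 terms the two concerns are separate flags: /m changes what ^ and $ match, while /s lets . match a newline. A small sketch of the difference, with a slurp helper for getting a whole file into one scalar (the helper name and the sample text are made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical two-line sample standing in for a slurped file.
my $text = "(first\nsecond) show\n";

# Without /s the dot stops at the newline, so nothing is found.
my ($plain) = $text =~ /\((.*)\)/;

# With /s the dot crosses the newline.
my ($dotall) = $text =~ /\((.*)\)/s;

# /m only changes the anchors: ^ also matches right after a newline.
my @line_starts = $text =~ /^(\w+)/mg;

# Read a whole file into one scalar by undefining the record separator.
sub slurp {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    local $/;
    return scalar <$fh>;
}

print defined $plain ? "plain: $plain\n" : "plain: no match\n";
print "dotall: $dotall\n";
print "line starts: @line_starts\n";
```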

|> But you have a worse problem:  Postscript allows balanced parentheses
|> to appear unquoted in a "()"-delimited string.

Ouch.

|> to recognize balanced parentheses, which currently just can't be done,
|> or so I believe.

Anno's correct, but since Sandro mentioned that parens might be escaped,
maybe he knows that in his input, they'll never be unescaped (??). I think
many programs that generate postscript will just blindly escape all such
parens [my a2ps filter does].

Someone's already given a perl5 answer, so I'll see what I can come up
with in perl4.

First of all, the simple regex to match the stuff inside parens would
be along the lines of:

     print $1 while m/\(((\\.|[^()\\])*)\)/g;
                      AA( BBB CCCCCCC  )ZZ

    AA,ZZ  -- opening and closing raw parens.
    BB -- anything escaped [ including \\, \(, and \) ]
    CC -- anything not a paren or a backslash

This will fail miserably if there are unquoted parens within a string, or
if there are aberrations in the input stream.
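
For readers on a much later perl: from version 5.10 on, a pattern can recurse into one of its own groups, which covers the unquoted balanced parens case. This is not available in perl4 or the perl5 of this thread. A minimal sketch, with made-up names ($ps_string, extract_ps_strings) and sample input:

```perl
use strict;
use warnings;
use 5.010;   # named captures and (?&name) recursion need 5.10 or later

my $ps_string = qr{
    \(                          # opening raw paren
    (?<body>
        (?:
            \\ .                # anything escaped
          | [^()\\]             # anything not a paren or a backslash
          | \( (?&body) \)      # a nested, balanced, unescaped pair
        )*
    )
    \)                          # closing raw paren
}xs;

# Hypothetical helper: return the body of every top-level string.
sub extract_ps_strings {
    my ($text) = @_;
    my @bodies;
    while ($text =~ /$ps_string/g) {
        push @bodies, $+{body};
    }
    return @bodies;
}

print "$_\n" for extract_ps_strings('(a (b) c) show (d\)e) show');
```

The three alternatives start with different characters, so the pattern does not have to try them in combination.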

Now, if you're reading by lines and doing other processing, it may not be
practical to suck in the whole file to apply that to. But you've still got
to worry about multiple-line strings.

One way to address this problem would be to read a line, and continue
reading lines until the number of unescaped open parens is the same as
the number of unescaped close parens:

    m/(^\(|[^\\]\()/ != m/(^\)|[^\\]\))/;
       AAA BBBBBBB         CCC DDDDDDD

 AA - a raw unescaped open paren at the start of a line
 BB - a raw unescaped open paren within a line
 CC - a raw unescaped close paren at the start of a line
 DD - a raw unescaped close paren within a line

Using it all:

   while (<>) {
      $_ .= <> while m/(^\(|[^\\]\()/ != m/(^\)|[^\\]\))/;
      print $1, "\n" while m/\(((\\.|[^()\\])*)\)/g;
   }
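
One caveat: in scalar context each m// above yields only true or false, so the != test compares whether a paren of each kind is present, not how many there are, and the pattern [^\\]\( misreads a paren that follows an escaped backslash. A counting variant could be sketched as below in perl5 syntax. The sub names and the in-memory sample input are made up, and reading stops at end of input even if the counts never agree.

```perl
use strict;
use warnings;

# Count unescaped occurrences of one paren character in a chunk of text.
sub count_unescaped {
    my ($text, $paren) = @_;
    (my $bare = $text) =~ s/\\.//gs;        # drop every backslash escape first
    my $count = () = $bare =~ /\Q$paren\E/g;
    return $count;
}

# Read one line, then append lines until the counts agree or input ends.
sub read_balanced {
    my ($fh) = @_;
    my $chunk = <$fh>;
    return unless defined $chunk;
    while (count_unescaped($chunk, '(') != count_unescaped($chunk, ')')) {
        my $more = <$fh>;
        last unless defined $more;          # end of input: stop trying
        $chunk .= $more;
    }
    return $chunk;
}

# Usage with the extraction regex from above, on a made-up sample.
open my $in, '<', \"(one\n two) show\n(three \\) still) show\n" or die;
while (defined(my $chunk = read_balanced($in))) {
    print "$1\n" while $chunk =~ m/\(((\\.|[^()\\])*)\)/g;
}
```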

Hope this helps,
    *jeffrey*

Note: I'm on vacation from tomorrow until 9/28, so I'll most likely not see
followups.
-------------------------------------------------------------------------

See my Jap/Eng dictionary at http://www.omron.co.jp/cgi-bin/j-e
                          or http://www.cs.cmu.edu:8001/cgi-bin/j-e



Sat, 22 Feb 1997 09:54:52 GMT  
: m/(?=\(((?:\\\d\d\d|\\.|[^\\])*?)\))/g
:
:   |  |  |  |        |   |     |  ^^ closing paren
:   |  |  |  |        |   |     ^^ match only till the first of them
:   |  |  |  |        |   ^^^^^ anything not being a backslash
:   |  |  |  |        ^^^ anything escaped by a backslash
:   |  |  |  ^^^^^^^^ octal digits escaped by a backslash
:   |  |  ^^ don't count that paren
:   |  | ^ here we start
:   |  ^^ opening paren
:   ^^ don't count that paren

Hmm.

    m {
        (?xg)                           (?# extended, global)
        (?=                             (?# begin positive assertion)
            \(                          (?# opening paren)
                (                       (?# here we start)
                    (?:                 (?# don't count that paren)
                        \\ \d\d\d       (?# octal digits escaped by backslash)
                        |               (?# or)
                        \\ .            (?# anything escaped by a backslash)
                        |               (?# or)
                        [^\\]           (?# anything not being a backslash)
                    )*?                 (?# match only till the first of them)
                )                       (?# here we end)
            \)                          (?# closing paren)
        )                               (?# end assertion)
    }

Regular expressions will never be the same...

Larry



Sun, 23 Feb 1997 07:44:08 GMT  

: : m/(?=\(((?:\\\d\d\d|\\.|[^\\])*?)\))/g
: :
: :   |  |  |  |        |   |     |  ^^ closing paren
: :   |  |  |  |        |   |     ^^ match only till the first of them
: :   |  |  |  |        |   ^^^^^ anything not being a backslash
: :   |  |  |  |        ^^^ anything escaped by a backslash
: :   |  |  |  ^^^^^^^^ octal digits escaped by a backslash
: :   |  |  ^^ don't count that paren
: :   |  | ^ here we start
: :   |  ^^ opening paren
: :   ^^ don't count that paren

: Hmm.

:     m {
:       (?xg)                           (?# extended, global)
:       (?=                             (?# begin positive assertion)
:           \(                          (?# opening paren)
:               (                       (?# here we start)
:                   (?:                 (?# don't count that paren)
:                       \\ \d\d\d       (?# octal digits escaped by backslash)
:                       |               (?# or)
:                       \\ .            (?# anything escaped by a backslash)
:                       |               (?# or)
:                       [^\\]           (?# anything not being a backslash)
:                   )*?                 (?# match only till the first of them)
:               )                       (?# here we end)
:           \)                          (?# closing paren)
:       )                               (?# end assertion)
:     }

: Regular expressions will never be the same...

Wow! Comments embedded in regular expressions.

I take it that ?# is now a valid regular expression comment delimiter?

Paul



Sun, 23 Feb 1997 18:00:21 GMT  

Quote:

>Hmm.

>    m {

[ several lines of potentially decipherable regular expression syntax :-)]

Quote:
>    }

>Regular expressions will never be the same...

>Larry

Such a complex structure deserves a bigger name than 'm'.
BTW how does one use this wrt s/RE/.../?

chris



Mon, 24 Feb 1997 12:45:09 GMT  
:
:     m {
:       (?xg)                           (?# extended, global)
:       (?=                             (?# begin positive assertion)
:           \(                          (?# opening paren)
:               (                       (?# here we start)
:                   (?:                 (?# don't count that paren)
:                       \\ \d\d\d       (?# octal digits escaped by backslash)
:                       |               (?# or)
:                       \\ .            (?# anything escaped by a backslash)
:                       |               (?# or)
:                       [^\\]           (?# anything not being a backslash)
:                   )*?                 (?# match only till the first of them)
:               )                       (?# here we end)
:           \)                          (?# closing paren)
:       )                               (?# end assertion)
:     }
:
: Regular expressions will never be the same...
:
: Larry

Why not make the 'x' option make '#' magical as well within the regular
expressions, so that it will work as it does normally, i.e. a comment
to the end of the line. Then the (?# ...) stuff would at least be
unnecessary together with the x option.
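
For reference, in current Perl 5 the /x flag does treat an unescaped # outside a character class as the start of a comment running to the end of the line. A sketch of the same look-ahead pattern written that way (the variable names and the sample string are made up, and the g flag moves to the match itself):

```perl
use strict;
use warnings;

my $re = qr{
    (?=                     # begin positive assertion
        \(                  # opening paren
        (                   # here we start
            (?:             # grouping only, no capture
                \\ \d\d\d   # octal digits escaped by a backslash
              | \\ .        # anything escaped by a backslash
              | [^\\]       # anything not being a backslash
            )*?             # match only till the first of them
        )                   # here we end
        \)                  # closing paren
    )                       # end assertion
}x;

# Made-up sample input.
my @got = '(abc) show (d\)e) show' =~ /$re/g;
print "$_\n" for @got;
```

One thing to watch with this style: the comment text must not contain the pattern's closing delimiter, which here is the closing brace.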

--
                                                        ___   ___
                                                      /  o  \   o \
Dov Grobgeld                                         ( o  o  ) o   |
The Weizmann Institute of Science, Israel             \  o  /o  o /
"Where the tree of wisdom carries oranges"              | |   | |



Thu, 27 Feb 1997 05:23:35 GMT  
: Wow! comments embedded in regular expressions.
:
: I take it that ?# is now a valid regular expression comment delimiter?

Well, (?#...) is.  The parens are required.  It would be too confusing
to allow /foo|?# a comment|bar/ to match /foo||bar/.

The (?xg) stuff works too.  And (?six) works, though (?sex) doesn't.

Larry



Fri, 28 Feb 1997 09:59:35 GMT  
 