My first regex.error: match failure 

This is spooky and happens both on 1.4b3 and 1.3.  
I have used the following regex to match doc strings but it blows up
when I try to work with the following large doc string. Is my
expression one of those pathological cases or what??? If anyone
has a better regex for doc strings btw I'd love to see it.

Python 1.4b3 (Sep 16 1996)  [GCC 2.7.2]
Copyright 1991-1996 Stichting Mathematisch Centrum, Amsterdam

>>> import regex
>>> s = open('funny').read()
>>> len(s)
2746
>>> doc_string_re = regex.compile('\("""\([^"]\|"[^"]\|""[^"]\)*"""\)')
>>> doc_string_re.search(s)

Traceback (innermost last):
  File "<stdin>", line 1, in ?
regex.error: match failure
>>> doc_string_re.search(s[:2700])

Traceback (innermost last):
  File "<stdin>", line 1, in ?
regex.error: match failure
>>> doc_string_re.search(s[:2000])

-1

See if you can reproduce this. The following is the text from the
funny file:
----snip-----
#################
class Document:
    """Primary container class for an HTML document.

    Single optional string argument for the path to a resource file
    used to specify document parameters. This helps minimize the need
    for subclassing from this class. Keyword parameters may be used
    for any of the following class attributes. See *HTMLtest.py* for
    example usage.

    Class instance attributes and keyword arguments

        base -- object of the Base class
        meta -- object of the Meta class
        cgi  -- if non zero will issue a mime type of text/html
        logo -- ('filename', width, height)  All images are specified
                 with a tuple of string, int, int. If the size of the
                 graphic is unknown, use 0, 0.  This one is the little
                 graphic on the footer of each page.
        banner -- ('filename', width, height) Banner graphic at
                 the top of page.
        title --  string to be used as the document title.
        subtitle -- string to be used as the document subtitle.
                 If non-nil, this string will be used for the doc title
                 instead of title.
        author -- String used in the copyright notice
        email -- Email address for feedback mailto: tag
        zone -- string used to label the time zone if datetime
                 is used. By default not used.
        bgcolor -- Color string (can use variables from
                 HTMLcolors.py)
        background -- string filename of a graphic used as the
                 doc background.
        textcolor -- Color string used for text.  (can use
                 variables from HTMLcolors.py)
        linkcolor -- Color string used for hyperlinked text.
        vlinkcolor -- Color string used for visited hypertext.
        alinkcolor -- Color string used for active hypertext.
        place_nav_buttons -- Flag to enable/disable the use of
                 navigation buttons.
                 Default is on. Set to 0 to disable.
        blank -- Image tuple for the transparent spacer gif
        prev -- Image tuple for the Previous Page button
        next -- Image tuple for the Next Page button
        top -- Image tuple for the Top of Manual button
        home -- Image tuple for the site Home Page button
        goprev -- URL string for the prev button
        gonext -- URL string for the next button
        gotop  -- URL string for the top button
        gohome -- URL string for the home button
        scripts -- a single or list of Script objects to be included in the <HEAD>
        onLoad -- Script, which is executed when the document is loaded
        onUnload -- Script, which is executed when the document is unloaded
    """



Tue, 16 Mar 1999 03:00:00 GMT  

   I have used the following regex to match doc strings but it blows up
   when I try to work with the following large doc string.

It failed for me also, however pregex worked just fine:

    import pregex
    s = open('funny').read()
    print len(s)
    doc_string_re = pregex.symcomp('(<str>"""([^"]|"[^"]|""[^"])*""")')
    result = doc_string_re.match(s, 2)
    print result.substring('str')

Running the above script resulted in the following elided output:

    2746
    """Primary container class for an HTML document.

    ...
            scripts -- a single or list of Script objects to be included in the <HEAD>
            onLoad -- Script, which is executed when the document is loaded
            onUnload -- Script, which is executed when the document is unloaded
        """

I realize Ka-Ping Yee has noted performance issues with at least one POSIX
regex implementation; however, pregex certainly appears more functionally
robust to me than the regex functions that are included with Python.

Skip Montanaro     |   Musi-Cal: http://concerts.calendar.com/

(518)372-5583      |   Python: http://www.python.org/



Wed, 17 Mar 1999 03:00:00 GMT  

> [robin]
> ...
> I have used the following regex to match doc strings
> [ '\("""\([^"]\|"[^"]\|""[^"]\)*"""\)' ]
> but it blows up when I try to work with the following large
> doc string.

Same here (PythonWin 1.3, Win95):  works iff the number of characters matched
by the  "*" expression doesn't get bigger than about 2048.  Here's the same
problem in a simpler setting:

import regex
test = regex.compile( 'y\(x\|z\)*y' )
for i in range(10000):
    try:
        start = test.search( 'y' + 'x' * i + 'y' )
    except regex.error:
        print 'first blowup with', i, 'xs'
        break

That first blows up when i == 2050.
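For comparison, here is a rough sketch of the same experiment against the `re` module that later replaced `regex`. It assumes a current Python 3 interpreter, and the helper name `first_failure` is made up for illustration. It only shows how the test translates; it says nothing about the old `regex` engine itself.

```python
import re

# Same pattern as above in `re` syntax: grouping and alternation are unescaped.
test = re.compile(r"y(x|z)*y")

def first_failure(limit=3000):
    """Return the smallest i for which the search fails or raises, else None."""
    for i in range(limit):
        try:
            if test.search("y" + "x" * i + "y") is None:
                return i
        except (re.error, RecursionError):
            # Treat an engine error the same way as a failed match.
            return i
    return None

print("first failure at:", first_failure())
```

The default limit is chosen only so the loop runs past the 2050 mark reported above.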

Worse, if in your original regexp I use the "translate" argument when
compiling, to get rid of the negated character classes in the regexp, for some
range of sizes it doesn't blow up but silently (& incorrectly) fails to match.

> ...
> If anyone has a better regex for doc strings btw I'd love
> to see it.

A problem with regexps is that they're forever being used for things they're
not good at.  In this case, a "proper" regexp for docstrings should handle
both forms of triple quotes, and realize that not every instance of a triple
quote "counts".  E.g., here's something mishandled by the regexp above:

def some_doc_func( s ):
    """Finds doc strings, that is triple-quoted strings of the
    form:

        \"""This is sample documentation.
        \"""

    at the start of a function
    """

That can be fixed too, but in the end a "proper" regexp for the problem is a
mess -- and Python's regex module will choke on it anyway.
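To make the mishandling concrete, here is a small sketch that runs the original pattern, rewritten in modern `re` syntax, over the function above. The rewrite into `re` syntax and the variable names are assumptions for illustration, not part of the original posts.

```python
import re

# The pattern from the original post, rewritten in `re` syntax.
doc_string_re = re.compile(r'"""([^"]|"[^"]|""[^"])*"""')

# The sample function from above; '\\' in this literal is one backslash.
source = '''def some_doc_func( s ):
    """Finds doc strings, that is triple-quoted strings of the
    form:

        \\"""This is sample documentation.
        \\"""

    at the start of a function
    """
'''

matched = doc_string_re.search(source).group(0)

# The match stops at the first backslash-escaped triple quote,
# so the rest of the docstring is lost.
print(matched)
```

The pattern has no notion of a backslash escape, so the first `"""` it meets after the opening one ends the match, escaped or not.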

But the whole problem is easy enough to solve using very simple regexps, if
you can brush off the cruel taunting of your one-liner peers <wink>:

import regex

not_backslash = "[^\\\\]"
triple_single = "'''"
triple_double = '"""'
_doc_start_re = regex.compile(
    "\(^\|" + not_backslash + "\)" # bol or not backslash
    + "\(" + triple_single + "\|" + triple_double + "\)" )
single_re = not_backslash + triple_single
double_re = not_backslash + triple_double
_triple_re = { triple_single : regex.compile(single_re),
               triple_double : regex.compile(double_re) }

del regex, not_backslash, triple_single, triple_double, \
    single_re, double_re

# return (b,e) s.t. s[b:e] is the leftmost triple-quoted
# string in s (including the quotes), or None if s
# doesn't contain a triple-quoted string
def find_docstring( s ):
    if _doc_start_re.search( s ) < 0:
        return None
    startquote, endquote = _doc_start_re.regs[2]
    quotestring = s[startquote:endquote] # """ or '''
    quotefinder = _triple_re[quotestring]
    if quotefinder.search( s, endquote - 1 ) < 0:
        return None
    return startquote, quotefinder.regs[0][1]

That won't bust anyone's (human's or search engine's <wink>) brain, and works
fine & fast with huge docstrings (largest I tried was over 3.5Mb).
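For readers without the old `regex` module, the same two-step approach might look roughly like this with `re`. This is a sketch that assumes Python 3; the match-object calls replace the old `.regs` attribute, and the upper-case names are illustrative rather than from the original.

```python
import re

# Opening delimiter: a triple quote at the start of the text or after
# any character other than a backslash. Group 1 is the quote itself.
_DOC_START_RE = re.compile(r"(?:^|[^\\])('''|\"\"\")")

# Closing delimiter: the same kind of triple quote, not preceded by a backslash.
_TRIPLE_RE = {
    "'''": re.compile(r"[^\\]'''"),
    '"""': re.compile(r'[^\\]"""'),
}

def find_docstring(s):
    """Return (b, e) such that s[b:e] is the leftmost triple-quoted string
    in s (including the quotes), or None if there is none."""
    start = _DOC_START_RE.search(s)
    if start is None:
        return None
    startquote, endquote = start.span(1)
    # Resume one character early so an empty string such as """""" is found.
    closer = _TRIPLE_RE[start.group(1)].search(s, endquote - 1)
    if closer is None:
        return None
    return startquote, closer.end()

sample = 'def f():\n    """Doc with \\""" inside."""\n'
b, e = find_docstring(sample)
print(sample[b:e])
```

As in the version above, each step is a trivial pattern; the second search simply starts from where the first one left off.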

The general idea (make *some* progress, then start over from there) is
generally useful for avoiding messy regexps, and especially when wrestling
with regexp packages (like Python's) that don't support a "shortest match"
option.
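As an aside, the later `re` module does have a "shortest match" form (`*?`). Here is a minimal sketch of what it buys you, assuming Python 3 and a made-up input string. Note that it still ignores backslash-escaped quotes, so it is not a replacement for the function above.

```python
import re

# `*?` is the non-greedy form of `*`; DOTALL lets `.` match newlines too.
shortest_re = re.compile(r'"""(.*?)"""', re.DOTALL)
greedy_re = re.compile(r'"""(.*)"""', re.DOTALL)

text = 'x = 1\n"""first"""\ny = 2\n"""second"""\n'

# Non-greedy stops at the nearest closing quotes.
print(repr(shortest_re.search(text).group(1)))
# Greedy runs on to the last closing quotes in the text.
print(repr(greedy_re.search(text).group(1)))
```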

sometimes-easier-to-solve-a-whole-problem-simply-
    than-part-of-it-convolutedly-ly y'rs  - tim


not speaking for Dragon Systems Inc.



Wed, 17 Mar 1999 03:00:00 GMT  
 