re findall mod for issue of side effects 

I've changed a line, added a line, and added a 'grammaticGrouping' parameter
to the definition of RegexObj.findall (the parameter gets modified in the
user api, also).  This appears to change the behavior of findall to be
consistent with what I was desiring; that is, a way to specify grouping in a
regex pattern, without returning tuples in the findall result.

An example:

>>> s='..abcabcxyz..'

# Try a simple pattern
>>> r=re.compile('abcxyz')
>>> r.findall(s)
['abcxyz']

# Now add some grouping to the pattern
>>> r=re.compile('(abc)*(xyz)*')
>>> r.findall(s)
[('', ''), ('', ''), ('abc', 'xyz'), ('', ''), ('', ''), ('', '')]
# Wow, that changed the return value dramatically

# Set the grammaticGrouping flag to 1, using the patched findall code
>>> r.findall(s,1)
['abcabcxyz']
# That's consistent with the result type from the first pattern.

Does anybody else see this to be as useful as I do?

AH

####### patched code follows  ##########
    def findall(self, source, grammaticGrouping=0):    #new grammaticGrouping parameter
        """Return a list of all non-overlapping matches in the string.

        If one or more groups are present in the pattern and
        grammaticGrouping is false, return a
        list of groups; this will be a list of tuples if the pattern
        has more than one group.

        Empty matches are included in the result.

        """
        pos = 0
        end = len(source)
        results = []
        match = self.code.match
        append = results.append
        while pos <= end:
            regs = match(source, pos, end, 0)
            if not regs:
                break
            i, j = regs[0]
            rest = regs[1:]
            if not rest or grammaticGrouping:   #new: changed from 'if not rest:'
                gr = source[i:j]
            elif len(rest) == 1:
                a, b = rest[0]
                gr = source[a:b]
            else:
                gr = []
                for (a, b) in rest:
                    gr.append(source[a:b])
                gr = tuple(gr)
            #was: append(gr)
            if gr or not grammaticGrouping:     #new: added this line
                append(gr)                      #new: indented
            pos = max(j, pos+1)
        return results
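
For comparison, the effect of the grammaticGrouping flag can be approximated without patching the module. This is only a minimal sketch, assuming a Python release that provides re.finditer; the helper name findall_whole is hypothetical and is not part of the re module.

```python
import re

def findall_whole(pattern, source):
    """Hypothetical helper: return the full text of every non-empty match,
    ignoring any capturing groups in the pattern."""
    return [m.group(0) for m in re.finditer(pattern, source) if m.group(0)]

s = '..abcabcxyz..'
whole = findall_whole(r'(abc)*(xyz)*', s)
print(whole)  # ['abcabcxyz']
```

Like the patch with the flag set, the sketch drops empty matches and returns plain strings rather than tuples.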



Fri, 04 Jul 2003 12:17:02 GMT  
[Andrew Henshaw]

Quote:
> I've changed a line, added a line, and added a 'grammaticGrouping'
> parameter to the definition of RegexObj.findall (the parameter gets
> modified in the user api, also).  This appears to change the
> behavior of findall to be consistent with what I was desiring; that
> is, a way to specify grouping in a regex pattern, without returning
> tuples in the findall result.

Except that's what non-capturing parens are for, in the context of
re.findall() and *everywhere else*.  Adding a unique wart to findall() is
probably a poor idea unless it's astonishingly useful.

Quote:
> An example:

> >>> s='..abcabcxyz..'

> # Try a simple pattern
> >>> r=re.compile('abcxyz')
> >>> r.findall(s)
> ['abcxyz']

> # Now add some grouping to the pattern
> >>> r=re.compile('(abc)*(xyz)*')
> >>> r.findall(s)
> [('', ''), ('', ''), ('abc', 'xyz'), ('', ''), ('', ''), ('', '')]
> # Wow, that changed the return value dramatically

Sure did.  But your pattern after "grouping" is also radically different in
another way:  it can match an empty string, where your original pattern
could not.  And a pattern that can match nothing everywhere is a very
strange pattern for findall() (what is it you're trying to find then?  a
bunch of nothings?  that's what you're *telling* it to find).  You do
strange things, you get strange results.

A pattern matching what it appears you *intended* to search for here:

>>> r = re.compile('(?:abc)+(?:xyz)*|(?:xyz)+')
>>> r.findall(s)
['abcabcxyz']

That is, don't hand findall a pattern that matches empty strings, and it
won't return empty matches.
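
The difference between capturing and non-capturing parens in findall() can be seen side by side. This is a small illustrative sketch; the variable names are chosen here for the example only.

```python
import re

s = '..abcabcxyz..'

# Capturing groups: findall returns the captured groups, one tuple per match.
# A repeated group such as (abc)+ keeps only its last repetition.
capturing = re.findall(r'(abc)+(xyz)*', s)

# Non-capturing groups: findall returns the whole matched text.
non_capturing = re.findall(r'(?:abc)+(?:xyz)*', s)

print(capturing)      # [('abc', 'xyz')]
print(non_capturing)  # ['abcabcxyz']
```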

Quote:
> ...
> Does anybody else see this [grammaticGrouping] to be as useful
> as I do?

Sorry, I don't:  I see it as misusing findall(), and then adding a wart to
cover that up.  But then I'm always generous in my assessments <wink>.

More generally useful would be a new flag on regexp compilation meaning "all
my parens are non-capturing".  Then that part of it could be enjoyed by all
uses of regexps, not just findall.  I don't see a need for that, but it
wouldn't be particularly damaging.

If you're going to ask findall() to match empty strings, though, filter 'em
out yourself.
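
Filtering the empty matches out by hand takes one line. A minimal sketch, assuming the pattern uses only non-capturing groups so that findall returns plain strings:

```python
import re

s = '..abcabcxyz..'

# This pattern can match the empty string, so findall reports empty matches too.
matches = re.findall(r'(?:abc)*(?:xyz)*', s)

# Keep only the matches that actually consumed some text.
non_empty = [m for m in matches if m]
print(non_empty)  # ['abcabcxyz']
```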

cruel-but-fair-ly y'rs  - tim



Fri, 04 Jul 2003 16:07:43 GMT  


Quote:
> Except that's what non-capturing parens are for, in the context of
> re.findall() and *everywhere else*.  Adding a unique wart to findall() is
> probably a poor idea unless it's astonishingly useful.

...snip...

Quote:

> Sure did.  But your pattern after "grouping" is also radically different in
> another way:  it can match an empty string, where your original pattern
> could not.  And a pattern that can match nothing everywhere is a very
> strange pattern for findall() (what is it you're trying to find then?  a
> bunch of nothings?  that's what you're *telling* it to find).  You do
> strange things, you get strange results.

> A pattern matching what it appears you *intended* to search for here:

> >>> r = re.compile('(?:abc)+(?:xyz)*|(?:xyz)+')
> >>> r.findall(s)
> ['abcabcxyz']

> That is, don't hand findall a pattern that matches empty strings, and it
> won't return empty matches.

Introducing empty matches was by mistake.  I should have left the pattern at
(abc)+xyz

There is a problem with (?:) that I brought up in 'Using re -side effects or
misunderstanding'.

...snip...

Quote:
> > Does anybody else see this [grammaticGrouping] to be as useful
> > as I do?

> Sorry, I don't:  I see it as misusing findall(), and then adding a wart to
> cover that up.  But then I'm always generous in my assessments <wink>.

> More generally useful would be a new flag on regexp compilation meaning "all
> my parens are non-capturing".  Then that part of it could be enjoyed by all
> uses of regexps, not just findall.  I don't see a need for that, but it
> wouldn't be particularly damaging.

This is what I had suggested (well, maybe not, see below) in yesterday's
take on this subject (see:
'Using re -side effects or misunderstanding') and would be my preferred
design.  I looked at the code and became worried that the effect of that
flag would have deep consequences that I wasn't going to foresee in my quick
examination.  Therefore, I thought I'd limit the 'wart', as you call it, to
the area that I was particularly interested in, for demonstration purposes.

As to your suggestion (a new flag on regexp compilation meaning "all my
parens are non-capturing"), I'd still like to retain the ability to use the
non-capturing flag to exclude portions from the return string.  This may be
what you're stating, but I'd like the flag to indicate that parens are for
grammatical grouping - they do not force a tuple return.

Thus,
s='..abcxyz..'
r=re.compile('(ab)+(?:c)(xyz)+')
r.findall(s)

would return

['abxyz']

(I should put a fractional wink in here about null strings)
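
With the stock findall() the same result can be had by joining the captured groups afterwards. This is a sketch of one workaround, not the proposed flag itself; the names groups and joined are illustrative only.

```python
import re

s = '..abcxyz..'

# Two capturing groups, so findall returns one tuple per match;
# the (?:c) part is matched but not captured.
groups = re.findall(r'(ab)+(?:c)(xyz)+', s)

# Concatenate the captured pieces of each match into a single string.
joined = [''.join(g) for g in groups]

print(groups)  # [('ab', 'xyz')]
print(joined)  # ['abxyz']
```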

Quote:

> If you're going to ask findall() to match empty strings, though, filter 'em
> out yourself.

Yes, I agree.  That bit of code shouldn't be in there.  I realized that late
last night, when I was playing with the patch.  Also, the patch is flawed in
that it doesn't handle the '(?:)' type of parens correctly.  The problems
one generates when one tries to rush a 'product' out the door.

Quote:

> cruel-but-fair-ly y'rs  - tim

Not cruel at all.

AH



Fri, 04 Jul 2003 21:11:43 GMT  
On Mon, 15 Jan 2001 08:11:43 -0500,

Quote:
>As to your suggestion (a new flag on regexp compilation meaning "all my
>parens are non-capturing"), I'd still like to retain the ability to use the
>non-capturing flag to exclude portions from the return string.  This may be

This problem arises because .findall() only returns the string
corresponding to all of a match.  Someone else has suggested a
.findall() variant which returned the actual match objects, so then
you could loop over them all and construct whatever string you like.
It seems potentially cleaner to invent a name or interface for such a
variant and get it accepted.  (Hurry!  Maybe it can still get into
2.1!)
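
Later Python releases provide such a variant as finditer(), which yields match objects. A minimal sketch of looping over them and building a custom string follows; the output format used here is just an assumption for illustration.

```python
import re

s = '..abcxyz..'
pattern = re.compile(r'(ab)+(?:c)(xyz)+')

for m in pattern.finditer(s):
    # m.group(0) is the whole match; m.group(1) and m.group(2) are the captures.
    custom = m.group(1).upper() + '-' + m.group(2)
    print(m.span(), m.group(0), custom)

# Collect the position and full text of each match.
results = [(m.span(), m.group(0)) for m in pattern.finditer(s)]
print(results)  # [((2, 8), 'abcxyz')]
```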

--amk



Sat, 05 Jul 2003 12:21:34 GMT  
 