How to do this neat thing in Python? 

Hello everyone. I have a question which may (or perhaps not?) be an
interesting challenge for some of the python experts out there. It
involves how to nicely abstract out a commonly used loop in my programs.

Basically, I do a lot of processing on files (sometimes large files)
which consist of a large number of records. I use Python regexes to
extract out the fields from these records, and process each record in
turn. So, what I usually do is something like

text = <code to read in all of the file as a single string>
matcher = regex.symcomp(...my pattern...)
nextpos = matcher.match(text)

while nextpos > -1:
        <process the data obtained from the match>
        nextpos = matcher.match(text, nextpos)

While this code certainly isn't terribly complex, it is a little
error-prone; there are things that don't really need to be exposed, such
as the explicit use of indexes into the string and so forth. Now it
would be easy enough to come up with a class, say RecordMatcher, that
would handle a lot of stuff transparently, so I could just say something
like this:

rmatcher = RecordMatcher(...my pattern..., text)
while rmatcher:
        data = rmatcher.data
        <process the data obtained from the match>

and that is certainly nicer. But, what I'd really like to do is to be
able to treat rmatcher as a list, and say something like

rmatcher = RecordMatcher(...my pattern..., text)
for record in rmatcher:
        <process that record>

The problem is that, from what I know of Python, I can't emulate a
sequence type efficiently. For example, I can't define a __len__ method
in the RecordMatcher class, without doing all of the matching first,
something I'd prefer to avoid because that implies I have to retain all
of the "found" records in memory. I just want to do a single pass over
the data, examining records one at a time.

In fact, I'd really like to figure out a way to do this without reading
in the entire file in the first place, but that's something for future
thought...

Thanks,
Ken McDonald

========================================
Kenneth McDonald
Genome Sequencing Center
Washington University School of Medicine

Phone: 314-286-1831
========================================



Mon, 24 Apr 2000 03:00:00 GMT  


Quote:
> and that is certainly nicer. But, what I'd really like to do is to be
> able to treat rmatcher as a list, and say something like

> rmatcher = RecordMatcher(...my pattern..., text)
> for record in rmatcher:
>    <process that record>

> The problem is that, from what I know of Python, I can't emulate a
> sequence type efficiently. For example, I can't define a __len__ method
> in the RecordMatcher class, without doing all of the matching first,
> something I'd prefer to avoid because that implies I have to retain all
> of the "found" records in memory. I just want to do a single pass over
> the data, examining records one at a time.

You don't need to emulate __len__ efficiently. For a 'for' loop to
work, all you need is a __getitem__ that returns things and raises an
IndexError when you run out. Sure, it won't be a real list, but it'll
work for this. Iterators (promised for a future release) will do this
in an even cleaner manner.
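
A minimal sketch of that suggestion, under some assumptions: it uses the standard `re` module in place of the older `regex` module from the original post, and the class layout, the sample pattern, and the sample text are all made up for illustration. The index passed to __getitem__ is ignored, so the object supports only a single forward pass, which is all the question asks for.

```python
import re


class RecordMatcher:
    """Hypothetical sketch: hands back one regex match per 'for' step."""

    def __init__(self, pattern, text):
        self.pattern = re.compile(pattern)
        self.text = text
        self.pos = 0  # where the next search starts

    def __getitem__(self, index):
        # 'for' passes 0, 1, 2, ...; the index is ignored because this
        # object only supports a single forward pass over the text.
        if self.pos > len(self.text):
            raise IndexError(index)
        match = self.pattern.search(self.text, self.pos)
        if match is None:
            raise IndexError(index)  # tells 'for' to stop
        # Step past the match; move one extra character after an empty
        # match so the loop cannot stall.
        if match.end() > match.start():
            self.pos = match.end()
        else:
            self.pos = match.end() + 1
        return match.groupdict()


# Illustrative data only; named groups play the role of symbolic groups.
text = "name=ann age=31\nname=bob age=27\n"
records = []
for record in RecordMatcher(r"name=(?P<name>\w+) age=(?P<age>\d+)", text):
    records.append(record)
```

Only the current match is held at any time; nothing is counted or stored up front, so no __len__ is involved.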

Quote:
> In fact, I'd really like to figure out a way to do this without reading
> in the entire file in the first place, but that's something for future
> thought...

Once you have a class with an __getitem__ that hands back "the next
match", making it process only parts of the file is an obvious
exercise for the reader.

        <mike

--
Do NOT reply to the address in the From: header. Reply to mwm instead of
bouncenews at the same machine. Mail to bouncenews will be returned with
instructions on reaching me. Sending unsolicited email I consider commercial
gives me permission to subscribe you to a mail list of my choice.



Mon, 24 Apr 2000 03:00:00 GMT  

Quote:

> The problem is that, from what I know of Python, I can't emulate a
> sequence type efficiently. For example, I can't define a __len__ method
> in the RecordMatcher class, without doing all of the matching first,
> something I'd prefer to avoid because that implies I have to retain all
> of the "found" records in memory. I just want to do a single pass over
> the data, examining records one at a time.

You don't need to define __len__.
Long ago, after I coded up several ugly hacks that always returned
current-length+1 for length, Guido changed the 'for' loop protocol
to exit on an IndexError -- it does not use length at all:

The simplest iterator is infinite:

>>> class Iter:
...     def __getitem__( self, i ): return i
...
>>> iter = Iter()
>>> iter[3]
3
>>> for n in Iter():
...     if n > 10 : break
...     print n
...
0
1
2
3
4
5
6
7
8
9
10
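
For contrast with the infinite case, here is a small finite sketch (the Countdown class is hypothetical, not from any library). It raises IndexError to end the loop and defines no __len__, so 'for' works while len() does not. This fallback on __getitem__ still works in current Python versions.

```python
class Countdown:
    """Hypothetical finite 'sequence' with __getitem__ but no __len__."""

    def __init__(self, start):
        self.start = start

    def __getitem__(self, i):
        if i > self.start:
            raise IndexError(i)  # ends the 'for' loop
        return self.start - i


# Iteration works: __getitem__ is called with 0, 1, 2, ... until IndexError.
values = [n for n in Countdown(3)]

# len() is never consulted by 'for', and here it is not even available.
try:
    len(Countdown(3))
    has_len = True
except TypeError:
    has_len = False
```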

Quote:

> In fact, I'd really like to figure out a way to do this without reading
> in the entire file in the first place, but that's something for future
> thought...

This was, in fact, one of the reasons, and one of the first uses
for this trick. I was working on Macs and PCs where I couldn't just
suck up 20MB of file into memory.

There are examples in several places: in Mark Lutz's book certainly,
and in the comp.lang.python mailing list archives, I'm sure.

Basically: just read in each line as you need it, and raise IndexError
at end of file.
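
A rough sketch of that idea, assuming one record per line. The LineReader name is invented for illustration, and io.StringIO stands in for a real file so the example is self-contained.

```python
import io


class LineReader:
    """Hypothetical wrapper: one line per 'for' step, never the whole file."""

    def __init__(self, fileobj):
        self.fileobj = fileobj

    def __getitem__(self, index):
        # The index is ignored; only a single forward pass is supported.
        line = self.fileobj.readline()
        if line == "":  # readline() returns an empty string at end of file
            raise IndexError(index)
        return line.rstrip("\n")


# io.StringIO stands in for a real file object returned by open().
fake_file = io.StringIO("first record\nsecond record\nthird record\n")
lines = [line for line in LineReader(fake_file)]
```

Only one line is read per step, so memory use does not grow with file size. In later Python versions file objects can be looped over directly, so a wrapper like this is mainly useful as an illustration of the protocol.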


---|  Department of Molecular Physiology and Biological Physics  |---
---|  University of Virginia             Health Sciences Center  |---
---|  P.O. Box 10011            Charlottesville, VA  22906-0011  |---
"All power corrupts and obsolete power corrupts obsoletely." - Ted Nelson



Mon, 24 Apr 2000 03:00:00 GMT  

Quote:
>>> Ken Mcdonald wrote
> The problem is that, from what I know of Python, I can't emulate a
> sequence type efficiently. For example, I can't define a __len__ method
> in the RecordMatcher class, without doing all of the matching first,

For 'for i in obj', you don't need to define a __len__ method. You
simply define a __getitem__ method and raise IndexError when you've
finished.

Anthony



Tue, 25 Apr 2000 03:00:00 GMT  
 