How to do this neat thing in Python?
Hello everyone. I have a question that may (or may not) be an
interesting challenge for some of the Python experts out there. It
involves how to nicely abstract out a loop I use often in my programs.
Basically, I do a lot of processing on files (sometimes large files)
that consist of a large number of records. I use Python regexes to
extract the fields from these records, and process each record in
turn. So, what I usually do is something like:
    text = <code to read in all of the file as a single string>
    matcher = regex.symcomp(...my pattern...)
    nextpos = matcher.match(text)
    while nextpos > -1:
        <process the data obtained from the match>
        nextpos = matcher.match(text, nextpos)
While this code certainly isn't terribly complex, it is a little
error-prone; there are things that don't really need to be exposed, such
as the explicit use of indexes into the string and so forth. Now it
would be easy enough to come up with a class, say RecordMatcher, that
would handle a lot of stuff transparently, so I could just say something
like this:
    rmatcher = RecordMatcher(...my pattern..., text)
    while rmatcher:
        data = rmatcher.data
        <process the data obtained from the match>
and that is certainly nicer. But, what I'd really like to do is to be
able to treat rmatcher as a list, and say something like
    rmatcher = RecordMatcher(...my pattern..., text)
    for record in rmatcher:
        <process that record>
The problem is that, from what I know of Python, I can't emulate a
sequence type efficiently. For example, I can't define a __len__ method
in the RecordMatcher class, without doing all of the matching first,
something I'd prefer to avoid because that implies I have to retain all
of the "found" records in memory. I just want to do a single pass over
the data, examining records one at a time.
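One way this could be sketched in current Python: an object only needs an
__iter__ method to be usable in a for loop, so no __len__ and no stored
list of records is required. The sketch below assumes the modern re module
rather than the regex module used above, and the "name=value;" record
format and sample string are made up for illustration. Like the loop
above, it matches at successive positions and stops at the first position
where the pattern fails.

```python
import re


class RecordMatcher:
    """Lazily iterate over consecutive records in `text` matching `pattern`."""

    def __init__(self, pattern, text):
        self.regex = re.compile(pattern)
        self.text = text

    def __iter__(self):
        # Written as a generator: each record is produced on demand, so
        # only one record's fields are held at a time.
        pos = 0
        while True:
            m = self.regex.match(self.text, pos)
            # Stop on failure, or on an empty match (which would never advance).
            if m is None or m.end() == pos:
                break
            yield m.groupdict()
            pos = m.end()


# Hypothetical record format, for illustration only.
sample = "a=1;b=2;c=3;"
records = []
for record in RecordMatcher(r"(?P<name>\w+)=(?P<value>\d+);", sample):
    records.append((record["name"], record["value"]))
```

Named groups stand in here for the symbolic groups that regex.symcomp
provided; each record arrives as a dictionary keyed by group name.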
In fact, I'd really like to figure out a way to do this without reading
in the entire file in the first place, but that's something for future
thought...
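For the whole-file problem, one possible approach is to memory-map the
file and run the regex over the map, so the operating system pages data in
as needed instead of the program reading everything up front. This is only
a sketch under assumptions: it uses the re and mmap modules, iter_records
is a hypothetical helper name, the pattern has to be a bytes pattern, and
the file must not be empty (mmap rejects empty files). The record format
and temporary file are made up for the demonstration.

```python
import mmap
import os
import re
import tempfile


def iter_records(path, pattern):
    """Yield the named groups of each consecutive match of a bytes pattern."""
    regex = re.compile(pattern)
    # Assumption: the file is non-empty; mmap raises ValueError otherwise.
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        pos = 0
        while True:
            m = regex.match(mm, pos)
            # Stop on failure, or on an empty match (which would never advance).
            if m is None or m.end() == pos:
                break
            yield m.groupdict()  # values are bytes, not str
            pos = m.end()


# Demonstration on a throwaway file with a hypothetical "name=value;" format.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "records.txt")
    with open(path, "wb") as out:
        out.write(b"a=1;b=2;c=3;")
    found = list(iter_records(path, rb"(?P<name>\w+)=(?P<value>\d+);"))
```

The field values come back as bytes, so they would need decoding before
being treated as text.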
Thanks,
Ken McDonald
========================================
Kenneth McDonald
Genome Sequencing Center
Washington University School of Medicine
Phone: 314-286-1831
========================================