search/replace in long files 
 search/replace in long files

I've been playing with the search/replace facilities in the re module,
and I've run up against an interesting problem.  The thing is, when
replacing with a regular expression, you don't necessarily know a
priori what the boundaries of a match are going to be.  One thing you
can do is load the _entire_ file and do the search and replace on the
whole big file.  But that's unacceptable for really big files.

The normal thing is to load the file up in chunks, like lines, and do
the searches on the lines.  But that makes it impossible to capture
regular expressions that cross line boundaries.  So, if you can't
load up the whole file, and you don't have any particular useful
boundaries which you know the regular expression is not going to
cross, what do you do?

y

--
|--------/            Yaron M. Minsky              \--------|
|--------\ http://www.cs.cornell.edu/home/yminsky/ /--------|



Tue, 21 Nov 2000 03:00:00 GMT  
 search/replace in long files

Quote:
Yaron M. Minsky writes:
>The normal thing is to load the file up in chunks, like lines, and do
>the searches on the lines.  But that makes it impossible to capture
>regular expressions that cross line boundaries.   So, if you can't
>load up the whole file, and you don't have any particular useful
>boundaries which you know that the regular expression is not going to
>cross, what do you do?  

        I don't know; you'd probably have to know something more about
the regular expressions being used.  For example, if the expression
would only match across a single newline, then you could keep the last
two lines in memory, and match against them both.  Or, if you can
divide the input into paragraphs or some other smaller unit, read and
process a single paragraph at a time.  Organized well, this
needn't make the code too obscure; you could write a class which read
from a file object and returned whole paragraphs, and then do the
original processing on a per-paragraph basis.
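
Such a wrapper might look roughly like this (a sketch only; the class
name and the blank-line paragraph convention are assumptions, not
anything from the standard library):

import re

class ParagraphReader:
    """Wrap a file object and hand back one blank-line-delimited paragraph at a time."""

    def __init__(self, fileobj):
        self.fileobj = fileobj

    def readparagraph(self):
        """Return the next paragraph as one string, or '' at end of file."""
        lines = []
        while 1:
            line = self.fileobj.readline()
            if not line:                  # end of file
                break
            if not line.strip():          # blank line: paragraph boundary
                if lines:
                    break
                continue                  # skip leading blank lines
            lines.append(line)
        return "".join(lines)

# Per-paragraph processing; the pattern may now span newlines inside a paragraph.
pattern = re.compile(r"foo\s+bar")
reader = ParagraphReader(open("input.txt"))
while 1:
    para = reader.readparagraph()
    if not para:
        break
    print(pattern.sub("baz", para))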

--
A.M. Kuchling                   http://starship.skyport.net/crew/amk/
The brain thinks not by adding two and two to make four, but like a sheet of
wet paper on which drops of watercolour paints are being splashed, merging
into unforeseen configurations.
    -- Guy Claxton, _Hare Brain, Tortoise Mind_



Tue, 21 Nov 2000 03:00:00 GMT  
 search/replace in long files

Quote:

>    I don't know; you'd probably have to know something more about
> the regular expressions being used.  For example, if the expression
> would only match across a single newline, then you could keep the last
> two lines in memory, and match against them both.  Or if you can
> divide the output into paragraphs, or some other smaller unit, and
> read and process a single paragraph at a time.  Organized well, this
> needn't make the code too obscure; you could write a class which read
> from a file object and returned whole paragraphs, and then do the
> original processing on a per-paragraph basis.

While this makes sense, there should be some kind of general-purpose
solution.  I.e., you should be able to design a regexp to operate on a
stream, and it simply keeps as much of the stream in memory as is
necessary to hold the entire match.  You'd probably want to be able
to specify an upper bound on how much memory it should look at.  But
the point is that you don't necessarily know ahead of time how long a
match will end up being.  Certainly, when I'm using Emacs, for example,
I may look for regular expressions that capture any number of lines.

y

--
|--------/            Yaron M. Minsky              \--------|
|--------\ http://www.cs.cornell.edu/home/yminsky/ /--------|



Tue, 21 Nov 2000 03:00:00 GMT  
 search/replace in long files

Quote:


> >       I don't know; you'd probably have to know something more about
> > the regular expressions being used.  For example, if the expression
> > would only match across a single newline, then you could keep the last
> > two lines in memory, and match against them both.  Or if you can
> > divide the output into paragraphs, or some other smaller unit, and
> > read and process a single paragraph at a time.  Organized well, this
> > needn't make the code too obscure; you could write a class which read
> > from a file object and returned whole paragraphs, and then do the
> > original processing on a per-paragraph basis.

> While this makes sense, there should be some kind of general-purpose
> solution.  I.e., you should be able to design a regexp to operate on a
> stream, and it simply keeps as much of the stream in memory as is
> necessary to hold the entire pattern.  You'd probably want to be able
> to specify an upper bound on how much memory it should look at.  But
> the point is that you don't necessarily know how long a regexp will
> end up ahead of time.  Certainly, when I'm using emacs, for example, I
> may look for regular expressions that capture any number of lines.

I guess the re engine would have to make use of memory-mapped files
to get this working... not worth the effort if you ask me: pipe
to a grep of your choice and let it do the job for you (they are
designed to work on files, which pcre is not).  An alternative
would be wrapping a grep library with SWIG as an extension module
for Python.

--
Marc-Andre Lemburg
----------------------------------------------------------------------
             | python Pages:  http://starship.skyport.net/~lemburg/  |
              -------------------------------------------------------



Wed, 22 Nov 2000 03:00:00 GMT  
 search/replace in long files

Quote:

>> While this makes sense, there should be some kind of general-purpose
>> solution.  I.e., you should be able to design a regexp to operate on a

Moreover, my gut rumbles that this is an insoluble problem, or
rather, given any regex which contains even a minimum-spanning branch
which involves a '*' or '+' symbol, you may have an infinitely long
matching length.  The only way to know where the match will terminate
would be to apply the pattern to the datastream in question; any other
solution will be the rough equivalent thereof.

A more realistic approach may be to simply decide to blow the memory
overhead to grab the whole stream up, or, failing that, to decide on a
true 'maximum spanning length' and use a sliding window onto the file
to do the regex on.
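
A rough sketch of that sliding-window idea, assuming a chosen maximum
spanning length (the function name, the overlap bookkeeping, and the
sample pattern are all invented for illustration; matches near the
window edge can still be cut short or occasionally reported twice):

import re

def window_finditer(fileobj, pattern, max_span, chunk_size=1 << 20):
    """Report matches of a compiled bytes pattern while holding at most about
    chunk_size + max_span bytes in memory.  Matches longer than max_span may
    be cut short -- that is the price of choosing a spanning length."""
    tail = b""
    offset = 0
    while 1:
        chunk = fileobj.read(chunk_size)
        buf = tail + chunk
        if not chunk:
            # Last pass: nothing more is coming, so report whatever is left.
            for m in pattern.finditer(buf):
                yield offset + m.start(), m.group()
            return
        commit = max(len(buf) - max_span, 0)
        # Only report matches starting in the committed region; matches that
        # start inside the overlap are deferred to the next pass so they
        # aren't reported twice.
        for m in pattern.finditer(buf):
            if m.start() < commit:
                yield offset + m.start(), m.group()
        tail = buf[commit:]
        offset += commit

f = open("bigpig", "rb")
for pos, text in window_finditer(f, re.compile(rb"foo.*?\nbar"), max_span=4096):
    print(pos, text)
f.close()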

--



Wed, 22 Nov 2000 03:00:00 GMT  
 search/replace in long files

Quote:


> >> While this makes sense, there should be some kind of general-purpose
> >> solution.  I.e., you should be able to design a regexp to operate on a

> Moreover, my gut rumbles that this is an insoluble problem, or
> rather, given any regex which contains even a minimum-spanning branch
> which involves a '*' or '+' symbol, you may have an infinitely long
> matching length.  The only way to know where the match will terminate
> would be to apply the pattern to the datastream in question, any other
> solution will be the rough equivalent thereof.

> A more realistic approach may be to simply decide to blow the memory
> overhead to grab the whole stream up, or, failing that, to decide on a
> true 'maximum spanning length' and use a sliding window onto the file
> to do the regex on.

Well, actually, that's what I mean by a general purpose solution.
It's not a big deal, really -- just load up as much as you need, and
for most regular expressions, most of the time, that's not going to be
very much.  

Does anyone know how perl handles this issue?

y

--
|--------/            Yaron M. Minsky              \--------|
|--------\ http://www.cs.cornell.edu/home/yminsky/ /--------|



Wed, 22 Nov 2000 03:00:00 GMT  
 search/replace in long files

[Yaron M. Minsky]

Quote:
> [about what to do with regexps that may span N lines, where
>  N > 1 probably but who knows <wink>]

> Well, actually, that's what I mean by a general purpose solution.
> It's not a big deal, really -- just load up as much as you need,
> and for most regular expressions, most of the time, that's not
> going to be very much.

It has less to do with the regular expression than with the expected data.
E.g.,

    ^([^Y]*)Yaron

may usually span a handful of characters, or unbounded gigabytes, depending
on *whose* mailboxes we're usually searching <wink>.

Quote:
> Does anyone know how perl handles this issue?

Same as Python -- slurp the whole file into a string, and let 'er rip.  I
agree with whoever suggested that memory-mapped files are the way to go if
this is a problem that needs to be solved.

or-if-nobody-suggested-that-yet-i-agree-with-myself-ly y'rs  - tim



Wed, 22 Nov 2000 03:00:00 GMT  
 search/replace in long files

Quote:

> It has less to do with the regular expression than with the expected data.
> E.g.,

>     ^([^Y]*)Yaron

> may usually span a handful of characters, or unbounded gigabytes, depending
> on *whose* mailboxes we're usually searching <wink>.

Fully understood.  That's why if you did use a dynamic window size,
you'd probably want to put an upper limit on the buffer size, and have
the regexp code throw an exception when that size is passed.

Quote:
> I agree with whoever suggested that memory-mapped files are the way
> to go if this is a problem that needs to be solved.

Does python 1.5.1 come with memory-mapped file support?  I looked
around the web, but all I found was some NT support, and some
fledgling Unix support.

By the way, I might as well mention what my, rather simple,
application is.  All I've done is write a script that runs through a
Netscape binary and remaps some of the language bindings so I can get
support for hebrew.  (yeah, yeah, I know, I can modify the source if I
really want to be useful, but I've got a thesis to write.)

Anyway, I have a simple search-and-replace job to do on a big binary
file (about 10M).  So, my first attempt was a sed script.  It worked
just fine, but it was unspeakably slow.  My next attempt was a Python
script, which read in the file line-by-line.  This worked fine, but it
was just as slow as sed.  Then, I tried reading the file in 1Meg chunks.
This worked great.  It was very fast in comparison to the
other approaches -- the only problem is that it ran the risk of
cutting the pattern I was looking for at odd places.

And I just tried the simplest approach -- just read the whole file in.
It turns out to be in between the two approaches in efficiency -- not
nearly as slow as the original sed version, but significantly (about
6x) slower than the megabyte-at-a-time approach.

Of course, the best solution in this case would be to read in a
megabyte, and then read to the end of the next line.  But the reason I
started this thread is that it seemed to me like there should be a
general solution.  Maybe memory-mapped files are it.
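
That megabyte-plus-rest-of-line scheme takes only a few lines; a
sketch (the file names and pattern are invented, and it assumes the
pattern never crosses a newline):

import re

def line_aligned_chunks(fileobj, chunk_size=1 << 20):
    """Read roughly chunk_size bytes at a time, then extend each chunk to the
    next newline so no line is ever split across two chunks."""
    while 1:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        if not chunk.endswith(b"\n"):
            chunk = chunk + fileobj.readline()   # finish the partial last line
        yield chunk

pattern = re.compile(rb"old-bytes")
src = open("netscape.bin", "rb")
dst = open("netscape.patched", "wb")
for chunk in line_aligned_chunks(src):
    dst.write(pattern.sub(rb"new-bytes", chunk))
src.close()
dst.close()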

y

--
|--------/            Yaron M. Minsky              \--------|
|--------\ http://www.cs.cornell.edu/home/yminsky/ /--------|



Wed, 22 Nov 2000 03:00:00 GMT  
 search/replace in long files

[tim]

Quote:
> It has less to do with the regular expression than with the
> expected data. [etc]

[Yaron M. Minsky]

Quote:
> Fully understood.  That's why if you did use a dynamic window
> size, you'd probably want to put an upper limit on the buffer
> size, and have the regexp code throw an exception when that
> size is passed.

That makes it incomprehensible for "normal users" though, and, less
charitably, is pure hackery <0.6 wink>.  Library facilities suffer when they
grow obscure options of use to only one user in ten thousand.  Nothing to
stop someone from writing a wrapper around re, though.

Quote:
> Does python 1.5.1 come with memory-mapped file support?  I looked
> around the web, but all I found was some NT support, and some
> fledgling Unix support.

I haven't kept up with this.  Last I saw, Andrew K (I think!) was having fun
sorting out mem-map incompatibilities among Unices, and the state of the art
was still too confused to put into the core.  The attractions are (a) little
or no change to re code; (b) no need for users to muck with boundary hacks;
and, (c) mem-mapped files would be good for useful things too <wink>.

Quote:
> By the way, I might as well mention what my, rather simple,
> application is.  All I've done is write a script that runs through a
> Netscape binary and remaps some of the language bindings so I can get
> support for hebrew.  (yeah, yeah, I know, I can modify the source if I
> really want to be useful, but I've got a thesis to write.)

> Anyway, I have a simple search-and-replace job to do on a big binary
> file (about 10M).  So, my first attempt was a sed script.  It worked
> just fine, but it was unspeakably slow.  My next attempt was a python
> script, which read in the file line-by-line.  This worked fine, but it
> was just as slow as sed.  Then, I tried reading the file in 1Meg chunk
> at a time.  This worked great.  It was very fast in comparison to the
> other approaches -- the only problem is that it ran the possibility of
> cutting the pattern I was looking for at odd places.

> And I just tried the simplest approach -- just read the whole file in.
> It turns out to be in between the two approaches in efficiency -- not
> nearly as slow as the original sed version, but significantly (about
> 6x) slower than the megabyte-at-a-time approach.

The speed of stuff like this is extremely platform (machine + OS)
dependent -- even the cost of a lone readline slobbers all over the map
across platforms.  Wouldn't be surprised if your timing results were
reversed on a different platform!

Quote:
> Of course, the best solution in this case would be to read in a
> megabyte, and then read to the end of the next line.

Even better would be to write a little lex driver -- lex is optimized for
finding needles in line-oriented haystacks in ways that the Python flavor of
regexps aren't and aren't likely to become.

Quote:
> But the reason I started this thread is that it seemed to me like
> there should be a general solution.

But there already is:  reading the whole file as a single string is
thoroughly general.  What you really wanted was a faster solution, and in
fact were eager to give up generality to get it <0.9 wink>.  I don't think
the need arises often enough to care about, but, if it does, whoever decides
to solve it should beware of platform-specific pseudo-speedups.

Quote:
> Maybe memory-mapped files are it.

They're no more general in this context, but add the *possibility* for
greater speed.

Besides, as soon as you post your Hebrew Netscape, the universe of examples
will twinkle out of existence <wink>.

boundaries-limit-pain-as-well-as-opportunity-ly y'rs  - tim



Fri, 24 Nov 2000 03:00:00 GMT  
 search/replace in long files

Quote:
Tim Peters writes:
>[Yaron M. Minsky]
>> Does python 1.5.1 come with memory-mapped file support?  I looked
>> around the web, but all I found was some NT support, and some
>> fledgling Unix support.

>I haven't kept up with this.  Last I saw, Andrew K (I think!) was having fun
>sorting out mem-map incompatibilities among Unices, and the state of the art
>was still too confused to put into the core.  

        That sums things up pretty exactly; I need to go back and make
another release of it.  I'm not planning to propose including it in
the core, though GvR will do whatever he pleases.  

Quote:
>                                               The attractions are (a) little
>or no change to re code; (b) no need for users to muck with boundary hacks;
>and, (c) mem-mapped files would be good for useful things too <wink>.

        Problem: re.sub doesn't modify strings in-place because Python
strings are immutable, so it would try to return a (very large)
brand-new string with the substitutions.  I'm not sure what the fix
is; should there be a sub_in_place method or should re.sub magically
"know" somehow that an mmap'ed file is mutable, so the changes should
be made in place?  However, mmap objects already have a simple find()
method like string.find; I forget if they have a replace(), but one
could be added.
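
With the mmap module as it stands today, a find-and-patch-in-place can
already be done by hand; a minimal sketch, restricted to equal-length
replacements since a mapping can't change size (the function name and
file name are invented):

import mmap

def patch_in_place(path, old, new):
    """Overwrite every occurrence of old with new directly in the file,
    via the memory mapping; only works for equal-length replacements."""
    assert len(old) == len(new), "in-place patching needs equal lengths"
    f = open(path, "r+b")
    mm = mmap.mmap(f.fileno(), 0)            # map the whole file, read/write
    pos = mm.find(old)
    while pos != -1:
        mm[pos:pos + len(old)] = new         # writes straight through to the file
        pos = mm.find(old, pos + len(new))
    mm.flush()
    mm.close()
    f.close()

patch_in_place("netscape.bin", b"old-bytes", b"new-bytes")

That sidesteps the giant-new-string problem for simple cases, though it
doesn't answer the question of what re.sub itself should do with a
mutable buffer.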

Quote:
>> And I just tried the simplest approach -- just read the whole file in.
>> It turns out to be in between the two approaches in efficiency -- not
>> nearly as slow as the original sed version, but significantly (about
>> 6x) slower than the megabyte-at-a-time approach.

        That's surprising, Yaron; I'd expect the speed to be about the
same or a little faster, because there's less overhead from reading
the file just once instead of N times.  Is the binary so large that
your machine starts swapping when it reads in the whole file?

--
A.M. Kuchling                   http://starship.skyport.net/crew/amk/
Crowley was in Hell's bad books. [Footnote: Not that Hell has any other kind.]
    -- Terry Pratchett & Neil Gaiman, _Good Omens_



Fri, 24 Nov 2000 03:00:00 GMT  
 search/replace in long files

Quote:

> [tim]
> > It has less to do with the regular expression than with the
> > expected data. [etc]

> [Yaron M. Minsky]
> > Fully understood.  That's why if you did use a dynamic window
> > size, you'd probably want to put an upper limit on the buffer
> > size, and have the regexp code throw an exception when that
> > size is passed.

> That makes it incomprehensible for "normal users" though, and, less
> charitably, is pure hackery <0.6 wink>.  Library facilities suffer when they
> grow obscure options of use to only one user in ten thousand.  Nothing to
> stop someone from writing a wrapper around re, though.

I'm not convinced.  I think this spec is actually much easier to
understand in some ways.  Without the optional memory limit, the
regular expression code would just do the most natural thing --
without a large cost.  Then, people who wanted to be careful could add
bounds to limit the amount of memory allocated, and you'd only get an
error if it was exceeded.

Being restricted to either regular expressions that don't cross
individual lines, or else having to figure out a priori
what the bounds on a regexp match are seems more complicated, not
less.  The only reason it's no big deal is that most people just read
in the whole file.  But, as I mentioned, this doesn't work well for
large files.  Maybe nobody other than me uses anything but
newline-bounded regular expressions on large files.  I don't see why
that would be the case, though.

Quote:
> The speed of stuff like this is extremely platform (machine + OS)
> dependent -- even the cost of a lone readline slobbers all over the map
> across platforms.  Wouldn't be surprised if your timing results were
> reversed on a different platform!

Well, perhaps.  But it's clear that if you read a big enough file into
memory, you're going to have to do a lot of needless swapping, and
that will cost you on any platform.

Quote:
> Even better would be to write a little lex driver -- lex is optimized for
> finding needles in line-oriented haystacks in ways that the Python flavor of
> regexps aren't and aren't likely to become.

I'll take a look at that.

Quote:
> But there already is:  reading the whole file as a single string is
> thoroughly general.  What you really wanted was a faster solution, and in
> fact were eager to give up generality to get it <0.9 wink>.  I don't think
> the need arises often enough to care about, but, if it does, whoever decides
> to solve it should beware of platform-specific pseudo-speedups.

Again, I don't buy it.  It's not just a speed issue.  If you want to
work on gigabyte-size files, this just doesn't work.  So, there really
is a generality issue here.  Maybe that generality doesn't matter much
in practice, but it is a question of generality all the same.

y

--
|--------/            Yaron M. Minsky              \--------|
|--------\ http://www.cs.cornell.edu/home/yminsky/ /--------|



Fri, 24 Nov 2000 03:00:00 GMT  
 search/replace in long files

Quote:

> Tim Peters writes:

>    Problem: re.sub doesn't modify strings in-place because Python
> strings are immutable, so it would try to return a (very large)
> brand-new string with the substitutions.  I'm not sure what the fix
> is; should there be a sub_in_place method or should re.sub magically
> "know" somehow that an mmap'ed file is mutable, so the changes should
> be made in place?  However, mmap objects already have a simple find()
> method like string.find; I forget if they have a replace(), but one
> could be added.

This isn't such a problem:  you can use the regular expression engine
to find the match and the groups in that match, and then do the
replacement by hand.

I guess the problem here is that you're doing the loop in regular
python code as opposed to the (optimized?) re code.

Quote:
> >> And I just tried the simplest approach -- just read the whole file in.
> >> It turns out to be in between the two approaches in efficiency -- not
> >> nearly as slow as the original sed version, but significantly (about
> >> 6x) slower than the megabyte-at-a-time approach.

>    That's surprising, Yaron; I'd expect the speed to be about the
> same or a little faster, because there's less overhead from reading
> the file N times instead of just once.  Is the binary so large that
> your machine starts swapping when it reads in the whole file?

Well, it's not so surprising, really.  Once you're up to reading in a
megabyte at a time (as opposed to a line at a time), there isn't much
overhead left to take advantage of.  And reading in all ten megabytes
causes enough extra swapping that overall you lose.

y

--
|--------/            Yaron M. Minsky              \--------|
|--------\ http://www.cs.cornell.edu/home/yminsky/ /--------|



Fri, 24 Nov 2000 03:00:00 GMT  
 search/replace in long files

Quote:
Yaron M. Minsky writes:
>I'm not convinced.  I think this spec is actually much easier to
>understand in some ways.  Without the optional memory limit, the
>regular expression code would just do the most natural thing --
>without a large cost.  Then, people who wanted to be careful could add
>bounds to limit the amount of memory allocated, and you'd only get an
>error if it was exceeded.

        I think I've gotten lost somewhere along the way; how does
limiting the memory help with the problem of regular expressions
across multiple lines?  You'd still have the fundamental problem of an
expression potentially matching across the boundary between two chunks
of data.

Quote:
>large files.  Maybe nobody other than me uses anything but
>newline-bounded regular expressions on large files.  I don't see why
>that would be the case, though.

        When you get to the point where you're searching through
gigabyte-sized files, you'd probably forget regular expressions and
create an index or some other auxiliary data structure.  Regular
expressions are most often used to match single components that are
fairly small, though they may be embedded in a much larger file.  For
example, even though an SGML file may be several megabytes long,
you're interested in searching through that bulk for a very small
thing like a <P> tag.  As evidence for this: for a long time Perl's
regex engine had a few fixed limits on repetitions of 32768 (now
removed), but I think few people actually ran into those limits in
practice.

        Memory-mapped files may help, if the OS's virtual memory
algorithms are decent, since regex searching would be very linear in
its pattern of references.
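
(For what it's worth, with today's re module a bytes pattern can be run
over an mmap object directly, since mmap objects support the buffer
interface; a small sketch, with an invented SGML-ish file name:)

import mmap
import re

f = open("huge.sgml", "rb")
mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
# The search walks the mapped pages front to back, so the OS can page them
# in sequentially and discard them once the scan has moved past.
for m in re.finditer(rb"<P>.*?</P>", mm, re.DOTALL | re.IGNORECASE):
    print(m.start(), m.group()[:60])
mm.close()
f.close()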

--
A.M. Kuchling                   http://starship.skyport.net/crew/amk/
Being poor is a little like having an earache over a Bank Holiday. All you can
think about is the pain and how long it will be before a healing hand can be
found to take away the anguish.
    -- Tom Baker, in his autobiography



Fri, 24 Nov 2000 03:00:00 GMT  
 search/replace in long files

Quote:

> Yaron M. Minsky writes:
>    I think I've gotten lost somewhere along the way; how does
> limiting the memory help with the problem of regular expressions
> across multiple lines?  You'd still have the fundamental problem of an
> expression potentially matching across the boundary between two chunks
> of data.

Limiting the memory doesn't help make the regular expressions work --
what makes them work is the use of sliding windows that keep in memory
only as much of the file as you need.

Limiting the memory simply ensures that you don't load megabytes at a
time if you don't expect megabyte-long patterns.
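
Very roughly, the sort of thing being proposed (a sketch only; the
names, chunk size, and exception class are invented, it assumes the
pattern can't match the empty string, and the comment marks exactly
the ambiguity under discussion):

import re

class BufferLimitExceeded(Exception):
    pass

def stream_finditer(fileobj, pattern, max_buffer=16 << 20, chunk_size=1 << 20):
    """Grow an in-memory window only as far as needed to find each match of a
    compiled bytes pattern, and give up loudly past max_buffer bytes."""
    buf = b""
    offset = 0
    while 1:
        m = pattern.search(buf)
        if m:
            # Caveat: a greedy match found here might have stretched further
            # had more of the stream already been read -- deciding when a
            # match is really complete is the hard part of this proposal.
            yield offset + m.start(), m.group()
            buf = buf[m.end():]
            offset += m.end()
            continue
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        buf = buf + chunk                      # no match yet: widen the window
        if len(buf) > max_buffer:
            raise BufferLimitExceeded("window would exceed %d bytes" % max_buffer)

f = open("bigfile", "rb")
for pos, text in stream_finditer(f, re.compile(rb"begin.*?end", re.DOTALL)):
    print(pos, text[:40])
f.close()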

Quote:
>    When you get to the point where you're searching through
> gigabyte-sized files, you'd probably forget regular expressions and
> create an index or some other auxiliary data structure.  Regular
> expressions are most often used to match single components that are
> fairly small, though they may be embedded in a much larger file.  For
> example, even though an SGML file may be several megabytes long,
> you're interested in searching through that bulk for a very small
> thing like a <P> tag.  As evidence for this: for a long time Perl's
> regex engine had a few fixed limits on repetitions of 32768 (now
> removed), but I think few people actually ran into those limits in
> practice.

Well, the question is, do people want to use regular expressions
where:

   a) The safe boundaries to cut on aren't dead obvious (like line
      breaks)
   b) The file as a whole is too big to reasonably/efficiently load
      into memory.

If they do, then the current situation is suboptimal.  I'd think this
would come up in the scanning of things like SGML files for multi-line
patterns, but perhaps nobody runs into this situation.  Certainly,
nobody but me is talking about it on this list, so perhaps that is the
case for the python community.

Quote:
>    Memory-mapped files may help, if the OS's virtual memory
> algorithms are decent, since regex searching would be very linear in
> its pattern of references.

Yeah, although I could imagine it causing a whole lot of unnecessary
swapping.  The OS can't know that once I'm done looking at a piece of
memory, I'm really completely done.

y

--
|--------/            Yaron M. Minsky              \--------|
|--------\ http://www.cs.cornell.edu/home/yminsky/ /--------|



Fri, 24 Nov 2000 03:00:00 GMT  
 search/replace in long files

[Yaron M. Minsky]

Quote:
> That's why if you did use a dynamic window size, you'd probably
> want to put an upper limit on the buffer size, and have the regexp
> code throw an exception when that size is passed.

[tim]

Quote:
> That makes it incomprehensible for "normal users" though, and, less
> charitably, is pure hackery <0.6 wink>.  Library facilities suffer
> when they grow obscure options of use to only one user in ten thousand.
> Nothing to stop someone from writing a wrapper around re, though.

[Yaron]

Quote:
> I'm not convinced.  I think this spec is actually much easier to
> understand in some ways.  Without the optional memory limit, the
> regular expression code would just do the most natural thing --
> without a large cost.  Then, people who wanted to be careful could add
> bounds to limit the amount of memory allocated, and you'd only get an
> error if it was exceeded.

Yes, I think I said "incomprehensible" above <wink>:  (a) the problem it
addresses rarely comes up; and, (b) when it does come up, this doesn't
*solve* it.

Quote:
> Being restricted to either regular expressions that don't cross
> individual lines, or else having to figure out a priori
> what the bounds on a regexp match are seem more complicated, not
> less.

Read the file into a string, and neither problem arises.

Quote:
>  The only reason it's no big deal is that most people just read
> in the whole file.  But, as I mentioned, this doesn't work well
> for large files.

Doesn't match my experience, as I hinted by saying this stuff was
platform-dependent.   Details below.

Quote:
> Maybe nobody other than me uses anything but newline-bounded regular
> expressions on large files.  I don't see why that would be the case,
> though.

I'd guess because hardly anyone ever has a need to make widespread yet
repetitive changes in huge binary files.  Why *would* they?  Huge binary
files are usually the output of some program, and changes in the former are
usually effected by changing inputs to the latter.  Look at your own example
for confirmation <wink -- but as you noted, the *obvious* way to solve your
problem would have been to edit the Netscape source code and recompile>.

[tim]

Quote:
> The speed of stuff like this is extremely platform (machine + OS)
> dependent ...

[Yaron]

Quote:
> Well, perhaps.  But it's clear that if you read a big enough file into
> memory, you're going to have to do a lot of needless swapping, and
> that will cost you on any platform.

Here's the result of a "read the whole file" vs "read a million bytes at a
time" (followed by a regexp search in either case) test on my machine, for a
12Mb file:

file was size 12582838
time was 1.47

chunk sizes [1000000, 1000000, 1000000, 1000000, 1000000, 1000000,
             1000000, 1000000, 1000000, 1000000, 1000000, 1000000,
             582838]
time was 2.28

A) Chunking it up is significantly slower on this machine (a P5-166 w/ 32Mb
RAM, and about 20 other processes running -- not exactly a state-of-the-art
machine anymore).

B) But chunking it up is nevertheless *so* fast that the speed difference is
meaningless to me.

C) Either method runs much slower the first time I run the test, reflecting
initial file-read time (or, more likely, the time to swap everything else
*out* of memory).

D) Of the 20,000+ files on my disk, only a handful are bigger than this one.

E) And I'll never have a need to run a regexp over any of 'em <0.1 wink>.

Here's the code that produced the preceding, on the chance I'm doing the
whole-file read or the chunking in some way that might account for the gross
difference from your results:

import time
import re

fname = "c:/windows/temp/bigpig"

# Pass 1: whole-file read, then one regexp search over the single big string.
f = open(fname, "rb", 0)
start = time.clock()
huge = f.read()
if re.search('xyz', huge):
    print 'eh?'
finish = time.clock()
print "file was size", len(huge)
print "time was", round(finish-start, 2)
f.close()

# Pass 2: the same search, reading the file a million bytes at a time.
f = open(fname, "rb", 0)
start = time.clock()
chunks = []
while 1:
    huge = f.read(1000000)
    if not huge:
        break
    chunks.append(len(huge))
    if re.search('xyz', huge):
        print 'eh?'
finish = time.clock()
f.close()
print
print "chunk sizes", chunks
print "time was", round(finish-start, 2)

Quote:
> Again, I don't buy it.  It's not just a speed issue.  If you want
> to work on gigabyte-size files, this just doesn't work.

People nuts enough to clump their data into Gb-sized chunks have got a world
of problems worse than a hypothetical need to run regexps over them.
Neither do Python arrays work well when you have a billion entries, and I
don't worry about that either.

Nevertheless, memory-mapped files *could* be used as the basis for a
pleasant solution to both.  They're not problems that I've had, do have, or
ever expect to have -- but if you think there's something that needs to be
solved here, go solve it!   I'll doubtless thank you profusely when God gets
around to kicking me for predicting it's a problem I'll never have <wink>.

huge-files-are-the-devil's-workshop-ly y'rs  - tim



Sat, 25 Nov 2000 03:00:00 GMT  
 