string.join() vs % and + operators 

Whereas Python's strings are immutable, there is potentially a strong
incentive to get them right the first time.  In my applications, I
want to create strings that have many fields within them.  I assume that
it would be nice to be able to modify the fields without creating a new
string any time its contents get changed, but I don't think that Python
gives any nice way to do this.  So, I'm stuck with building the string
from many little pieces.  The 'many' part of this gives me some worry
about efficiency, which it is better not to worry about, so I did a
brief test to see if there is an ugly downside to this.  I ran the
following script:

#  Start of Script
import string
# Create an array of little strings and a format to join them
s = []
f = ''
for i in range(100):
    s.append(`i`)
    f = f + '%s'

print "Start of Way 1 -- Create with a big format"
for i in range(100000):
    z = f % tuple(s)
print "end of Way 1"
print z
raw_input()
print "Start of Way 2 -- Create with a join"
for i in range(100000):
    z = string.join(s, '')
print "End of Way 2"
print z
raw_input()
print "Start of Way 3"
for i in range(100000):
    z = ''
    for j in s:
        z = z + j
print "End of Way 3"
print z

#  End of Script

This ran amazingly fast on my Pentium 200 MHz -- around 11 seconds for
Way 1, and 7 for Way 2.  So, either way, Python can put together about
1 million little strings in a second.  Way 3, the way that one would
expect to be bad, recreating the string with each concatenation, was
much slower, but only took about 1 minute.  Surprisingly swift as well.
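
For anyone repeating this on a current interpreter, here is a rough Python 3
sketch of the same three-way comparison.  It is not the script above: the
function names are made up for illustration, str.join stands in for
string.join, the loop count is smaller, and the timings will not match the
Pentium numbers quoted here.

```python
import time

# 100 small strings, as in the script above: "0", "1", ..., "99".
pieces = [str(i) for i in range(100)]
fmt = "%s" * len(pieces)


def build_with_format(parts, fmt):
    """Way 1: one big %-format applied to a tuple of the parts."""
    return fmt % tuple(parts)


def build_with_join(parts):
    """Way 2: str.join, the method form of string.join(parts, '')."""
    return "".join(parts)


def build_with_plus(parts):
    """Way 3: repeated concatenation, one `+` per part."""
    result = ""
    for part in parts:
        result = result + part
    return result


def seconds_for(func, *args, loops=10000):
    """Time `loops` calls of func(*args) with a high-resolution clock."""
    start = time.perf_counter()
    for _ in range(loops):
        func(*args)
    return time.perf_counter() - start


# All three ways must produce the same string before timing means anything.
expected = build_with_join(pieces)
assert build_with_format(pieces, fmt) == expected
assert build_with_plus(pieces) == expected

print("format:", round(seconds_for(build_with_format, pieces, fmt), 3))
print("join:  ", round(seconds_for(build_with_join, pieces), 3))
print("plus:  ", round(seconds_for(build_with_plus, pieces), 3))
```

Recent CPython versions can optimise some `+` patterns on strings, so the gap
for Way 3 may come out smaller than the one reported above.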

Anybody have anything to add to this?  Are there any related pitfalls
that I may have missed?

Al



Tue, 18 Sep 2001 03:00:00 GMT  
We have an application where many thousands of small strings are
inserted into and deleted from a very large buffer.
When you have 100k different small strings to replace in 4 MB of text,
string.join seems to work well.
I don't know where the trade-off lies, but you could also use re.sub.

One problem I still don't know how best to avoid is extreme memory
consumption. When you have these large objects around, they can be referenced
from higher-level objects and not get destroyed until the program exits. We
should have thought about this from the beginning of our project.

import string

def insertDeleteList(inbuf, l):
    """ Insert and delete segments in a buffer.
    l is a list of (start, string, end) tuples and must be sorted.
    If start and end are equal then string is inserted at that point.
    If end > start then this range is deleted (replaced by string).
    If end < start then this range is duplicated.
    """
    splitBuf=[]
    last=0
    for i in l:
        b=inbuf[last:i[0]]      # Unchanged text before this edit
        splitBuf.append(b)
        splitBuf.append(i[1])   # The inserted string
        last=i[2]       # Advance past some buffer here
    splitBuf.append(inbuf[last:])
    return string.join(splitBuf,'')
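
A small Python 3 sketch of the same split-and-join idea, with a usage
example.  The name apply_edits and the sample buffer are hypothetical, chosen
only to show how the (start, string, end) tuples behave; it assumes the edit
list is already sorted, as the docstring above requires.

```python
def apply_edits(inbuf, edits):
    """Apply sorted (start, text, end) edits to inbuf in a single pass."""
    parts = []
    last = 0
    for start, text, end in edits:
        parts.append(inbuf[last:start])  # unchanged text before this edit
        parts.append(text)               # text inserted at `start`
        last = end                       # resume copying from `end`
    parts.append(inbuf[last:])           # unchanged tail
    return "".join(parts)


buf = "hello world"

# Replace "h" with "H" (end > start), then insert "," at index 5 (end == start).
edited = apply_edits(buf, [(0, "H", 1), (5, ",", 5)])
print(edited)       # Hello, world

# end < start: copying resumes from an earlier point, so "hello" appears twice.
duplicated = apply_edits(buf, [(5, "", 0)])
print(duplicated)   # hellohello world
```

Collecting the slices in a list and joining once keeps the work to a single
pass over the buffer, rather than rebuilding the whole buffer for every edit.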



Tue, 18 Sep 2001 03:00:00 GMT  

Quote:

> Yup.  Using '+' for strings really bites when you've got long (or
> potentially long) strings.  The average size of the string you're
> concatenating is about a hundred characters; suppose you're doing CGI
> work with redirects, and you're looking at closer to three hundred
> characters a pop.

> Try adding a 1K pre-string to the front of each of your Ways and see
> what happens to the speed.

Time went up from 69 seconds to 115.  So it takes only a little bit
longer to join the strings than it does to skip over 1k each time.

Al



Tue, 18 Sep 2001 03:00:00 GMT  

Quote:

> Whereas Python's strings are immutable, there is potentially a strong
> incentive to get them right the first time.  In my applications, I
> want to create strings that have many fields within them.  I assume that
> it would be nice to be able to modify the fields without creating a new
> string any time its contents get changed, but I don't think that Python
> gives any nice way to do this.

Look into the array module (or the NumArray module from LLNL).
Depending on the specifics of your applications, it may allow you to do
what you want.


Tue, 18 Sep 2001 03:00:00 GMT  
Tim Peters wrote:

(All kinds of [snipped] insights and outsights)

Great stuff, Tim.  You are right. Your method boosts speed by
a factor of 5 over the fastest of my 3 methods, which already looked
to be fast enough not to worry about.  In the applications I'm looking
at, building records from a few dozen fields each and then writing them
out to indexed files,  the record building should take much less time
than the I/O and record manager.  So, no matter how fast or slow
Python is compared to brand X, the internal processing will be pretty
insignificant as a part of the total run time.

Does-the-next-Python-book-tell-how-to-finish-a-message-so-that-the-reader's-IQ-equals-their-percentage-of-comprehension?

Al



Tue, 18 Sep 2001 03:00:00 GMT  

Quote:

>This ran amazingly fast on my Pentium 200 MHz -- around 11 seconds for
>Way 1, and 7 for Way 2.  So, either way, Python can put together about
>1 million little strings in a second.  Way 3, the way that one would
>expect to be bad, recreating the string with each concatenation, was
>much slower, but only took about 1 minute.  Surprisingly swift as well.

>Anybody have anything to add to this?  Are there any related pitfalls
>that I may have missed?

Yup.  Using '+' for strings really bites when you've got long (or
potentially long) strings.  The average size of the string you're
concatenating is about a hundred characters; suppose you're doing CGI
work with redirects, and you're looking at closer to three hundred
characters a pop.

Try adding a 1K pre-string to the front of each of your Ways and see
what happens to the speed.



Wed, 19 Sep 2001 03:00:00 GMT  
[Al Christians]

Quote:
> Whereas Python's strings are immutable, there is potentially a strong
> incentive to get them right the first time.

Not really much more than in a language with mutable strings -- if you
overwrite substrings in one of the latter, it's going to be s-l-o-w when the
length changes.  OTOH, if your fields are 50's-style fixed-width <wink>, a
Python character array (array.array('c')) makes a fine mutable string.

Quote:
> In my applications, I want to create strings that have many fields
> within them.  I assume that it would be nice to be able to modify
> the fields without creating a new string any time its contents get
> changed, but I don't think that Python gives any nice way to do this.

Maybe see above?  I'm not sure what you mean.  Is there any language that
*does* give you "a nice way to do this", keeping in mind that you're worried
about efficiency too?  E.g., use "substr" on the left-hand side of a Perl
string assignment, and under the covers it's going to copy the whole
thing -- even if the length doesn't change:

    $a = "gold you so";
    $b = $a;
    substr($a, 0, 1) = "t";
    print "$a\n$b\n";

prints

    told you so
    gold you so

You can do better than this in Python as-is, although you need a little more
typing:

    import array
    a = array.array('c', "gold you so")
    b = a
    a[0] = "t"
    print a.tostring()
    print b.tostring()

In return for letting you change a[0] in-place, this prints "told you so"
twice.
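
On Python 3 the 'c' typecode is gone from the array module, so the snippet
above will not run there as written.  A rough equivalent, assuming ASCII-only
text and using the built-in bytearray type as a stand-in for the character
array, might look like this:

```python
# bytearray is a mutable sequence of bytes; here it plays the role that
# array.array('c') plays in the snippet above (an assumption for this sketch).
a = bytearray(b"gold you so")
b = a                      # b is another name for the same object, not a copy

a[0] = ord("t")            # change the first byte in place

print(a.decode("ascii"))   # told you so
print(b.decode("ascii"))   # told you so
```

As with the character array, the in-place change is visible through both
names because no new object was created.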

Quote:
> So, I'm stuck with building the string from many little pieces.

Think of it instead as an opportunity to excel <wink>.

Quote:
> The 'many' part of this gives me some worry about efficiency, which it
> is better not to worry about, so I did a brief test to see if there is
> an ugly downside to this.  I ran the following script:

[tries a long "%s%s%s..." format, string.join, and repeated catenation;
 discovers the 2nd is fastest, the first 2nd-fastest, and third much slower
]

Quote:
> ...
> Way 3, the way that one would expect to be bad, recreating the string
> with each concatenation, was much slower, but only took about 1 minute.
> Surprisingly swift as well.

It can be very much worse, of course -- it's a quadratic-time approach, and
you're helped here in that the final length of your string is only a few
hundred characters.
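
To put rough numbers on "quadratic", here is a small Python 3 sketch that
counts characters copied under a deliberately simple model: every `+` builds
a brand-new string, while a join copies each character once.  The function
names and the prefix_len parameter (standing in for the 1K pre-string
suggested elsewhere in this thread) are made up for illustration, and the
model ignores per-operation overhead and any interpreter optimisations.

```python
def chars_copied_by_plus(parts, prefix_len=0):
    """Characters copied if every `+` builds a brand-new string."""
    total = 0
    length = prefix_len
    for part in parts:
        length += len(part)   # size of the string this `+` creates
        total += length       # all of it is copied into the new string
    return total


def chars_copied_by_join(parts, prefix_len=0):
    """Characters copied if the result is assembled in a single pass."""
    return prefix_len + sum(len(part) for part in parts)


pieces = [str(i) for i in range(100)]   # "0" .. "99", 190 characters in all

print(chars_copied_by_plus(pieces))         # 9145
print(chars_copied_by_join(pieces))         # 190
print(chars_copied_by_plus(pieces, 1024))   # 111545
print(chars_copied_by_join(pieces, 1024))   # 1214
```

With only 190 characters in the final string the repeated-`+` total stays
small in absolute terms, which fits the observation that Way 3 was slow but
tolerable; the count grows with the square of the number of pieces.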

Quote:
> Anybody have anything to add to this?

Maybe the array module; maybe not.

Quote:
> Are there any related pitfalls that I may have missed?

Not if you stick to string.join -- it's reliably good at this.  I'll attach
a rewrite of your timing harness that avoids the common Python timing
pitfalls, and adds a fourth method showing that array.tostring() blows
everything else out of the water.  But then doing a length-changing slice
assignment to a character array is like doing a length-changing assignment
to a Python list:  under the covers, everything "to the right" is shifted
left or right as needed to keep the array contiguous; mutability can be
expensive in under-the-cover ways.
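
A short Python 3 illustration of same-width versus length-changing slice
assignment.  It uses the built-in bytearray as a stand-in for the character
array (an assumption, since the 'c' typecode does not exist in Python 3), and
it shows only the visible result, not the shifting that happens underneath.

```python
record = bytearray(b"gold you so")

# Same-width overwrite: one byte replaces one byte, nothing has to move.
record[0:1] = b"t"
same_width = record.decode("ascii")

# Length-changing assignment: 4 bytes become 8, so the tail " you so"
# has to be shifted right to keep the buffer contiguous.
record[0:4] = b"informed"
grown = record.decode("ascii")

# Shrinking works too; this time the tail is shifted left.
record[0:8] = b"I"
shrunk = record.decode("ascii")

print(same_width)   # told you so
print(grown)        # informed you so
print(shrunk)       # I you so
```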

pay-now-pay-later-or-pay-all-the-time-ly y'rs  - tim

import string
N = 100
S = []
for i in range(N):
    S.append(`i`)
F = "%s" * N

# for method 4 (character array)
import array
SARRAY = array.array('c')
for i in S:
    SARRAY.fromstring(i)

# if time.clock has good enough resolution (it does under Windows),
# no point to looping more often than this
indices = range(10000)

def f1(s=S, f=F):
    for i in indices:
        z = f % tuple(s)
    return z

def f2(s=S, join=string.join):
    for i in indices:
        z = join(s, '')
    return z

def f3(s=S):
    for i in indices:
        z = ''
        for j in s:
            z = z + j
    return z

def f4(s=SARRAY):
    for i in indices:
        z = s.tostring()
    return z

def timeit(f):
    from time import clock
    start = clock()
    result = f()
    finish = clock()
    print f.__name__, round(finish - start, 2)
    return result

z1 = timeit(f1)
z2 = timeit(f2)
z3 = timeit(f3)
z4 = timeit(f4)
assert z1 == z2 == z3 == z4
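
On a current Python 3, the standard timeit module covers much of what the
harness above does by hand.  The sketch below is an approximation, not a port:
it times only the join method and a bytearray decode, the function names are
hypothetical, and bytearray.decode is merely a rough stand-in for
array.tostring(), not the same operation.

```python
import timeit

pieces = [str(i) for i in range(100)]
packed = bytearray("".join(pieces), "ascii")   # pre-built mutable buffer


def join_pieces(parts=pieces, join="".join):
    """Build the string from its parts; names are bound as fast locals."""
    return join(parts)


def decode_packed(buf=packed):
    """Turn the pre-built buffer into a string in one step."""
    return buf.decode("ascii")


# Both approaches must agree before the timings are worth comparing.
assert join_pieces() == decode_packed()

for func in (join_pieces, decode_packed):
    # Repeat the measurement and keep the minimum, which is the run
    # least disturbed by other activity on the machine.
    best = min(timeit.repeat(func, number=10000, repeat=5))
    print(func.__name__, round(best, 4))
```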



Wed, 19 Sep 2001 03:00:00 GMT  
 