string.join is abysmally slow 

Greetings,

I've run into a performance problem in one of my functions, and wonder
if I could get some recommendations on how to speed things up.

What I'm trying to do is read in a textfile containing e-mail
addresses, one per line, and use them to build a regular expression
object in the form "address1|address2|address3|addressN" to search
against.

I'm using string.join to concatenate the addresses together, separated
by a `|'.  The problem is that string.join is unacceptably slow in
this task.  The following program takes 37 seconds on a PIII/700 to
process a 239-line file!

--------------------------------------------------------------------

import fileinput, re, string
list = []

for line in fileinput.input(textfile):
    # Comment or blank line?
    if line == '' or line[0] in '#':
        continue
    else:
        list.append(string.strip(line))
        # "address1|address2|address3|addressN"
        regex = string.join(list,'|')
        regex = '"' + regex + '"'
        reo = re.compile(regex, re.I)

--------------------------------------------------------------------




Fri, 03 Oct 2003 06:59:11 GMT  

### The name list is a built-in, avoid using such names
joinList = []

for line in buf.split("\012"):
    # Comment or blank line?
    if line == '' or line[0] in '#':
        continue
    else:
        joinList.append(string.strip(line))

##### Move the following out of the loop

# "address1|address2|address3|addressN"
regex = string.join(joinList,'|')
regex = '"' + regex + '"'     #<-- Are you matching the quotes ?
reo = re.compile(regex, re.I)
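
On the quotes question above: a small sketch of what wrapping the joined string in literal double quotes does to the pattern. The sample addresses are made up for illustration, and this uses current Python 3 syntax rather than the string module from the original post. Because `|` has the lowest precedence, the leading quote becomes part of the first alternative only and the trailing quote part of the last one.

```python
import re

# Hypothetical sample addresses, not taken from the original poster's file.
addresses = ["alice@example.com", "bob@example.com", "carol@example.com"]
joined = "|".join(re.escape(a) for a in addresses)

# Pattern as built in the original post: quotes glued onto both ends.
quoted = re.compile('"' + joined + '"', re.I)
# Pattern without the extra quotes.
plain = re.compile(joined, re.I)

for address in addresses:
    print(address, bool(quoted.search(address)), bool(plain.search(address)))
```

With the quoted pattern, only the middle address is found in unquoted text; the first and last alternatives each require a literal `"` next to the address.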

################ Another approach

import re, string
buf="""
# xxxx

re1

re2
#
"""
sep=")|("
buf = re.sub("\012+",sep, re.sub("#.*", "",buf) )
breakOff = len(sep)-1
print buf[breakOff:-breakOff]

"""output --> (re1)|(re2)"""
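
For anyone trying the substitution approach on current Python, here is a rough Python 3 sketch of the same idea. The function name `alternation_from_text` is my own, not something from the post. The slicing trick above relies on the buffer starting and ending with a newline, so this variant strips the text first and adds the outer parentheses explicitly instead.

```python
import re

def alternation_from_text(buf):
    """Build '(a)|(b)|...' from text with one entry per line.

    Anything from '#' to the end of a line is treated as a comment,
    and runs of newlines (blank lines) collapse into one separator.
    """
    no_comments = re.sub(r"#.*", "", buf)
    body = re.sub(r"\n+", ")|(", no_comments.strip())
    return "(" + body + ")" if body else ""

sample = """
# xxxx

re1

re2
#
"""
print(alternation_from_text(sample))
```

Note that this sketch does not escape regex metacharacters in the entries and does not trim spaces left in front of a trailing comment; both would need handling for real address files.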

--Darrell



Fri, 03 Oct 2003 07:21:08 GMT  
Hello,

    I'd think there has to be a better way to go about the problem. I'd
imagine it's slow because strings are immutable: every time you append
something new, Python has to allocate a new block of memory of the proper
size, copy everything over, then deallocate the old one if it's no longer
referenced. This is just a guess, mind you.

    If your file is strictly one address per line, why not 'list =
file.readlines()' and then 'if email in list:'?  Both operations take a
fraction of a second on my P2-400MHz. I tested it with a mock-up file of 239
pretend email addresses I generated.

    Then again... after loading the entire file into a list of 'lines',
'regex = "|".join(lines)' is also nearly instantaneous for me. What version
of Python are you using?
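
A minimal sketch of the membership-test idea in current Python 3, under a few assumptions of mine: the helper name `load_addresses` and the sample lines are invented, addresses are lower-cased to mimic the re.I flag, and a set is used instead of a list. One thing to watch: readlines() keeps the trailing newline on each line, so the lines have to be stripped before 'email in ...' can match.

```python
def load_addresses(lines):
    """Return a set of stripped, lower-cased addresses.

    Blank lines and lines starting with '#' are skipped.
    """
    addresses = set()
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            addresses.add(line.lower())
    return addresses

# Stand-in for file.readlines(); each entry still ends with a newline.
sample_lines = ["# mailing list\n", "alice@example.com\n", "\n", "Bob@Example.com\n"]
known = load_addresses(sample_lines)

print("bob@example.com" in known)
print("carol@example.com" in known)
```

This only answers "is this exact address in the file?". If the goal is to find any of the addresses inside a longer piece of text, the compiled regular expression is still the tool for that.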

--Stephen
(to reply, remove 'NOSPAM' and replace with 'seraph')



Fri, 03 Oct 2003 07:29:09 GMT  

Quote:

>Greetings,

>I've run into a performance problem in one of my functions, and wonder
>if I could get some recommendations on how to speed things up.

>What I'm trying to do is read in a textfile containing e-mail
>addresses, one per line, and use them to build a regular expression
>object in the form "address1|address2|address3|addressN" to search
>against.

>I'm using string.join to concatenate the addresses together, separated
>by a `|'.  The problem is that string.join is unacceptably slow in
>this task.  The following program takes 37 seconds on a PIII/700 to
>process a 239-line file!

Yes, the number one rule on optimization is not to make assumptions but to
profile instead :-) I didn't do that, but I would bet it's the re.compile being
called 239 times that wastes the most time. Just do one re.compile after having
built the regular expression and it should be a lot faster.
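
A rough way to check that bet, written for current Python 3. The function names and the generated addresses are my own placeholders, and the timings will vary by machine and Python version, so none are quoted here. The first function rebuilds and recompiles the pattern after every address, as in the original loop; the second joins and compiles once.

```python
import re
import time

def compile_in_loop(addresses):
    """Rejoin and recompile after every address (the shape of the original loop)."""
    seen = []
    pattern = None
    for address in addresses:
        seen.append(re.escape(address))
        pattern = re.compile("|".join(seen), re.I)
    return pattern

def compile_once(addresses):
    """Collect everything first, then join and compile a single time."""
    return re.compile("|".join(re.escape(a) for a in addresses), re.I)

# 239 made-up addresses, matching the size of the file in the question.
addresses = ["user%d@example.com" % i for i in range(239)]

for build in (compile_in_loop, compile_once):
    start = time.perf_counter()
    build(addresses)
    elapsed = time.perf_counter() - start
    print("%-16s %.4f s" % (build.__name__, elapsed))
```

Both functions end up with the same final pattern; the loop version simply throws away 238 intermediate compiled objects along the way.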

Gerhard


--
mail:   gerhard <at> bigfoot <dot> de
web:    http://highqualdev.com


Fri, 03 Oct 2003 06:28:48 GMT  

Quote:
>I've run into a performance problem in one of my functions, and wonder
>if I could get some recommendations on how to speed things up.
...
>I'm using string.join to concatenate the addresses together, separated
>by a `|'.  The problem is that string.join is unacceptably slow in
>this task.  The following program takes 37 seconds on a PIII/700 to
>process a 239-line file!

If "|".join is slow, use it sparingly?  The real problem seems to be
that you are recreating the regexp about 238 times or more, and then
throwing the result away.  Create the regexp once after the loop has
completed.  You could also re.escape() the mail addresses:

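import fileinput, re

addresses = []
for line in fileinput.input(textfile):
    line = line.strip()
    # Test after stripping: lines read from a file keep their newline,
    # so a blank line is '\n' rather than ''.
    if line != '' and line[0] != '#':
        addresses.append(re.escape(line))

# "address1|address2|address3|addressN"
reo = re.compile('|'.join(addresses), re.I)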
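
To show why re.escape() is worth the trouble, here is a small Python 3 sketch with an invented address. Unescaped, the '.' matches any character and the '+' turns into a repetition operator, so the raw pattern can both miss the real address and accept a lookalike.

```python
import re

# Hypothetical address containing two regex metacharacters: '.' and '+'.
address = "a.b+list@example.com"
lookalike = "axbbblist@example.com"

unescaped = re.compile(address, re.I)
escaped = re.compile(re.escape(address), re.I)

print("unescaped finds real address:", bool(unescaped.search(address)))
print("unescaped finds lookalike:   ", bool(unescaped.search(lookalike)))
print("escaped finds real address:  ", bool(escaped.search(address)))
print("escaped finds lookalike:     ", bool(escaped.search(lookalike)))
```

In the unescaped pattern, 'b+' means "one or more b", so the literal '+' in the real address has nothing to match against and the search fails.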

        BR,
                                                Pekka



Fri, 03 Oct 2003 07:46:42 GMT  

Quote:

>         If "|".join is slow, use it sparingly?  The real problem
>         seems to be that you are recreating the regexp about 238
>         times or more, and then throwing the result away.  Create
>         the regexp once after the loop has completed.

Right.  I don't know why I didn't notice this.  Thanks.  After moving
the string.join and re.compile outside the for loop, the same function
now processes 10,000 lines in under 15 seconds.  Just a slight
improvement.  :-)

Cheers,
Graham




Fri, 03 Oct 2003 10:25:57 GMT  


I tried running your code with a 240-line input file and it ran in no time on
a PII/233 Linux box.

Anyway, I believe you want to move the last three lines of code out of the
for loop.

Sebrosa



Fri, 03 Oct 2003 11:03:22 GMT  
 