Splitting a string every 'n' 

What is the idiomatic way to split a string into a list
containing 'n' character substrings?  I normally do
something like:

while strng:
    substring = strng[:n]
    strng = strng[n:]
    <process substring>

But the performance of this is hopeless for very long strings!
Presumably because there's too much list reallocation?  Can't Python
just optimise this by shuffling the start of the list forward?

Any better ideas, short of manually indexing through?  Is there
something like:

for substring in strng.nsplit():
    <process substring>



Sat, 25 Dec 2004 20:50:19 GMT  

How about:

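import re
rex = re.compile('....', re.DOTALL)  # four dots: matches exactly 4 characters (n = 4)

# Note: a trailing remainder shorter than 4 characters is not matched.
for substring in rex.findall(strng):
    <process substring>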

HTH

Harvey Thomas




Sat, 25 Dec 2004 21:04:34 GMT  

(I'm replying, but as I'm still a newbie I don't know if it's a good idea. I'm just trying, and hope that the gurus will correct me.)

x = []
i0 = 0
# i is the end index of each chunk; the final slice may be shorter than n
for i in range(n, len(strng) + n, n):
    x.append(strng[i0:i])
    i0 = i
map(<processing substring function>, x)

s13.



Sat, 25 Dec 2004 21:46:17 GMT  

Quote:

> What is the idiomatic way to split a string into a list
> containing 'n' character substrings?  I normally do

I'm not sure there is just one.  I suspect what _feels_
idiomatic to you in this respect depends on where you're
coming from.  I just saw Harvey Thomas post a re-based
solution that is surely quite correct (and perhaps may
even have good performance!) but would just never occur
to me first thing...

Quote:
> something like:

> while strng:
>     substring = strng[:n]
>     strng = strng[n:]
>     <process substring>

> But the performance of this is hopeless for very long strings!

Definitely!

Quote:
> Presumably because there's too much list reallocation?  Can't Python

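Yep, near enough: strings are immutable, so each strng = strng[n:] copies
the whole remaining string, and the total work grows quadratically with
the string's length.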
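
To make the cost concrete, here is a minimal sketch that counts how many characters each approach copies. It assumes Python 3 and that every slice of a string produces a fresh copy (as CPython does); the two helper names are hypothetical and used only for illustration:

```python
def chars_copied_by_reassign(length, n):
    """Characters copied by `sub = s[:n]; s = s[n:]` repeated until s is empty."""
    copied = 0
    remaining = length
    while remaining:
        chunk = min(n, remaining)
        copied += chunk        # s[:n] copies the chunk itself
        remaining -= chunk
        copied += remaining    # s[n:] copies everything that is left
    return copied


def chars_copied_by_index(length, n):
    """Characters copied by `s[i:i+n]` for i in range(0, length, n)."""
    return sum(min(n, length - i) for i in range(0, length, n))


print(chars_copied_by_reassign(100000, 4))  # 1250050000
print(chars_copied_by_index(100000, 4))     # 100000
```

The indexing loop copies each character once, while the reassigning loop copies the shrinking tail on every pass.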

Quote:
> just optimise this by shuffling the start of the list forward?

Not without a lot of trouble that would definitely complicate
the interpreter's code and quite possibly deteriorate performance
for all normal cases that can't easily benefit from such
"sharing" of pieces of one string.

Quote:
> Any better ideas, short of manually indexing through?  Is there

What's wrong with "manually indexing through"?  I assume you mean:

for i in xrange(0, len(strng), n):
    substring = strng[i:i+n]
    process(substring)

and I don't see anything wrong with it -- though I might shrink
it a bit down to

for i in xrange(0, len(strng), n):
    process(strng[i:i+n])

that's basically the same idea.

I'm honestly having a hard time seeing anything wrong with this
solution, which is presumably what I'd need in order to come up with
anything BETTER.  DIFFERENT is easy, e.g., on 2.3, or on 2.2 with
from __future__ import generators, why not a generator:

def slicer(strng, n):
    for i in xrange(0, len(strng), n):
        yield strng[i:i+n]

and then

for substring in slicer(strng, n):
    process(substring)

but that's really the same code again with a false moustache...
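
For readers on Python 3, a minimal sketch of the same generator: `range` takes the place of `xrange`, and no `__future__` import is needed. The sample string is an assumption chosen only to show the short final slice:

```python
def slicer(strng, n):
    """Yield successive n-character slices; the last one may be shorter."""
    for i in range(0, len(strng), n):
        yield strng[i:i + n]


for substring in slicer("abcdefghij", 4):
    print(substring)
```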

Alex



Sat, 25 Dec 2004 22:09:39 GMT  


Quote:

> What is the idiomatic way to split a string into a list
> containing 'n' character substrings?  I normally do
> something like:

> while strng:
>     substring = strng[:n]
>     strng = strng[n:]
>     <process substring>

You are asking two different questions in your text and code:
1. How do you generate an explicit list of successive length-n substrings
(slices)?
2. How do you process successive length-n substrings (slices), which can
then be tossed?

The second is easier than the first; both require attention to the
possibility of a remainder of length less than n.
...

Quote:
> Any better ideas, short of manually indexing through?

What, pray tell, is wrong with doing the simple obvious thing that you
can program correctly in a minute or two?

Quote:
>  Is there something like:

> for substring in strng.nsplit():
>     <process substring>

Note that this says that (2) rather than (1) above is your question.
For 2.2+, write a generator that manually indexes through the sequence,
returning successive slices.  A second parameter could determine whether
a short tail is returned or suppressed.
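
One way such a generator might look, as a sketch in Python 3 syntax. The names `slices` and `keep_tail` are hypothetical, not anything from the standard library:

```python
def slices(seq, n, keep_tail=True):
    """Yield successive length-n slices of seq.

    If keep_tail is false, a final slice shorter than n is suppressed.
    """
    for i in range(0, len(seq), n):
        piece = seq[i:i + n]
        if keep_tail or len(piece) == n:
            yield piece


print(list(slices("abcdefghij", 4)))                   # ['abcd', 'efgh', 'ij']
print(list(slices("abcdefghij", 4, keep_tail=False)))  # ['abcd', 'efgh']
```

Because it only relies on `len` and slicing, the same function works on lists and tuples as well as strings.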

Terry J. Reedy



Sat, 25 Dec 2004 22:15:49 GMT  
Simon> What is the idiomatic way to split a string into a list
Simon> containing 'n' character substrings?  I normally do
Simon> something like:

Simon> while strng:
Simon>     substring = strng[:n]
Simon>     strng = strng[n:]
Simon>     <process substring>

How about this?

        for start in range(0, len(strng), n):
            substring = strng[start:start+n]
            <process substring>

--



Sat, 25 Dec 2004 22:35:44 GMT  

Quote:

> Any better ideas, short of manually indexing through?  Is there
> something like:

> for substring in strng.nsplit():
>    <process substring>

No, you pretty much have to slice out the range you want, i.e.
    substring = strng[i:i+n]




Sat, 25 Dec 2004 23:48:17 GMT  

Quote:
> > But the performance of this is hopeless for very long strings!
> > Presumably because there's too much list reallocation?  Can't Python
> > just optimise this by shuffling the start of the list forward?

Using generators here compares favorably with a smart while loop.  They have
the advantage of separating the iteration from the processing, so you can
actually reuse gen_substring since it allows you to iterate over the
n-length substrings:

#! /usr/bin/env python

from __future__ import generators
from time import clock

def gen_substring(s, n):
    i = 0
    end = len(s)
    while i < end:  # '<=' would yield a spurious '' when len(s) is a multiple of n
        j = i + n
        yield s[i:j]
        i = j

def do_gen(s, n):
    for sub in gen_substring(s, n):
        sub.upper()

def do_while_simple(s, n):
    while s:
        sub = s[:n]
        s = s[n:]
        sub.upper()

def do_while_smarter(s, n):
    i = 0
    end = len(s)
    while i < end:  # '<=' would process a spurious '' when len(s) is a multiple of n
        j = i + n
        sub = s[i:j]
        i = j
        sub.upper()

def time_it(f, *args, **kwargs):
    start = clock()
    f(*args, **kwargs)
    end = clock()
    print "%s: %1.3f" % (f.func_name, end - start)

n = 4
size = 100000
s = 'a' * size

time_it(do_gen, s, n)
time_it(do_while_simple, s, n)
time_it(do_while_smarter, s, n)
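
The script above is Python 2 only (print statement, `func_name`, `time.clock`). A rough Python 3 sketch of the same comparison follows, assuming `time.perf_counter` as the timer; the functions return their results so the two approaches can be checked against each other, and no particular timings are claimed:

```python
from time import perf_counter


def gen_substring(s, n):
    """Yield successive n-character slices of s by index."""
    for i in range(0, len(s), n):
        yield s[i:i + n]


def do_gen(s, n):
    return [sub.upper() for sub in gen_substring(s, n)]


def do_while_simple(s, n):
    out = []
    while s:
        sub, s = s[:n], s[n:]  # s[n:] copies the whole remainder each time
        out.append(sub.upper())
    return out


def time_it(f, *args):
    """Run f(*args), print its elapsed wall-clock time, and return its result."""
    start = perf_counter()
    result = f(*args)
    print("%s: %.3f s" % (f.__name__, perf_counter() - start))
    return result


s = "a" * 100000
assert time_it(do_gen, s, 4) == time_it(do_while_simple, s, 4)
```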

-



Sun, 26 Dec 2004 00:24:26 GMT  

Quote:

>import re
>rex = re.compile('....', re.DOTALL)

To work with any int n, change to one of these

rex = re.compile('.{,%s}'%n, re.DOTALL)  # keeps remainder segment
rex = re.compile('.{%s}'%n, re.DOTALL)  # discards remainder segment

Quote:

>for substring in rex.findall(string):
>    <process substring>

Huaiyu


Sun, 26 Dec 2004 01:50:54 GMT  

Using Python 2:

[s[i:i+n] for i in range(0,len(s),n)]

 where:
 s - is the string to split
 n - is the number of characters to break at
 i - is some throwaway variable (previous value of i *not* protected)
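
A short worked example of the comprehension, with sample values for `s` and `n` assumed for illustration. Under Python 3 the loop variable `i` is local to the comprehension, so an outer `i` is left untouched:

```python
s = "abcdefghij"
n = 4
i = "untouched"

chunks = [s[i:i + n] for i in range(0, len(s), n)]

print(chunks)  # ['abcd', 'efgh', 'ij']
print(i)       # untouched (Python 3 does not leak the comprehension variable)
```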

Rich



Sun, 26 Dec 2004 02:21:54 GMT  

    Huaiyu> To work with any int n, change to one of these

    Huaiyu> rex = re.compile('.{,%s}'%n, re.DOTALL)  # keeps remainder segment

I think you want {1,%s}.  Note the spurious empty string at the end if at
least one character isn't required:

    >>> import re
    >>> rex = re.compile(r"(.{,4})")
    >>> re.findall(rex, "abcd")
    ['abcd', '']
    >>> rex = re.compile(r"(.{1,4})")
    >>> re.findall(rex, "abcd")
    ['abcd']
    >>> rex = re.compile(r"(.{,4})")
    >>> re.findall(rex, "abcde")
    ['abcd', 'e', '']
    >>> rex = re.compile(r"(.{1,4})")
    >>> re.findall(rex, "abcde")
    ['abcd', 'e']
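
Putting the two regex variants together, a minimal sketch in Python 3. The function name `regex_chunks` and its `keep_remainder` flag are hypothetical; `re.DOTALL` is passed so that `.` also matches newlines:

```python
import re


def regex_chunks(s, n, keep_remainder=True):
    """Split s into n-character pieces with a regular expression."""
    # {1,n} keeps a shorter final piece; {n} drops it.
    pattern = ".{1,%d}" % n if keep_remainder else ".{%d}" % n
    return re.findall(pattern, s, re.DOTALL)


print(regex_chunks("abcde", 4))         # ['abcd', 'e']
print(regex_chunks("abcde", 4, False))  # ['abcd']
print(regex_chunks("ab\ncd", 4))        # ['ab\nc', 'd']
```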

--
Skip Montanaro

consulting: http://manatee.mojam.com/~skip/resume.html



Sun, 26 Dec 2004 03:10:20 GMT  

I have a handy class I use for things like this:

class Group:
  def __init__(self, l, size):
    self.size=size
    self.l = l

  def __getitem__(self, group):
    idx = group * self.size
    if idx >= len(self.l):  # '>' would return an empty extra group at an exact multiple
      raise IndexError("Out of range")
    return self.l[idx:idx+self.size]

I use it mainly for grouping things like:
for x,y in Group([1,2,3,4,5,6,7,8,...],2):
  process_coords(x,y)
but it's also applicable to your problem, and works neatly with
strings.

try:

for substring in Group(string, n):
  <process substring>

Don't you just love Python's polymorphism!

You don't state what you want to do if the string's length isn't a
multiple of n characters.  This version includes the shorter string at
the end.
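
A sketch of the same class that runs under Python 3, where a `for` loop still falls back to calling `__getitem__` with 0, 1, 2, ... until `IndexError` is raised. It assumes a `>=` bounds check so that no empty group appears when the length is an exact multiple of the group size; the attribute name `seq` and the sample data are choices made for this example:

```python
class Group:
    """Present a sequence as consecutive groups of `size` items."""

    def __init__(self, seq, size):
        self.seq = seq
        self.size = size

    def __getitem__(self, group):
        idx = group * self.size
        # >= (not >) stops iteration before an empty trailing group is returned
        if idx >= len(self.seq):
            raise IndexError("Out of range")
        return self.seq[idx:idx + self.size]


for x, y in Group([1, 2, 3, 4, 5, 6], 2):
    print(x, y)

print(list(Group("abcdefghij", 4)))  # ['abcd', 'efgh', 'ij']
```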

Brian.



Sun, 26 Dec 2004 03:59:48 GMT  

Quote:

>    Huaiyu> To work with any int n, change to one of these

>    Huaiyu> rex = re.compile('.{,%s}'%n, re.DOTALL)  # keeps remainder segment

>I think you want {1,%s}.  Note the spurious empty string at the end if at
>least one character isn't required:

Yes, you're right.  That'll teach me to test before posting.

Huaiyu



Sun, 26 Dec 2004 05:45:45 GMT  


Quote:

>I have a handy class I use for things like this:

That's what I wanted!
--
Simon Foster
Cheltenham
England


Mon, 27 Dec 2004 04:32:44 GMT  
 