need fast parser for comma/space delimited numbers 

I have written an application for reading in large amounts of
space/comma delimited numbers from ASCII text files for statistical
processing.

I originally used re expressions for splitting, but i was able to cut
the time required for data file parsing down to a third by using
string.split on the comma or space.

Still, the app takes about 5 minutes to parse a typical set of data
files. I'd like to drop that down to a minute if possible.

Which means i probably need to wrap in a C module with something like
an sscanf. Or maybe just a function which finds the delimiters and
delivers the number parts of string (defined by delimiters) to atoi
and atof functions.

But before i get started, i imagine someone else has already done
this.

anyone have pointers to said code or suggestions? i'll happily post my
code if there is none out there already.

the two file formats look like this:

700 lines like this times about 100 files:

...
356     0.23514 0.1784
357     0.2206  0.22021
358     0.27676 0.41483
359     0.10083 0.33827
360     0.072568        0.3547
361     0.17443 0.41647
362     0.30491 0.27886
363     0.25666 0.32906
364     0.22523 0.46709
365     0.28276 0.65154
...

181 lines like this times about 100 files:
...
    -4     90.43153  99.08258  
    -3     92.77277  100.00000  
    -2     93.88273  98.95287  
    -1     96.51977  98.49538  
    0      99.23279  98.57191  
    1      100.00000 97.05283  
    2      98.52036  93.01269  
...

and an occasional file that looks like so:
...
378,  0.001094949000,  0.000031531040,  0.005158320000
379,  0.001231154000,  0.000035215210,  0.005802907000
380,  0.001368000000,  0.000039000000,  0.006450001000
381,  0.001502050000,  0.000042826400,  0.007083216000
382,  0.001642328000,  0.000046914600,  0.007745488000
...

that 0.3547 over on the right: that ain't my fault, that's the test
instrument's formatting of the output strings, which i have zero
control over.

also note the varied number of digits in expression, also outta my
control.

i'd prefer if the first column of the data be parsed as an integer,
but thats not absolutely essential.
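
For reference, a minimal sketch of the parsing task described above, written in present-day Python 3 rather than the Python of this thread. The function name parse_row is hypothetical; the sample lines are taken from the three formats shown. It treats commas as whitespace, then returns the first column as an int and the rest as floats.

```python
def parse_row(line):
    """Split one data line on commas and/or whitespace.

    Returns (first_column_as_int, [remaining_columns_as_floats]).
    """
    # Turning commas into spaces lets one split() call handle all formats.
    fields = line.replace(",", " ").split()
    return int(fields[0]), [float(f) for f in fields[1:]]


samples = [
    "360     0.072568        0.3547",
    "    -4     90.43153  99.08258  ",
    "378,  0.001094949000,  0.000031531040,  0.005158320000",
]
for s in samples:
    print(parse_row(s))
```

The irregular column spacing and the varying digit counts need no special handling, because split() with no argument collapses any run of whitespace.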

les schaffer

here's the core of what i have now. the lines of the ASCII data file
are already in python as a list of strings (strLines passed to
grabArrayData() ).

    def __parseIFF(self, str):

        """Grab one int and the rest floats from string array
        str. Return array with first element independent variable and
        rest dependent variables"""

        array = [string.atoi(str[0])]
        for item in str[1:] :
            array.append( string.atof(item)  )
        return  array

    def __parseFFF(self, str):

        """Grab one set of floats from string array str. Return array with
        first element independent variable and rest dependent
        variables"""

        return map( string.atof, str )

    def __breakStringOnSpace(self, str):

        """break one line str containing numbers in a string format on
        whitespace (or comma), return array with the strings representing the
        numbers."""

        return filter(None, string.splitfields(str, self.splitStr)  )

    def setBreakExp(self, brk):

        """set whether we are using commas instead of white-space for
        splitting"""
        self.splitStr = brk

    def grabArrayData( self, strLines ):

        """ Feed grabArrayData an array of strings containing data,
        strLines.

        grabArrayData returns Numeric arrays for the independent and
        dependent variables."""

        (mPoints, mValues) = self.__createArrays(strLines)

        # self.parse set to either __parseFFF or __parseIFF in
        # __init__
        parse = self.parse
        breakString = self.__breakStringOnSpace

        for i  in range( 0, len(strLines) ):
            num = parse( breakString( strLines[i] ) )
            mPoints[i] = num[0]
            mValues[ 0:self.rows, i] = num[1:]

        return mPoints, mValues

here's where things stand at the moment:

         186209 function calls in 305.260 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    86436  105.620    0.001  105.620    0.001 NumberParser.py:161(__parseIFF)
    86632   50.300    0.001   50.300    0.001 NumberParser.py:180(__breakStringOnSpace)
      196    0.450    0.002    0.560    0.003 NumberParser.py:189(__createArrays)
      196  122.010    0.623  278.380    1.420 NumberParser.py:206(grabArrayData)
[snip]

so a little over one-third the time is in performing the for-loop
inside grabArrayData, about 1/3 the time is doing atoi/atof, and 1/6
of the time is breaking strings on white-space or commas.

my next move: move the __parseXFF and __breakStringOnSpace inside the
for loop rather than making (expensive???) function calls on each
line. or perhaps doing all the breaking first then the atoi'ing next.

other suggestions?



Wed, 04 Sep 2002 03:00:00 GMT  
Les Schaffer wants speed:

Quote:
> I have written an application for reading in large amounts of
> space/comma delimited numbers from ASCII text files for statistical
> processing.
...
> Still, the app takes about 5 minutes to parse a typical set of data
> files. I'd like to drop that down to a minute if possible.

We can speed up what you've got, but probably not that much!
...

Quote:
> here's the core of what i have now. the lines of the ASCII data file
> are already in python as a list of strings (strLines passed to
> grabArrayData() ).

>     def __parseIFF(self, str):

>         """Grab one int and the rest floats from string array
>         str. Return array with first element independent variable and
>         rest dependent variables"""

>         array = [string.atoi(str[0])]
>         for item in str[1:] :
>             array.append( string.atof(item)  )
>         return  array

First, use "def __parseIFF(self, str, atoi=string.atoi,
atof=string.atof):" and then access those as locals.

Second, benchmark against "int" and "float".
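
A minimal sketch of both suggestions, assuming present-day Python 3 and the standard timeit module (neither appears in the thread). The names parse_plain and parse_bound are hypothetical. The second function binds int and float as default arguments, so they are evaluated once at def time and used as locals in the body. How much this gains depends on the interpreter version, so it is worth measuring rather than assuming.

```python
import timeit

FIELDS = ["356", "0.23514", "0.1784"]


def parse_plain(fields):
    # int/float are looked up through the global and builtin namespaces
    # every time the names are used.
    out = [int(fields[0])]
    for item in fields[1:]:
        out.append(float(item))
    return out


def parse_bound(fields, _int=int, _float=float):
    # The defaults are evaluated once, when the def statement runs;
    # inside the body _int/_float are ordinary local variables.
    out = [_int(fields[0])]
    for item in fields[1:]:
        out.append(_float(item))
    return out


for fn in (parse_plain, parse_bound):
    secs = timeit.timeit(lambda: fn(FIELDS), number=100_000)
    print(f"{fn.__name__}: {secs:.3f}s")
```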

Quote:
>     def __parseFFF(self, str):

>         """Grab one set of floats from string array str. Return array with
>         first element independent variable and rest dependent
>         variables"""

>         return map( string.atof, str )

Same here.

Quote:
>     def __breakStringOnSpace(self, str):

>         """break one line str containing numbers in a string format on
>         whitespace (or comma), return array with the strings representing the
>         numbers."""

>         return filter(None, string.splitfields(str, self.splitStr)  )

First, splitfields is obsolete, use "split". Second, special case
the whitespace case, because that would just be "split(str)".
Third, use locals trick.
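
A short sketch of why the whitespace case deserves its own path, using the str.split method of present-day Python 3 (an assumption; the thread uses the string module functions). Splitting on an explicit single space produces empty strings that must be filtered out, while split() with no separator does not.

```python
line = "356     0.23514 0.1784"

# An explicit " " separator yields an empty string for every extra space,
# so the result has to be filtered afterwards.
explicit = [f for f in line.split(" ") if f]

# No separator: any run of whitespace counts as one delimiter and
# leading/trailing whitespace is ignored, so no filter is needed.
implicit = line.split()

print(explicit)
print(implicit)
```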

Quote:
>     def setBreakExp(self, brk):

>         """set whether we are using commas instead of white-space for
>         splitting"""
>         self.splitStr = brk

>     def grabArrayData( self, strLines ):

>         """ Feed grabArrayData an array of strings containing data,
>         strLines.

>         grabArrayData returns Numeric arrays for the independent and
>         dependent variables."""

>         (mPoints, mValues) = self.__createArrays(strLines)

>         # self.parse set to either __parseFFF or __parseIFF in
>         # __init__
>         parse = self.parse
>         breakString = self.__breakStringOnSpace

>         for i  in range( 0, len(strLines) ):
>             num = parse( breakString( strLines[i] ) )

For the all floats, all whitespace case, this would just be
 num = map(float, split(strLines[i]))
and that might get you the speed you want.

For the comma case, you might try:
  s = join(split(strLines[i], ','), ' ')
  num = map(float, split(s))
or
  t = split(strLines[i], ',')
  t = map(strip, t)
  num = map(float, t)

With split, join, strip all being string module functions, made
fast by binding them as default args.

Quote:
>             mPoints[i] = num[0]
>             mValues[ 0:self.rows, i] = num[1:]

>         return mPoints, mValues

- Gordon


Wed, 04 Sep 2002 03:00:00 GMT  


Quote:

>     def __breakStringOnSpace(self, str):

>         """break one line str containing numbers in a string format on
>         whitespace (or comma), return array with the strings representing the
>         numbers."""

>         return filter(None, string.splitfields(str, self.splitStr)  )

Why are you using filter?  Why not just

return string.split(str, self.splitStr)
--

Androgynous poly {*filter*} vanilla {*filter*} het    <*>     http://www.*-*-*.com/
Hugs and backrubs -- I break Rule 6

Three sins: BJ, B&J, B&J  --Aahz



Wed, 04 Sep 2002 03:00:00 GMT  

Quote:
> I have written an application for reading in large amounts of
> space/comma delimited numbers from ASCII text files for statistical
> processing.

> I originally used re expressions for splitting, but i was able to cut
> the time required for data file parsing down to a third by using
> string.split on the comma or space.

> Still, the app takes about 5 minutes to parse a typical set of data
> files. I'd like to drop that down to a minute if possible.

Hmm, I must be missing something.  The following code takes about 8s on
a Sparc 10 and about 5 on a Linux w/ pentium II 400MHz.

import string, time

line = '356     0.23514 0.1784'

start_time = time.time ()
for i in 90000 * [None]:
    split_line = filter (None, string.split (line))
    n = int (split_line [0])
    x = float (split_line [1])
    y = float (split_line [2])

print time.time () - start_time    

...yet 90000 is about the number of lines you are parsing, right?

Alex.



Wed, 04 Sep 2002 03:00:00 GMT  

    >> Les Schaffer wants speed:

hmmm.... i am going to get a bad reputation ...

    >> We can speed up what you've got, but probably not that much!

your ideas look real good. i will try them first before thinking about
doing a C module.

    >> First, use "def __parseIFF(self, str, atoi=string.atoi,
    >> atof=string.atof):" and then access those as locals.

this is interesting. is there a difference between

def __parseIFF(self, str, atoi=string.atoi):
  ...

and

def __parseIFF(self, str):
   atoi = string.atoi
   ...

i am guessing there is enough difference, never thought about it till
now. i guess the atoi=string.atoi creates a "static" local copy for
this function, the assignment done only once, whereas the
atoi=string.atoi in the body of the def gets executed every stinkin
time, correct?

    >> Second, benchmark against "int" and "float".

okay. i noticed in the Scientific Python modules K. Hinsen uses
something like this

numb = exec( str )

with str being things like ' 4.235 ', etc. i wonder which is faster?
(thinking out loud)

    >> First, splitfields is obsolete, use "split".

sheeesh. i read the manual all the time and i constantly confuse which
of them is obsolete and which isnt. Someone toss that splitfields out
the window, please!!!!

    >> Second, special case the whitespace case, because that would
    >> just be "split(str)".  Third, use locals trick.

i think i can swing the special case trick, cause code using this class
can know ahead of time if its csv or whitespace.

Quote:
> For the all floats, all whitespace case, this would just be
>  num = map(float, split(strLines[i]))
> and that might get you the speed you want.

okay. is there a big difference between the string.ato[if] and
float/int?

Quote:
> For the comma case, you might try:
>   s = join(split(strLines[i], ','), ' ')
>   num = map(float, split(s))
> or
>  t = split(strLines[i], ',')
>  t = map(strip, t)
>  num = map(float, t)

will give'em a try...

anyone care to take a guesstimate on how much further time i could
save by coding something in C? if i did that, i would write a function
which takes a Python list object (list of string) and passes back a
pair of Numeric array objects (dependent and independent
variables). so i would cut out all the python for looping as well.

many thanks, gordon!

les schaffer



Wed, 04 Sep 2002 03:00:00 GMT  

Quote:

> anyone care to take a guesstimate on how much further time i could
> save by coding something in C?

One other point, don't read the files one line at a time.
That's slow in Python.
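
A minimal sketch of that point, assuming present-day Python 3: read the whole file in one call and split the entire text at once, since newlines count as whitespace. The io.StringIO object stands in for a real file, and the column count ncols is assumed to be known for the file format.

```python
import io

# Stand-in for a data file; in practice this would come from open(path).
fake_file = io.StringIO(
    "356     0.23514 0.1784\n"
    "357     0.2206  0.22021\n"
    "358     0.27676 0.41483\n"
)

text = fake_file.read()   # one read for the whole file
tokens = text.split()     # newlines are whitespace too
ncols = 3                 # assumption: fixed, known number of columns

# Regroup the flat token list into rows of ncols tokens each.
rows = [tokens[i:i + ncols] for i in range(0, len(tokens), ncols)]
points = [int(r[0]) for r in rows]
values = [[float(f) for f in r[1:]] for r in rows]

print(points)
print(values)
```

This only works when every line has the same number of fields; a ragged file would need per-line splitting instead.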

My experience with extension modules says that you might expect 6x
improvement. The number can vary wildly depending on how optimal your Python
solution was in the first place.

Here's some code I did for fun and never used.
http://www.dorb.com/darrell/csv/testCsv.py
There's a thread on deja that describes how other people optimized this
problem in Python. I did it in 'C' just to see how much faster it would be.

I'd give you the deja link but they seem to be suffering a denial of service
attack or something.

--Darrell



Wed, 04 Sep 2002 03:00:00 GMT  
[posted & mailed]

[Les Schaffer]

Quote:
> ...
> Still, the app takes about 5 minutes to parse a typical set of data
> files. I'd like to drop that down to a minute if possible.

Before you go nuts with low-level trickery,

Quote:
> ...
> here's where things stand at the moment:

>          186209 function calls in 305.260 CPU seconds

>    Ordered by: standard name
> ...

Is "305.260 CPU seconds" the "about 5 minutes" you're talking about?  If so,
tell us how long the app takes when *not* using the profiler <0.5 wink --
the profiler is good for getting a sense of relative times, but usually adds
very significant per-call overheads of its own>.
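
One way to check this, sketched for present-day Python 3: time the same work once with a plain wall clock and once under the profiler. cProfile and time.perf_counter are standard-library tools that postdate this thread, and parse_line / parse_all are hypothetical stand-ins for the real parser. The printed numbers depend on the machine, so none are claimed here.

```python
import cProfile
import time

LINES = ["356     0.23514 0.1784"] * 50_000


def parse_line(line):
    return [float(f) for f in line.split()]


def parse_all(lines):
    # One Python-level function call per line, as in the original code.
    return [parse_line(ln) for ln in lines]


# Plain wall-clock timing.
t0 = time.perf_counter()
plain_result = parse_all(LINES)
plain_secs = time.perf_counter() - t0

# The same work with the profiler's per-call bookkeeping switched on.
profiler = cProfile.Profile()
t0 = time.perf_counter()
profiled_result = profiler.runcall(parse_all, LINES)
profiled_secs = time.perf_counter() - t0

print(f"plain:    {plain_secs:.3f}s")
print(f"profiled: {profiled_secs:.3f}s")
```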

Picking on the most expensive function:

    def __parseIFF(self, str):
        """..."""
        array = [string.atoi(str[0])]
        for item in str[1:] :
            array.append( string.atof(item)  )
        return array

First we can speed it up:

    def __parseIFF(self, str):
        """..."""
        array = map(float, str)  # btw, "str" is a poor name for a list
        array[0] = int(array[0])
        return array

and then you should write it inline.  This self-contained test is close to
your worst-case parsing problem and takes under 25 seconds on my creaky old
P5-166:

def doit():
    data = ["378",
            "   0.001094949000",
            "  0.000031531040",
            "  0.005158320000"]
    _int, _map, _float = int, map, float  # localize for minor gain
    for i in xrange(86436):   # the # of times the parser got called
        array = _map(_float, data)
        array[0] = _int(array[0])

from time import clock
start = clock()
doit()
finish = clock()
print round(finish - start, 3)

no-need-for-c-ly y'rs  - tim



Wed, 04 Sep 2002 03:00:00 GMT  

Quote:


>     >> Les Schaffer wants speed:

> hmmm.... i am going to get a bad reputation ...

Um, I hate to tell you, but... oh, nevermind <wink>.

Quote:
>     >> First, use "def __parseIFF(self, str, atoi=string.atoi,
>     >> atof=string.atof):" and then access those as locals.

> this is interesting. is there a difference between

> def __parseIFF(self, str, atoi=string.atoi):
>   ...

> and

> def __parseIFF(self, str):
>    atoi = string.atoi
>    ...

> i am guessing there is enough difference, never thought about it till
> now. i guess the atoi=string.atoi creates a "static" local copy for
> this function, the assignment done only once, whereas the
> atoi=string.atoi in the body of the def gets executed every stinkin
> time, correct?

Absolutely. This cheap trick is frequently worth 30% to 60% in
a moderate to tight loop.

Quote:
> > For the all floats, all whitespace case, this would just be
> >  num = map(float, split(strLines[i]))
> > and that might get you the speed you want.

> okay. is there a big difference between the string.ato[if] and
> float/int?

I *think* I recall that int / float were somewhat faster, but I
can't be sure. The major thing here is to get away from extra
function calls. Flatten your loop as much as possible -
function calls are (relatively) expensive.
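
A sketch of what flattening means for this parser, in present-day Python 3 with plain lists instead of Numeric arrays (both assumptions). grab_nested mirrors the original structure with two helper calls per line; grab_flat does the same work inline. All function names here are hypothetical.

```python
LINES = ["356     0.23514 0.1784", "357     0.2206  0.22021"]


def break_string(line):
    return line.split()


def parse_iff(fields):
    return [int(fields[0])] + [float(f) for f in fields[1:]]


def grab_nested(lines):
    # Two extra Python-level function calls per line.
    points, values = [], []
    for line in lines:
        num = parse_iff(break_string(line))
        points.append(num[0])
        values.append(num[1:])
    return points, values


def grab_flat(lines):
    # Same work, with the helpers written inline in the loop body.
    points, values = [], []
    for line in lines:
        fields = line.split()
        points.append(int(fields[0]))
        values.append([float(f) for f in fields[1:]])
    return points, values


print(grab_flat(LINES))
```

The two versions return identical results, so they can be swapped and timed against each other on real data.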

- Gordon



Wed, 04 Sep 2002 03:00:00 GMT  

Quote:
> Um, I hate to tell you, but... oh, nevermind <wink>.

what .... he was referring to black beauties?????? (what is speed
called on the street these days?????)

Quote:
> Absolutely. This cheap trick is frequently worth 30% to 60% in a
> moderate to tight loop.

this is great. probably worth more to me than this stinking csv parser
;-)

Quote:
> I *think* I recall that int / float were somewhat faster, but I
> can't be sure. The major thing here is to get away from extra
> function calls. Flatten your loop as much as possible - function
> calls are (relatively) expensive.

yeah. that's next on the list.

les



Wed, 04 Sep 2002 03:00:00 GMT  

Quote:

> Les Schaffer wants speed:
...
> For the comma case, you might try:
>   s = join(split(strLines[i], ','), ' ')
>   num = map(float, split(s))
> or
>   t = split(strLines[i], ',')
>   t = map(strip, t)
>   num = map(float, t)

As an addition:

    s = replace(strLines[i], ',', ' ')

is faster than

    s = join(split(strLines[i], ','), ' ')

string.translate *might* be again a little faster.
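
The three variants side by side, sketched with the string methods of present-day Python 3 (an assumption; the thread uses the string module functions). str.maketrans builds the translation table. All three give the same field list, so which one is fastest is a matter of measurement.

```python
line = "378,  0.001094949000,  0.000031531040,  0.005158320000"

# Variant 1: split on commas, then re-join with spaces.
via_join = " ".join(line.split(",")).split()

# Variant 2: replace each comma with a space in a single pass.
via_replace = line.replace(",", " ").split()

# Variant 3: translate through a table that maps "," to " ".
COMMA_TO_SPACE = str.maketrans(",", " ")
via_translate = line.translate(COMMA_TO_SPACE).split()

nums = [float(f) for f in via_replace]
print(nums)
```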

ciao - chris

--

Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     we're tired of banana software - shipped green, ripens at home



Wed, 04 Sep 2002 03:00:00 GMT  

Quote:


>     >> Second, benchmark against "int" and "float".

> okay. i noticed in the Scientific Python modules K. Hinsen uses

I haven't seen those, but...

Quote:
> something like this

> numb = exec( str )

..I guess you mean eval()

Quote:
> with str being things like ' 4.235 ', etc. i wonder which is faster?
> (thinking out loud)

int/float should be faster because eval creates an intermediate code
object which is then executed by the bytecode interpreter. eval has the
advantage that it automatically distinguishes between ints and floats,
though, and other literals and even expressions for that matter.

It is, of course, a gaping security hole unless you use rexec.
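
A sketch of the trade-off in present-day Python 3. ast.literal_eval is a standard-library function that is not mentioned in this thread; it is swapped in here because it also picks int or float from the literal, as eval does, but accepts only literals and rejects calls and other expressions.

```python
import ast

tokens = [" 4.235 ", "378", "-4"]

# float() requires the caller to choose the type up front.
as_floats = [float(t) for t in tokens]

# literal_eval chooses the type from the text itself.
as_literals = [ast.literal_eval(t.strip()) for t in tokens]

# Anything that is not a plain literal is refused with ValueError.
rejected = False
try:
    ast.literal_eval("__import__('os').getcwd()")
except ValueError:
    rejected = True

print(as_floats)
print(as_literals)
print(rejected)
```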

--
Bernhard Herzog   | Sketch, a drawing program for Unix



Wed, 04 Sep 2002 03:00:00 GMT  
Tim said:

Quote:
> Is "305.260 CPU seconds" the "about 5 minutes" you're talking about?
> If so, tell us how long the app takes when *not* using the profiler
> <0.5 wink -- the profiler is good for getting a sense of relative
> times, but usually adds very significant per-call overheads of its
> own>.

has someone put a spell on my machine?

my code runs the same length of time with and without the profiler.

??????????

les



Wed, 04 Sep 2002 03:00:00 GMT  

Quote:
> Why are you using filter?  Why not just
> return string.split(str, self.splitStr)

(grumble ... moan .... self-loathing.....)

i was doing this:

x = string.split( 'hey boo   whadya got   in your    picnic  basket',' ' )

and getting

['hey', 'boo', '', '', 'whadya', 'got', '', '', 'in', 'your', '', '',
'', 'picnic', '', 'basket']

hey! dats how we dit eet in perl, no????

les



Thu, 05 Sep 2002 03:00:00 GMT  

Quote:
> ...
> start_time = time.time ()
> for i in 90000 * [None]:
> ...
> print time.time () - start_time    

(gustav)~/Engineering/dspring/stoplite/matlab/Jue-Data/: python testTime.py
156.307130933

Quote:
> Hmm, I must be missing something.  The following code takes about 8s
> on a Sparc 10 and about 5 on a Linux w/ pentium II 400MHz.

you think everyone runs on a 400 MHz machine  ;-)

P90 .... i'm due for an upgrade....



Thu, 05 Sep 2002 03:00:00 GMT  

<about eval>

Quote:
> It is, of course, a gaping security hole unless you use rexec.

Sorry to pick on you (I do admire you for Sketch) but this is a common
myth, which is totally untrue. Think about a random program, say, hmmm...
Sketch. I run it from my account. Now say Sketch wants to let me execute
some random Python code -- how is it a security hole? If I wanted to
delete my file system, I'd do it myself. I don't need Sketch so that I
can type into it __import__("shutil").rmtree('/'); no, I can just rm -rf
myself.

Now, most Python programs (certainly most scientific Python programs) are
not run as CGI's and the like, but rather by the user who wants them to
run. So, using eval/exec without rexec is perfectly all right from a
security POV.

The only problem is that you're executing random Python code, which means
you won't be able to understand bug-reports. But that's a reason to use
eval/exec with the optional dictionaries, not for rexec.
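
A small sketch of eval with an explicit dictionary, in present-day Python 3. The namespace contents are hypothetical example data. The dictionary controls which names the expression sees, which keeps its effects away from the program's own variables; it is not a sandbox and is not presented as one.

```python
# Names the expression is allowed to see (hypothetical example data).
namespace = {"x": 2.5, "scale": 4}

# The expression resolves x and scale from the dictionary, not from
# the surrounding module.
result = eval("x * scale + 1", namespace)

# A name missing from the dictionary (and from the builtins) fails
# with NameError instead of silently picking up a program variable.
unknown_name_failed = False
try:
    eval("y + 1", namespace)
except NameError:
    unknown_name_failed = True

print(result)
print(unknown_name_failed)
```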

My minor rant for today.

Keep Sketching!
--

http://www.oreilly.com/news/prescod_0300.html
http://www.linux.org.il -- we put the penguin in .com


