An Efficient Split Function 
Author Message
 An Efficient Split Function

I am reposting this to comp.lang.perl.misc since there seems to be
some confusion (and, I must say, some surprising anger) over the
contents of this post.

Let me outline my motivation for this posting to try to avert any
flame wars: we have a Perl program that processes log files that are
fairly regular --- they have lines split by a field separator
character; the fields are fixed in number; the fields are of a
predictable length.  This Perl program is taking nearly 20 hours to
run, so I decided to conduct a somewhat generalized investigation, and
compare one very small part of the basic process to a C++ version in
order to get a very crude estimate of the type of speedup I might
expect.  I compared the python version as well because I simply wanted
another data point (I fully expected the Python version to run more
slowly than the Perl version).  Note also that I was not trying to
write a *generalized* split function in C++ that mimics that of Perl;
I'm simply trying to write one that will handle the particular data
set that we have need of processing.  I'm fully aware of the power of
Perl, how flexible it is, etc., and that the C++ version that I have
written is specialized, intended to solve a particular problem, has
thereby a variety of limitations, etc., etc.

Let me be clear: I am not, it should be needless to say, waging an
"anti-Perl" war here.  I code in Perl, Python, and C++ a great deal,
and am not interested in why one language is better, broadly
considered, than another.  I am simply interested in a very narrow
comparison, that is intended to provide a starting point for a broader
inquiry.  I was hoping to devise a C++ version that ran in under 10%
of the time of the fasted interpreted version --- an arbitrarily
chosen goal.

The original post is very heavily slanted to the C++ algorithms that I
devised for the comparison.  The Perl code in question is up front for
those not interested in scanning further.

Now, here is my original post:

I have made an effort to write an efficient split function in C++,
comparing it to similar versions written in Python and Perl.  I'm
posting this to the three language communities because I would like
help in answering two questions:  Is my approach in general valid?
and, Are the particular solutions I've come up with in each language
reasonable?  Those who would like to reply to me directly can do so at
the address in my .sig.

An efficient split function is part of the foundation of many programs
--- especially those that, e.g, summarize information in log files of
various sorts.  The other part that is quite important for performance
issues, as I have discovered, is the routine used to read in the "raw"
information.  The idiom used here, quite common I'm sure, might be
thought of as "Read a line; Split on a separator character; Process
the fields" (we might call this the "RSP" idiom).  My focus below is
on the first two parts of this idiom.

I have been testing a set of functions I have written to split
character strings (both NULL-terminated character arrays and C++
strings).  I needed to see what sort of performance improvement I
could get over two interpreted languages I use quite frequently,
Python (version 1.5.2) and Perl (version 5.004_04).

For comparison, I used the following Perl program:

    my $count = 0;

    while (<STDIN>) {

        $count++;
    }
    print "$count\n";

and the following Python program:

    import sys, string

    def main():
        readline = sys.stdin.readline
        split = string.split

        count = 0
        while 1:
            line = readline()
            if not line: break
            l = split(line, '|')
            count = count + 1

        print count

    main()

As can be seen, to substitute for the third part of the RSP idiom I
simply coded a count of the number of lines processed.

Surprisingly, to me, the Python version far outperformed the Perl
version.  Running on 1 million lines of input of 9 fields each, the
Python version ran in just under 16 seconds, the Perl version in just
under 40 seconds (this on a 400Mhz Pentium Linux box). [Author's note:
this is a ratio of 2.5 to 1.]

For the C++ split function, the first time through I tried a very
straightforward approach using the C++ string and vector classes (I'm
using the egcs compiler, version 1.1b).  The split function prototype
is

    int split(vector<string>&, const string&, char);

(I've posted the implementation of the split functions below) and the
main function is:

    main() {
        string line;
        vector<string> v;
        int count = 0;

        while (getline(cin, line)) {
            split(v, line, '|');
            ++count;
        }
        cout << count << endl;
    }

This ran in about 8.5 seconds, (21% of Perl, 54% of Python).  But, this
wasn't the sort of speedup I had in mind, so I kept plowing ahead.

Previous tests of mine comparing the getline() version that works
with C++ strings and that which works with character arrays showed me
that the latter was much faster, albeit less flexible.  So, keeping
the general split algorithm intact, I recoded the split function:

    int split(vector<string>&, const char*, char);

and the main routine:

    main() {
        char line[1024];
        vector<string> v;
        int count = 0;

        while (cin.getline(line, 1024)) {
            split(v, line, '|');
            ++count;
        }
        cout << count << endl;
    }

This ran in about 4.75 seconds (12% of Perl, 30% of Python).  Better, but
I still wanted more.  So I devised a split routine to operate on character
arrays, but to create a vector of "spans" instead of strings:

    typedef pair<const char*, const char*> span;

    int split(vector<span>&, const char*, char);

and a new main routine:

    main() {
        char line[1024];
        vector<span> v(30);
        int count = 0;

        while (cin.getline(line, 1024)) {
            split(v, line, '|');
            ++count;
        }
        cout << count << endl;
    }

This ran in 1.2 seconds (3% of Perl, 7.5% of Python), which sounded about
where I wanted the performance to be.

For comparison purposes, here are the three split routines:

    // Split a string on a given character into a vector of strings
    // The vector is passed by reference and cleared each time
    // The number of strings split out is returned
    int split(vector<string>& v, const string& str, char c)
    {
        v.clear();
        string::const_iterator s = str.begin();
        while (true) {
            string::const_iterator begin = s;

            while (*s != c && s != str.end()) { ++s; }

            v.push_back(string(begin, s));

            if (s == str.end()) {
                break;
            }

            if (++s == str.end()) {
                v.push_back("");
                break;
            }
        }
        return v.size();
    }

    // Split a NULL-terminated character array on a given character into
    // a vector of strings
    // The vector is passed by reference and cleared each time
    // The number of strings split out is returned
    int split(vector<string>& v, const char* s, char c)
    {
        v.clear();
        while (true) {
            const char* begin = s;

            while (*s != c && *s) { ++s; }

            v.push_back(string(begin, s));

            if (!*s) {
                break;
            }

            if (!*++s) {
                v.push_back("");
                break;
            }
        }
        return v.size();
    }

    // Represents a span of a character array
    typedef pair<const char*, const char*> span;

    // Convenience function to set a span
    inline void set_span(span& s, const char* b, const char* e)
    {
        s.first = b;
        s.second = e;
    }

    // Split a NULL-terminated character array on a given character into
    // a vector of spans
    // The vector is of constant size is not cleared each time
    // The number of spans split out is returned
    int split(vector<span>& v, const char* s, char c)
    {
        int i = 0;
        while (true) {
            const char* begin = s;

            while (*s != c && *s) { ++s; }

            set_span(v[i++], begin, s);

            if (!*s) {
                break;
            }

            if (!*++s) {
                set_span(v[i++], 0, 0);
                break;
            }
        }
        return i;
    }

Each of these routines will split the following string:

    <field0>|<field1>|<field2>| ... |<fieldN>

where any of the N fields can be empty, into N+1 fields.

So, what do folks think of this?  Obviously, the particular span
approach above has the drawback that is relies both on a fixed-length
input buffer and a fixed-length vector of spans.  However, for my
purposes, this is safe enough and speed is paramount at this time.

I'm sure there are better ways to approach this, issues I haven't
thought of, speed gains to be found here and there, etc.  I'd be glad
to hear of them.

Bill
--
William S. Lear | Who is there that sees not that this inextricable labyrinth

z o p y r a .   | should  understand  their own  affairs, and, understanding,
c o m           | become inclined to conduct them?    ---William Godwin, 1793



Sat, 17 Nov 2001 03:00:00 GMT  
 An Efficient Split Function

  WSL> I am reposting this to comp.lang.perl.misc since there seems to be
  WSL> some confusion (and, I must say, some surprising anger) over the
  WSL> contents of this post.

  WSL> Let me outline my motivation for this posting to try to avert any
  WSL> flame wars: we have a Perl program that processes log files that are
  WSL> fairly regular --- they have lines split by a field separator
  WSL> character; the fields are fixed in number; the fields are of a
  WSL> predictable length.  This Perl program is taking nearly 20 hours to

how predictable are the field lengths? are they fixed? if so, then
perl's unpack function is much faster than split. try benchmarking it
against your split code.

and if you want the fastest way to do it, try a c loop that looks for
the separator char and replaces it with \0. keep track of the start of
each field with a char * array. this is a single pass over the string
and probably couldn't get much faster. it modifies the string in place
so it is destructive.

<untested c code >

char    *field_ptrs[10] ;
char    **curr_fld_ptr = field_ptrs ;
char    string[] = "test:data:full:of:fields" ;

char    *cp ;

        cp = string ;

        *curr_fld_ptr++ = cp ;

        while( *cp ) {

                if ( *cp == ':' ) {

                        *cp++ = '\0' ;
                        *curr_fld_ptr++ = cp ;
                }
                else {
                        cp++ ;
                }
        }

uri

--
Uri Guttman  -----------------  SYStems ARCHitecture and Software Engineering

Have Perl, Will Travel  -----------------------------  http://www.sysarch.com
The Best Search Engine on the Net -------------  http://www.northernlight.com



Sat, 17 Nov 2001 03:00:00 GMT  
 
 [ 2 post ] 

 Relevant Pages 

1. How to use split function to split on a backslash

2. More efficient than split?

3. Problem with join function (and split function)

4. using split function twice on same line

5. Split function

6. Using SPLIT function with a Period

7. Using a period as a delimiter in the split() function

8. split function that handles quoting..

9. perlcc does not compile when split to array function is used

10. Problem in 'split' function

11. Newbie, split function, and counter

12. Split function in Perl

 

 
Powered by phpBB® Forum Software