I am reposting this to comp.lang.perl.misc since there seems to be
some confusion (and, I must say, some surprising anger) over the
contents of this post.
Let me outline my motivation for this posting to try to avert any
flame wars: we have a Perl program that processes log files that are
fairly regular --- they have lines split by a field separator
character; the fields are fixed in number; the fields are of a
predictable length. This Perl program is taking nearly 20 hours to
run, so I decided to conduct a somewhat generalized investigation, and
compare one very small part of the basic process to a C++ version in
order to get a very crude estimate of the type of speedup I might
expect. I compared the python version as well because I simply wanted
another data point (I fully expected the Python version to run more
slowly than the Perl version). Note also that I was not trying to
write a *generalized* split function in C++ that mimics that of Perl;
I'm simply trying to write one that will handle the particular data
set that we have need of processing. I'm fully aware of the power of
Perl, how flexible it is, etc., and that the C++ version that I have
written is specialized, intended to solve a particular problem, has
thereby a variety of limitations, etc., etc.
Let me be clear: I am not, it should be needless to say, waging an
"anti-Perl" war here. I code in Perl, Python, and C++ a great deal,
and am not interested in why one language is better, broadly
considered, than another. I am simply interested in a very narrow
comparison, that is intended to provide a starting point for a broader
inquiry. I was hoping to devise a C++ version that ran in under 10%
of the time of the fasted interpreted version --- an arbitrarily
chosen goal.
The original post is very heavily slanted to the C++ algorithms that I
devised for the comparison. The Perl code in question is up front for
those not interested in scanning further.
Now, here is my original post:
I have made an effort to write an efficient split function in C++,
comparing it to similar versions written in Python and Perl. I'm
posting this to the three language communities because I would like
help in answering two questions: Is my approach in general valid?
and, Are the particular solutions I've come up with in each language
reasonable? Those who would like to reply to me directly can do so at
the address in my .sig.
An efficient split function is part of the foundation of many programs
--- especially those that, e.g, summarize information in log files of
various sorts. The other part that is quite important for performance
issues, as I have discovered, is the routine used to read in the "raw"
information. The idiom used here, quite common I'm sure, might be
thought of as "Read a line; Split on a separator character; Process
the fields" (we might call this the "RSP" idiom). My focus below is
on the first two parts of this idiom.
I have been testing a set of functions I have written to split
character strings (both NULL-terminated character arrays and C++
strings). I needed to see what sort of performance improvement I
could get over two interpreted languages I use quite frequently,
Python (version 1.5.2) and Perl (version 5.004_04).
For comparison, I used the following Perl program:
my $count = 0;
while (<STDIN>) {
$count++;
}
print "$count\n";
and the following Python program:
import sys, string
def main():
readline = sys.stdin.readline
split = string.split
count = 0
while 1:
line = readline()
if not line: break
l = split(line, '|')
count = count + 1
print count
main()
As can be seen, to substitute for the third part of the RSP idiom I
simply coded a count of the number of lines processed.
Surprisingly, to me, the Python version far outperformed the Perl
version. Running on 1 million lines of input of 9 fields each, the
Python version ran in just under 16 seconds, the Perl version in just
under 40 seconds (this on a 400Mhz Pentium Linux box). [Author's note:
this is a ratio of 2.5 to 1.]
For the C++ split function, the first time through I tried a very
straightforward approach using the C++ string and vector classes (I'm
using the egcs compiler, version 1.1b). The split function prototype
is
int split(vector<string>&, const string&, char);
(I've posted the implementation of the split functions below) and the
main function is:
main() {
string line;
vector<string> v;
int count = 0;
while (getline(cin, line)) {
split(v, line, '|');
++count;
}
cout << count << endl;
}
This ran in about 8.5 seconds, (21% of Perl, 54% of Python). But, this
wasn't the sort of speedup I had in mind, so I kept plowing ahead.
Previous tests of mine comparing the getline() version that works
with C++ strings and that which works with character arrays showed me
that the latter was much faster, albeit less flexible. So, keeping
the general split algorithm intact, I recoded the split function:
int split(vector<string>&, const char*, char);
and the main routine:
main() {
char line[1024];
vector<string> v;
int count = 0;
while (cin.getline(line, 1024)) {
split(v, line, '|');
++count;
}
cout << count << endl;
}
This ran in about 4.75 seconds (12% of Perl, 30% of Python). Better, but
I still wanted more. So I devised a split routine to operate on character
arrays, but to create a vector of "spans" instead of strings:
typedef pair<const char*, const char*> span;
int split(vector<span>&, const char*, char);
and a new main routine:
main() {
char line[1024];
vector<span> v(30);
int count = 0;
while (cin.getline(line, 1024)) {
split(v, line, '|');
++count;
}
cout << count << endl;
}
This ran in 1.2 seconds (3% of Perl, 7.5% of Python), which sounded about
where I wanted the performance to be.
For comparison purposes, here are the three split routines:
// Split a string on a given character into a vector of strings
// The vector is passed by reference and cleared each time
// The number of strings split out is returned
int split(vector<string>& v, const string& str, char c)
{
v.clear();
string::const_iterator s = str.begin();
while (true) {
string::const_iterator begin = s;
while (*s != c && s != str.end()) { ++s; }
v.push_back(string(begin, s));
if (s == str.end()) {
break;
}
if (++s == str.end()) {
v.push_back("");
break;
}
}
return v.size();
}
// Split a NULL-terminated character array on a given character into
// a vector of strings
// The vector is passed by reference and cleared each time
// The number of strings split out is returned
int split(vector<string>& v, const char* s, char c)
{
v.clear();
while (true) {
const char* begin = s;
while (*s != c && *s) { ++s; }
v.push_back(string(begin, s));
if (!*s) {
break;
}
if (!*++s) {
v.push_back("");
break;
}
}
return v.size();
}
// Represents a span of a character array
typedef pair<const char*, const char*> span;
// Convenience function to set a span
inline void set_span(span& s, const char* b, const char* e)
{
s.first = b;
s.second = e;
}
// Split a NULL-terminated character array on a given character into
// a vector of spans
// The vector is of constant size is not cleared each time
// The number of spans split out is returned
int split(vector<span>& v, const char* s, char c)
{
int i = 0;
while (true) {
const char* begin = s;
while (*s != c && *s) { ++s; }
set_span(v[i++], begin, s);
if (!*s) {
break;
}
if (!*++s) {
set_span(v[i++], 0, 0);
break;
}
}
return i;
}
Each of these routines will split the following string:
<field0>|<field1>|<field2>| ... |<fieldN>
where any of the N fields can be empty, into N+1 fields.
So, what do folks think of this? Obviously, the particular span
approach above has the drawback that is relies both on a fixed-length
input buffer and a fixed-length vector of spans. However, for my
purposes, this is safe enough and speed is paramount at this time.
I'm sure there are better ways to approach this, issues I haven't
thought of, speed gains to be found here and there, etc. I'd be glad
to hear of them.
Bill
--
William S. Lear | Who is there that sees not that this inextricable labyrinth
z o p y r a . | should understand their own affairs, and, understanding,
c o m | become inclined to conduct them? ---William Godwin, 1793