efficiency in data-processing perl script ...

I haven't written much Perl, and what I have written has been for things
where speed isn't much of an issue.  Recently, though, I have been doing
some work where I have to process a few hundred MB of data in chunks
of up to (so far :-) 170MB.

One of my scripts runs a lot slower than I would like, and I think it is
because it isn't growing its hash table enough to avoid linear-search
behavior.

Basically, the script loops over the contents of a file reading binary
records. Every record is either a malloc or a free (these are malloc
traces) for an address.  If it's a free, then I look up the stuff from
when I saw that address as a malloc and write output.  I use an associative
array to do the mapping from addresses to data values.

This just takes a lot longer than I think it should and the thing never uses
more than 1MB of space even though I shove huge traces through it with
many "objects" being alive at the same time (so I expect the hash table
to have to be pretty big).

Any hints/help appreciated!

David Boles

#
# David Boles  --  Feb. 4, 1994
#
# This filter takes binary input records of the form:
#
#   size:Int32, address:UInt32, time_sec:Int32, time_usec:Int32
#
# and outputs binary records of the form:
#
#   identity:Int32, size:Int32, time_sec:Int32, time_usec:Int32
#
# where identity is determined by the address and is created when
# that address is malloc'd.  The later free of that address refers
# to the same object identity but a subsequent malloc for that
# address refers to a new object and is granted a new identity.
#
# Whether the operation is a malloc or free is encoded in the sign
# of the size.  If the size is -1, then this is a free; otherwise it is
# an allocation.
#

$RecordSize = 16;
$ObjectCount = 0;

print "Input file is: $ARGV[0]\n";
print "Output file is: $ARGV[1]\n";

open(INPUT,  "<$ARGV[0]") || die "Error opening $ARGV[0]: $!\n";
open(OUTPUT, ">$ARGV[1]") || die "Error opening $ARGV[1]: $!\n";
binmode(INPUT);
binmode(OUTPUT);

while (read(INPUT, $Record, $RecordSize)) {

    # Unpack the record: signed size, unsigned address, time_sec, time_usec.
    @Array = unpack("iIll", $Record);

    if ($ObjectTable{$Array[1]} != 0) {

        # This was a live object, make sure we're freeing it.
        if ($Array[0] != -1) {
            print "Error in data, allocating already allocated object.\n";
            exit(1);
        } else {
            # Undo shift by 1 ...
            $index = $ObjectTable{$Array[1]} - 1;

            $Output = pack("iill", $index, $Array[0],
                           $Array[2], $Array[3]);

            # Mark as available
            $ObjectTable{$Array[1]} = 0;
        }
    } else {
        if ($Array[0] == -1) {
            printf "Attempting to free new object: %x\n", $Array[1];
            printf "Table value was: %x\n", $ObjectTable{$Array[1]};
        }

        $Output = pack("iill", $ObjectCount, $Array[0],
                       $Array[2], $Array[3]);

        # Save away the index for later (shifted by 1 because nil value == 0)
        $ObjectTable{$Array[1]} = $ObjectCount + 1;

        $ObjectCount++;
    }
    $RecordCount++;

    # Write out the record
    print OUTPUT $Output;

}



Mon, 19 Aug 1996 07:29:07 GMT  
 efficiency in data-processing perl script ...
: I haven't written much Perl, and what I have written has been for things
: where speed isn't much of an issue.  Recently, though, I have been doing
: some work where I have to process a few hundred MB of data in chunks
: of up to (so far :-) 170MB.
:
: One of my scripts runs a lot slower than I would like, and I think it is
: because it isn't growing its hash table enough to avoid linear-search
: behavior.

In general that can't happen, unless someone broke your hash function.

The only possibility I can think of would point to a problem in your
malloc.  I deduce from your calls to binmode that you may be running
on something other than Unix.  It could be that your malloc won't let
you allocate anything larger than 64K.

The most buckets it could get in 64K would be 16384, assuming 4 bytes
per bucket pointer.  If the max allocation is only 65535, you might
only get 8192.  Still, that's a lot of buckets.  You don't say how many
simultaneous objects you're dealing with.

The linear bucket overflow scan is pretty fast, since it mostly just
compares hash values, and only compares strings if the 32-bit hash
values match.  If your overflow chains are in the neighborhood of 3 or
4 long, you aren't losing much.  If they're 30 or 40, you're getting
into trouble.

If you print scalar(%ObjectTable), it should say how many buckets it's
using out of how many allocated.  Now if you get a bucket usage of
1/16384, you know your hash function is broken.  You should get a ratio
much closer to 1 than 0.
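
For instance, a one-line check like this (using the %ObjectTable hash
from the script above; under the Perl of this era, a hash in scalar
context yields "used/allocated" bucket counts):

        # Prints something like "4096/8192"; a ratio near 0 means
        # the hash function is misbehaving for these keys.
        print STDERR "Bucket usage: ", scalar(%ObjectTable), "\n";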

: Basically, the script loops over the contents of a file reading binary
: records. Every record is either a malloc or a free (these are malloc
: traces) for an address.  If it's a free, then I look up the stuff from
: when I saw that address as a malloc and write output.  I use an associative
: array to do the mapping from addresses to data values.
:
: This just takes a lot longer than I think it should and the thing never uses
: more than 1MB of space even though I shove huge traces through it with
: many "objects" being alive at the same time (so I expect the hash table
: to have to be pretty big).
:
: Any hints/help appreciated!

You may be getting eaten up by conversions.  Since you're using an
associative array, it would be faster to leave the key as a 4-byte
string, and never convert it to a number unless you need to print out
an error message.  (If you leave the key as a 4-byte string, you have
the added advantage that it doesn't have to work as hard to compute the
hash function.)  You might gain some more by not unpacking the last two
values to two longs, when all you do is pack them together again later.
Keeping those together as a single string might be faster.  If you can,
don't unpack them at all, but fetch them out of $Record at the last
moment, or better, don't fetch them out at all, but turn $Record into what
you want by assigning the first two values to substr($Record,0,8).
Or don't even bother with that.  Put the print statements directly
into the conditionals, and avoid the extra assignments.

It's also a little bit inefficient to write

        $ObjectTable{$Array[1]} = $ObjectCount + 1;
        $ObjectCount++;

when you could just write

        $ObjectTable{$Array[1]} = ++$ObjectCount;
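
Putting those suggestions together, the inner loop might look something
like this (an untested sketch; variable names follow the original script,
and the lvalue substr() overwrites the identity and size fields in place
while the two time fields are never unpacked at all):

        while (read(INPUT, $Record, $RecordSize)) {
            ($Size) = unpack("i", $Record);    # only the size is converted
            $Addr = substr($Record, 4, 4);     # key stays a raw 4-byte string

            if ($ObjectTable{$Addr} != 0) {
                # Live object: this had better be a free.
                if ($Size != -1) {
                    printf "Error: allocating live object %x\n",
                           unpack("I", $Addr); # convert only for the message
                    exit(1);
                }
                # Undo the shift by 1, then mark the slot as available.
                substr($Record, 0, 8) =
                    pack("ii", $ObjectTable{$Addr} - 1, $Size);
                $ObjectTable{$Addr} = 0;
            } else {
                substr($Record, 0, 8) = pack("ii", $ObjectCount, $Size);
                $ObjectTable{$Addr} = ++$ObjectCount;  # stores identity + 1
            }
            print OUTPUT $Record;    # time fields ride along unchanged
        }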

Hope some of this helps.

Larry



Tue, 20 Aug 1996 03:25:22 GMT  
 efficiency in data-processing perl script ...
Part of your problem might be that you never delete anything
from your associative array.

Rather than doing

>    if ($ObjectTable{$Array[1]} != 0) {
>        # This was a live object, make sure we're freeing it.

try
        if (defined($ObjectTable{$Array[1]})) {
                # this object is live

This also allows you to use the real value, instead of shifting
by 1 to ensure that 0 is magic (which is a minor detail).

Then to "mark as available" you just delete the array entry.
Instead of

>            # Mark as available
>            $ObjectTable{$Array[1]} = 0;

you write

# Remove from the table of active entries
                delete($ObjectTable{$Array[1]});

This way your associative array only ever has as many entries
in it as there are active malloc regions.
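
With both changes, the bookkeeping shrinks to something like this
(an untested sketch, using the original script's names; the identity
can now be stored directly, with no shift by 1):

        if (defined($ObjectTable{$Array[1]})) {
            # Live object: this must be a free.
            if ($Array[0] != -1) {
                print "Error in data, allocating already allocated object.\n";
                exit(1);
            }
            $Output = pack("iill", $ObjectTable{$Array[1]}, $Array[0],
                           $Array[2], $Array[3]);
            delete($ObjectTable{$Array[1]});  # remove from active entries
        } else {
            $Output = pack("iill", $ObjectCount, $Array[0],
                           $Array[2], $Array[3]);
            $ObjectTable{$Array[1]} = $ObjectCount++;
        }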

- David



Fri, 23 Aug 1996 22:07:19 GMT  
 