Tied hash not scaling - advice? 
 Tied hash not scaling - advice?

I have a series of scripts designed to help our operations center deal
with spam complaints.  As part of this, they create an audit trail.  These
are arranged like: <auditdir>/6.29.99/209.162.144.3.blocked.  

I've got a CGI tool that can search for a particular IP in the audit trail.
As time went by, using File::Find for this got really slow, unsurprisingly.
So, I learned about tied hashes, and made an index file.  All was well,
and it was blazingly fast.

Much more time has passed, and now it's completely broken.  And when I
look at the index file:
-rw-r--r--   1 kirbyk   user     594935808 Jun 28 13:55 audit.index

Yikes!  That's one big file!

I've deleted it and regenerated it, and it's still doing this.

So, clearly, I need to change something.  I'm not sure if there's an
easy solution, or if I'll have to dig in and run something like a
MySQL database for the backend.  It's really useful to be able to pull
up the reason we're blocking someone when an angry ISP is on the other
line. :-)

Here's the (pared down slightly) code. It worked, once upon a time:
#! /usr/local/bin/perl -w

use File::Find;
use POSIX;
use DB_File;

$auditDir  = "/home/netbuild/postmaster/spam/audit";
$indexFile = "$auditDir/audit.index";

if (-e $indexFile) {
    # If the index file already exists, don't start from scratch --
    # only index today's directory, e.g. <auditdir>/6.29.99.
    @date = localtime(time);    # needed below; presumably lost in the paring-down
    $date[4]++;                 # localtime months are 0-based
    $auditDir .= "/$date[4].$date[3].$date[5]";
}

tie %files, "DB_File", $indexFile, O_CREAT|O_RDWR, 0666, $DB_HASH
    or die "Cannot tie $indexFile: $!";

find(\&add_file, $auditDir);

untie %files;
exit 0;

sub add_file {
    $file = $File::Find::name;
    return if -d $file;
    return if $file eq $indexFile;

    # Strip the path down to the filename, then pull the IP out of it.
    ($site) = ($file =~ /.*\/.*\/(.*)$/);
    ($ip)   = ($site =~ /(\d+\.\d+\.\d+\.\d+).*/);
    return if !$ip;

    # Append this file to the IP's list unless it's already there;
    # \Q...\E keeps the dots in the path from acting as wildcards.
    $fileList = $main::files{$ip};
    return if $fileList && $fileList =~ /\Q$file\E/;
    $fileList .= " $file";
    $main::files{$ip} = $fileList;
}

--

<*> Lips that taste of tears, they say, are the best for kissing - D. Parker


Sat, 15 Dec 2001 03:00:00 GMT  
 Tied hash not scaling - advice?

Quote:

>I've got a CGI tool that can search for a particular IP in the audit trail.
>As time went by, using File::Find for this got really slow, unsurprisingly.
>So, I learned about tied hashes, and made an index file.  All was well,
>and it was blazingly fast.

>Much more time has passed, and now it's completely broken.  And when I
>look at the index file:
>-rw-r--r--   1 kirbyk   user     594935808 Jun 28 13:55 audit.index

>Yikes!  That's one big file!

>I've deleted it and regenerated it, and it's still doing this.

>So, clearly, I need to change something.  I'm not sure if there's an
>easy solution, or if I'll have to dig in and run something like a
>MySQL database for the backend.  It's really useful to be able to pull
>up the reason we're blocking someone when an angry ISP is on the other
>line. :-)

[snip]

Quote:
>    $fileList = $main::files{$ip};
>    return if $fileList && $fileList =~ /\Q$file\E/;
>    $fileList .= " $file";
>    $main::files{$ip} = $fileList;

At a guess, you are using Berkeley DB 1.x. I've provoked core dumps
using 1.x and code similar to the lines above - in addition to having
it swallow the disk wholesale (and wastefully). That was just
doing something like

 for (my $count=1;$count<10000;$count++) {
        my $value = ' ' x $count;
        $hash{'0'} = $value;
 }

It exploded into the tens of megabytes and core dumped at slightly over
5000 on my machine.  Berkeley 1.x *DOESN'T LIKE* progressively extending
the length of values.  The same program on 2.x used only a few dozen
kilobytes of storage and ran perfectly.

The best fix is to get Berkeley DB 2.x installed.  Workarounds could
include 'pre-allocating' large records, or making *two* passes - one
to determine the size of each needed record and a second to do the
actual load, with the storage pre-allocated to the final size and
'substr' used to slot in each record (sketched below).
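
Here's a minimal sketch of that two-pass load, assuming the original
script's layout (one space-separated list of paths per IP).  The
@records list is an illustrative stand-in for the pairs a first walk
of the audit tree would collect, and the duplicate check is omitted:

#!/usr/local/bin/perl
use strict;
use DB_File;
use Fcntl;

my $indexFile = 'audit.index';    # illustrative path
my @records   = ();               # hypothetical [$ip, $file] pairs

# Pass 1: total up the final length of each IP's value.
my %needed;
$needed{ $_->[0] } += length(" $_->[1]") for @records;

# Write each value once, pre-allocated at its final size.
my %files;
tie %files, 'DB_File', $indexFile, O_CREAT|O_RDWR, 0666, $DB_HASH
    or die "tie $indexFile: $!";
$files{$_} = ' ' x $needed{$_} for keys %needed;

# Pass 2: slot each path into its pre-sized record with substr, so
# a stored value is rewritten at the same length but never grows.
my %offset;
for my $rec (@records) {
    my ($ip, $file) = @$rec;
    my $value = $files{$ip};                 # fetch a copy
    substr($value, $offset{$ip} || 0, length(" $file")) = " $file";
    $offset{$ip} = ($offset{$ip} || 0) + length(" $file");
    $files{$ip} = $value;                    # store at the same length
}
untie %files;

Since every value is created at its final size up front, later stores
replace a record of the same length instead of progressively growing it.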

Another possibility is to use 'Search::InvertedIndex' from CPAN.
Its 'preload' methods *don't* progressively grow the records and
so shouldn't tickle the bug in Berkeley DB 1.x.  And it runs in
N log N time for loads, where your approach runs in N^2 time.

--
Benjamin Franz



Sun, 16 Dec 2001 03:00:00 GMT  
 Tied hash not scaling - advice?

Quote:
>Much more time has passed, and now it's completely broken.  And when I
>look at the index file:
>-rw-r--r--   1 kirbyk   user     594935808 Jun 28 13:55 audit.index
>sub add_file {

>    $file = $File::Find::name;
>    return if -d $file;
>    return if $file eq $indexFile;
>    ($site) = ($file =~ /.*\/.*\/(.*)$/);
>    ($ip) = ($site =~ /(\d+\.\d+\.\d+\.\d+).*/);
>    return if !$ip;
>    $fileList = $main::files{$ip};
>    return if $fileList && $fileList =~ /\Q$file\E/;
>    $fileList .= " $file";
>    $main::files{$ip} = $fileList;

>}

From my own (limited) experience with this one, I think there is a bug
in certain DB_File databases when the values are repeatedly modified by
extending their length.  It seems that the db file then grows by
appending the new value to its on-disk tables without recycling the
(sizeable) disk space the old value occupied in the file!

There ought to be some way to force a 'database compaction' for this
'garbage collection' problem, but I do not know how.  You might try
copying the database record by record to another (new) tied DB_File
hash and seeing what size the resulting new file is (and comparing
the contents of the old and new files) - something like the sketch
below.
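
A minimal sketch of that record-by-record copy (the filenames here
are illustrative; substitute the real index path):

#!/usr/local/bin/perl
use strict;
use DB_File;
use Fcntl;

my ($old, $new) = ('audit.index', 'audit.index.compacted');

my (%src, %dst);
tie %src, 'DB_File', $old, O_RDONLY,       0666, $DB_HASH or die "tie $old: $!";
tie %dst, 'DB_File', $new, O_CREAT|O_RDWR, 0666, $DB_HASH or die "tie $new: $!";

# Each key is written to the new file exactly once, at its final
# length, so the old file's growth history is not replayed.
while (my ($key, $value) = each %src) {
    $dst{$key} = $value;
}
untie %src;
untie %dst;

# Compare sizes (and spot-check contents) before swapping the files.
printf "old: %d bytes, new: %d bytes\n", -s $old, -s $new;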

---
The above from: address is spamblocked. Use wherrera (at) lynxview (dot) com for the reply address.



Sun, 16 Dec 2001 03:00:00 GMT  
 Tied hash not scaling - advice?

[snip...]

Quote:
>At a guess, you are using Berkeley DB 1.x. I've provoked core dumps
>using 1.x and code similar to the lines above - in addition to having
>it swallow the disk wholesale (and wastefully). That was just
>doing something like

> for (my $count=1;$count<10000;$count++) {
>    my $value = ' ' x $count;
>    $hash{'0'} = $value;
> }

>It exploded into the tens of megabytes and core dumped at slightly over
>5000 on my machine.  Berkeley 1.x *DOESN'T LIKE* progressively extending
>the length of values.  The same program on 2.x used only a few dozen
>kilobytes of storage and ran perfectly.

>The best fix is to get Berkeley DB 2.x installed.  Workarounds could
>include 'pre-allocating' large records, or making *two* passes - one
>to determine the size of each needed record and a second to do the
>actual load, with the storage pre-allocated to the final size and
>'substr' used to slot in each record.

>Another possibility is to use 'Search::InvertedIndex' from CPAN.
>Its 'preload' methods *don't* progressively grow the records and
>so shouldn't tickle the bug in Berkeley DB 1.x.  And it runs in
>N log N time for loads, where your approach runs in N^2 time.

I've recently encountered the bug in Berkeley DB 1.x, and it is
actually tickled by *deleting* overflow records.  Overflow records
can be created either by a key/data pair too large to fit in the
current record, or by a large number of keys with the same hash.
Unfortunately, updating a record with the same key internally consists
of a delete followed by an add (look inside the function hash_access()
in the file hash.c).  This means that with any large Berkeley DB 1.x
file, it's just a matter of time before the bug is tickled.  The only
work-around I've seen for 1.x is to only write new records and never
delete or update existing ones (an insert-only layout is sketched
below).  That's fine if you create DB files from flat files for speed
of access and always recreate the DB file whenever the flat file
changes, but it's useless for more general purposes.  I strongly urge
you to upgrade to Berkeley DB 2.x or GNU gdbm.
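
A minimal sketch of such an insert-only layout for the audit index,
assuming a composite-key scheme is acceptable (the "<ip>:<n>" keys
and the sub names here are hypothetical, not from the original script):

#!/usr/local/bin/perl
use strict;
use DB_File;
use Fcntl;

# Hypothetical scheme: key "<ip>:<n>" holds one file path per record,
# so existing records are never updated or deleted - only new keys
# are ever added.
my %files;
tie %files, 'DB_File', 'audit.index', O_CREAT|O_RDWR, 0666, $DB_HASH
    or die "tie: $!";

sub add_path {
    my ($ip, $path) = @_;
    my $i = 0;
    while (exists $files{"$ip:$i"}) {
        return if $files{"$ip:$i"} eq $path;   # already recorded
        $i++;
    }
    $files{"$ip:$i"} = $path;                  # brand-new record only
}

sub paths_for {
    my ($ip) = @_;
    my @paths;
    for (my $i = 0; exists $files{"$ip:$i"}; $i++) {
        push @paths, $files{"$ip:$i"};
    }
    return @paths;
}

Lookups stay cheap (a handful of fetches per IP), and since no record
is ever updated or deleted, the delete path described above should
never be exercised.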

Later,
John Cochran



Sun, 16 Dec 2001 03:00:00 GMT  
 Tied hash not scaling - advice?

Quote:

>The best fix is to get Berkeley DB 2.x installed.

Is this something that CPAN.pm can do?  If so, which CPAN module
installs DB_File with db 2.0?

---
The above from: address is spamblocked. Use wherrera (at) lynxview (dot) com for the reply address.



Mon, 17 Dec 2001 03:00:00 GMT  
 Tied hash not scaling - advice?

Quote:



>>The best fix is to get Berkeley DB 2.x installed.

>Is this something that CPAN.pm can do?  If so, which CPAN module
>installs DB_File with db 2.0?

No.  When you install Perl, it automatically builds DB_File against
the installed DBs.  To get BDB 2.x, you need to go to
<URL:http://www.sleepycat.com/> and get the latest from there.

*Then* re-install DB_File from CPAN (or, if you aren't running the
latest stable Perl, 5.005_03, upgrade your Perl - it will link
against your newly installed BDB 2.x during the install).

One note - BDB 2.x is *not* compatible with BDB 1.x by default.
Pay attention to the options for 1.85 compatibility if you have
old Berkeley DBs you want to keep the data from.
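
One way to check which library DB_File actually linked against after
the rebuild - it exposes the library version at run time:

#!/usr/local/bin/perl -w
use DB_File;

# $DB_File::db_version is the version of the Berkeley DB library that
# DB_File was compiled against (1.85 for BDB 1.x, 2.x.y afterwards).
print "DB_File $DB_File::VERSION, Berkeley DB $DB_File::db_version\n";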

--
Benjamin Franz



Tue, 18 Dec 2001 03:00:00 GMT  
 