Need advice on a project (wrt to tie'ing to a file and general strategy) 
 Need advice on a project (wrt to tie'ing to a file and general strategy)

Last week I posted the article pasted hereafter. Since I haven't
received much feedback, I'm posting it again. Sorry for any
disturbance this might cause...

### OP
A few years ago I wrote a shell script to find duplicates in a
directory tree and remove them (keeping only the 1st one wrt
alphabetical order).

My script, though gross and primitive, has worked reliably for me for
a long time, but now I'd like to rewrite it in Perl with some
enhancements, one difference being that it should also run (and
possibly, mainly) on Windows.

Let me explain how it worked first. Necessary conditions for two
files to be identical are
(i) they have the same size,
(ii) they have the same checksum.

I treat (i) AND (ii) as a sufficient condition too, even though
strictly speaking it isn't; that is, I don't mind the extremely rare
case of two different files having both the same size and the same
checksum.

My script worked by calculating (how is a detail) the sizes of the
files in the directory tree, sorting by size and comparing the
checksums only of files with the same size. To be precise, for what it
matters, I used the md5sum program to calculate the checksums.

I used the above strategy because calculating a checksum is always
more time-consuming than getting a file's size.
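
Roughly, the same strategy in Perl would look something like this
sketch (illustrative only -- File::Find assumed, names made up):

use strict;
use warnings;
use File::Find;

# Group files by size in a single pass; only sizes that occur more
# than once need any checksumming at all.
my %files_by_size;
find( sub {
    return unless -f;                                   # plain files only
    push @{ $files_by_size{ -s _ } }, $File::Find::name;
}, '.' );

for my $size ( keys %files_by_size ) {
    my @candidates = @{ $files_by_size{$size} };
    next if @candidates < 2;          # unique size => cannot be a duplicate
    # ... checksum each file in @candidates and compare the checksums ...
}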

So, the first question regards how to calculate checksums with perl:
do I have to resort to external programs (thus, possibly, having to
use a Windows port of a *nix one for compatibility) or is there a
(recommended) module suitable for this task?

Also, I know (more than) one way to make such a program, but an
important consideration to take into account is that now I should run
it on huge (for me) "data sets" (that is of the size of 10^5 files).

I know from FAQs, etc. that a solution is to tie to a file (whatever I
will use: array, hash...), but there are many modules designed for
that. Which one do you recommend? I'd prefer a KISS solution, that is,
I know that some of them use some sort of database libraries
transparently, but it seems to me an overhead.

In case I decide to use a DB-oriented module, are there any
portability issues wrt the abovementioned libraries (e.g. are they
available and work reliably on "all" platforms?).

Also, I would like to cache the data relative to size/checksums when
they are calculated, so to "include them on the list" when running the
program subsequent times. Notice that I'm aware this strategy won't
catch all duplicates between one run and the other, but I'm sure it
would be nevertheless an enhancement for my purposes...
### END OF OP

For completeness' sake, I also paste here my answer to the one reply I
actually got:

### ANSWER


Quote:
>As pointed out in the previous post, there's Digest::MD5 on CPAN for your
>need. However, this beast is very slow (not because of Perl, it's the
>algorithm), but good to bring the probability of collision of two distinct
>files close to zero.

Don't laugh, but I seem to remember (I'm not sure myself) that I had
actually chosen md5sum for some "ease of use" due to its fixed-length
output compared to, e.g., 'cksum'.

Quote:
>If you don't mind the chance of two distinct files with the same filename
>and checksum, and like to speed up your checksum computation significantly,
>you might consider using CRC32. Correct me if I'm wrong, the collision
>probability of CRC32 is 2^(-32) which is approximately one out of 4 billion
>trials. Might be good enough for your application.

Indeed it would be good enough for my application. Now is there a
CRC32 module for perl? (OK, I'll search CPAN...)

Quote:
>If you like to keep things in memory, you might want to give an estimate on
>how much memory you need to hold the whole thing. This varies widely
>depending on the average length of the filenames, average length of string
>of size of files and the number of files. (I assume that you'd like to keep
>a hash keyed on the filesize with values being filenames), plus a fair

I was rather thinking of a hash keyed on the filesize, with values
being references to arrays whose first entries are filenames and whose
second entries, if present, are the checksums.

And I would keep the entries that have a checksum, setting the
filename to "" (multiple entries thus collapsing into a single one),
for subsequent use, as I explained in the previous post.

Quote:
>estimate = number of files * (average length of filenames + average length
>of string of size of files) + 30 % extra (Perl and checksum computation).

In one particular test, the total sum of (length of filenames + length
of the file-size strings) came to 6334 KB.
### END OF ANSWER

Michele
--
Liberta' va cercando, ch'e' si' cara,
Come sa chi per lei vita rifiuta.
           [Dante Alighieri, Purg. I, 71-72]

I am my own country - United States Confederate of Me!
           [Pennywise, "My own country"]



Fri, 08 Apr 2005 21:41:04 GMT  
 Need advice on a project (wrt to tie'ing to a file and general strategy)

Quote:

> Last week I posted the article pasted hereafter. Since I haven't
> received much feedback,

If the feedback answers the question, then a single followup is "enough".

Did the previous feedback not answer your question?

Quote:
> I'm posting it again.

Why?

If what was suggested there won't work for you, then share why
it won't work for you.

Quote:
> Sorry for any
> disturbance this might cause...

What do you hope to gain this time that didn't happen the first time?

Quote:

>>If you don't mind the chance of two distinct files with the same filename
>>and checksum, and like to speed up your checksum computation significantly,
>>you might consider using CRC32. Correct me if I'm wrong, the collision
>>probability of CRC32 is 2^(-32) which is approximately one out of 4 billion
>>trials. Might be good enough for your application.

> Indeed it would be good enough for my application.

Looks like your problem has already been solved.

Is there something else that you meant to mention this time around?

--
    Tad McClellan                          SGML consulting

    Fort Worth, Texas



Fri, 08 Apr 2005 22:09:19 GMT  
 Need advice on a project (wrt to tie'ing to a file and general strategy)

Quote:

> Last week I posted the article pasted hereafter. Since I haven't
> received much feedback, I'm posting it again. Sorry for any
> disturbance this might cause...

Parts of your post were answered.  Instead of reposting, you ought to
rewrite your post, with just the parts that you didn't get answers on.

[snip]

Quote:
> So, the first question regards how to calculate checksums with perl:
> do I have to resort to external programs (thus, possibly, having to
> use a Windows port of a *nix one for compatibility) or is there a
> (recommended) module suitable for this task?

This was answered.

Quote:
> Also, I know (more than) one way to make such a program, but an
> important consideration to take into account is that now I should run
> it on huge (for me) "data sets" (that is of the size of 10^5 files).

Store the data in a database -- either one created with a tied hash,
using one of the modules that comes with perl (DB_File or SDBM_File) or
using a "real" database, with DBI, and one of access, or oracle, or
SQLite, etc.

Quote:
> I know from FAQs, etc. that a solution is to tie to a file (whatever I
> will use: array, hash...), but there are many modules designed for
> that. Which one do you recommend?

Whichever one is easiest to use, and will work on your data.

You probably already have SDBM_File or DB_File installed -- if these
work, then use them.
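
For example, a minimal sketch of a tied hash with SDBM_File (which
ships with perl; DB_File works the same way -- the file and key names
here are just illustrative):

use strict;
use warnings;
use Fcntl;
use SDBM_File;

# The tied hash behaves like an ordinary hash, but its contents live
# on disk in 'checksums.*' and persist between runs.
tie my %checksums, 'SDBM_File', 'checksums', O_RDWR | O_CREAT, 0666
    or die "Couldn't tie checksums: $!";

$checksums{'/some/path/file.txt'} = 'd41d8cd98f00b204e9800998ecf8427e';
print $checksums{'/some/path/file.txt'}, "\n";

untie %checksums;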

Quote:
> I'd prefer a KISS solution, that is, I know that some of them use some
> sort of database libraries transparently, but it seems to me an
> overhead.

> In case I decide to use a DB-oriented module, are there any
> portability issues wrt the abovementioned libraries (e.g. are they
> available and work reliably on "all" platforms?).

Read perldoc AnyDBM_File, and scroll down a bit.

Quote:
> Also, I would like to cache the data relative to size/checksums when
> they are calculated, so to "include them on the list" when running the
> program subsequent times.

That's easy enough -- don't delete the file that you tie your hash to,
or if using a real database, don't make your table(s) temporary.

Quote:
> Notice that I'm aware this strategy won't catch all duplicates between
> one run and the other, but I'm sure it would be nevertheless an
> enhancement for my purposes...

Why wouldn't it?

Suppose that you implement your remove-duplicates thing as follows:

use File::Find;
use Digest::MD5;
use DB_File;   # or SDBM_File, or whatever.
use Fcntl;

tie my(%hash), "DB_File", "digests", ....;

# Remove modified items from the cache.
while( my ($digest, $tf) = each %hash ) {
   my ($timestamp, $filename) = unpack("Na*", $tf);
   delete $hash{$digest}
      unless -e $filename and $timestamp == (stat _)[9];
}

# identify duplicates.
find( { no_chdir => 1, wanted => sub {
   return unless -f;
   open( my ($fh), "<", $_ ) or warn("$_ : $!"), return;
   binmode $fh;
   my $digest = Digest::MD5->new->addfile($fh)->digest;
   my $tf = pack("Na*", scalar((stat $fh)[9]), $_);
   close $fh;
   if( my $old = $hash{$digest} ) {
      return if $tf eq $old;
      $hash{$digest} = $tf, return
         if $_ eq (my $f = substr $old, 4);
      print "$_ and $f have the same MD5 checksum\n"; # unlink($_)?
   } else {
      $hash{$digest} = $tf;
   }
} }, $path );

[untested]

The no_chdir means I can just use $_ instead of $File::Find::name
everywhere.

There's no need to check for lengths being the same, since MD5 checksums
also mix in the length of the data.

I stat $fh instead of $_ because it's usually faster to stat a
filehandle than a filename.  This may differ on other systems than mine,
though.

--
my $n = 2; print +(split //, 'e,4c3H r ktulrnsJ2tPaeh'
."\n1oa! er")[map $n = ($n * 24 + 30) % 31, (42) x 26]



Sat, 09 Apr 2005 05:12:01 GMT  
 Need advice on a project (wrt to tie'ing to a file and general strategy)

Quote:

>So, the first question regards how to calculate checksums with perl:
>do I have to resort to external programs (thus, possibly, having to
>use a Windows port of a *nix one for compatibility) or is there a
>(recommended) module suitable for this task?

I use Digest::MD5 for this purpose. Much more reliable than a simple
CRC32 checksum, because it's a 128-bit checksum instead of a 32-bit one.

General:        <http://search.cpan.org/author/GAAS/Digest-MD5-2.20/>
Win32:  <http://www.activestate.com/PPMpackages/5.6plus/>

The latter is version 2.16, the former 2.20.
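
A minimal usage sketch (the file name is only an example):

use strict;
use warnings;
use Digest::MD5;

# Compute the MD5 digest of a single file; binmode matters on Windows.
my $file = 'example.dat';
open my $fh, '<', $file or die "Can't open $file: $!";
binmode $fh;
my $md5 = Digest::MD5->new->addfile($fh)->hexdigest;
close $fh;
print "$file $md5\n";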

Quote:
>In case I decide to use a DB-oriented module, are there any
>portability issues wrt the abovementioned libraries (e.g. are they
>available and work reliably on "all" platforms?).

A tied hash (a Berkeley-style DB) is IMO an excellent choice for this kind
of application. See `perldoc AnyDBM_File` for a generic intro. In
general, the database itself cannot be moved to another platform,
because of binary incompatibilities, but the data is meaningless on
another platform anyway. At least the code will be portable.

--
        Bart.



Sat, 09 Apr 2005 08:25:06 GMT  
 Need advice on a project (wrt to tie'ing to a file and general strategy)

Quote:

>Did the previous feedback not answer your question?

It didn't answer all of my questions.

Quote:
>> I'm posting it again.

>Why?

Because I still have doubts.

Quote:
>If what was suggested there won't work for you, then share why
>it won't work for you.

Because there's nothing that actually "won't work" for me; I just
haven't got enough details yet to make up my mind.

Quote:
>> Sorry for any
>> disturbance this might cause...

>What do you hope to gain this time that didn't happen the first time?

more answers?

Seriously, I apologize once more for my behaviour, but I actually got
further (and more useful, IMHO) information from two other posts in
this thread.

Michele
--
Liberta' va cercando, ch'e' si' cara,
Come sa chi per lei vita rifiuta.
           [Dante Alighieri, Purg. I, 71-72]

I am my own country - United States Confederate of Me!
           [Pennywise, "My own country"]



Sun, 10 Apr 2005 07:29:03 GMT  
 Need advice on a project (wrt to tie'ing to a file and general strategy)


Quote:
>A tied hash (a Berkeley-style DB) is IMO an excellent choice for this kind
>of application. See `perldoc AnyDBM_File` for a generic intro. In
>general, the database itself cannot be moved to another platform,
>because of binary incompatibilities, but the data is meaningless on
>another platform anyway. At least the code will be portable.

Sorry, but I'm really ignorant here. I thought that to use the DB_File
module I had to get a system library and was not sure if it were
available under non-*nix systems.

Actually I was surprised to find that Mrs Activestate's ppm gently
provided me with a ready to use dll...

However, I can accept that the database cannot be moved to another
platform (what else can I do?), but I really can't understand the
reason why. That is, for example, jpeg is a binary format too, but it
can be written both on unices and on Windoze and on many other OSes!

So, why isn't there a cross-platform db format? Also, what does it
mean that "data is meaningless on another platform anyway": in my case
I would be dealing with paths, numbers and checksums. Nothing too
exotic or platform-dependent (directory separator apart, of course,
but not too much also in this case for what regards perl) IMHO.

Michele
--
Liberta' va cercando, ch'e' si' cara,
Come sa chi per lei vita rifiuta.
           [Dante Alighieri, Purg. I, 71-72]

I am my own country - United States Confederate of Me!
           [Pennywise, "My own country"]



Sun, 10 Apr 2005 07:29:05 GMT  
 Need advice on a project (wrt to tie'ing to a file and general strategy)
On Tue, 22 Oct 2002 00:12:01 -0400, Benjamin Goldberg

Quote:

>Store the data in a database -- either one created with a tied hash,
>using one of the modules that comes with perl (DB_File or SDBM_File) or
>using a "real" database, with DBI, and one of access, or oracle, or
>SQLite, etc.

I'm currently experimenting with DB_File. This is an 'easy' enough
solution, from a newbie's point of view.

I still have some questions: as you might have read I had thought of
using a hash whose values are array references. In any case it seems
that I can't have a tied hash of arrays, can I?!? In connection with
this problem, would it be better to use a "real" database?

Quote:
>> Notice that I'm aware this strategy won't catch all duplicates between
>> one run and the other, but I'm sure it would be nevertheless an
>> enhancement for my purposes...

>Why wouldn't it?

Most probably I wasn't clear. I was thinking of an algorithm taking
checksums only of files with the same size. Thus if I run it today it
will catch all duplicates in one particular basedir; then I will
"process" those files somehow, but I will keep track of the pairs
(size, checksum) for those files for which the checksum was
calculated. Tomorrow I will run it again and I'll catch tomorrow's
duplicates along with files that were already duplicates today.

But my algorithm won't spot that a certain file that I have today is
the same as one that I will get tomorrow if that particular file isn't
a duplicate today. However it suffices for me and is suitable for the
application I need it for.

Quote:
>Suppose that you implement your remove-duplicates thing as follows:

[code snipped]

I'll have to look at the code carefully and it will take some time.
For sure it will give some insight and some ideas.

Quote:
>There's no need to check for lengths being the same, since MD5 checksums
>also mix in the length of the data.

My original, gross, shell script used the algorithm roughly described
above because taking a file size is *much* faster than taking a
checksum and two files having different sizes is a sufficient
condition for them to be different (mathematically, it's an
"invariant", isn't it?).

I must add that, more or less by chance, I found the
File::Find::Duplicates module. For the moment I've neither tried it
nor looked at its code (it could be hard, nay, it *will* be
hard), but the manpage says: "it returns a hash, keyed on filesize, of
lists of the identical files of that size", so the basic idea must be
the same as mine (and this makes me somewhat proud). Only I *think*
that the files are compared directly, because there's no mention of
checksums...

Quote:
>I stat $fh instead of $_ because it's usually faster to stat a
>filehandle than a filename.  This may differ on other systems than mine,
>though.

Useful to know...

Michele
--
Liberta' va cercando, ch'e' si' cara,
Come sa chi per lei vita rifiuta.
           [Dante Alighieri, Purg. I, 71-72]

I am my own country - United States Confederate of Me!
           [Pennywise, "My own country"]



Sun, 10 Apr 2005 07:29:07 GMT  
 Need advice on a project (wrt to tie'ing to a file and general strategy)
On Wed, 23 Oct 2002 08:29:05 +0200,

Quote:


>>A tied hash (a Berkeley-style DB) is IMO an excellent choice for this kind
>>of application. See `perldoc AnyDBM_File` for a generic intro. In
>>general, the database itself cannot be moved to another platform,
>>because of binary incompatibilities, but the data is meaningless on
>>another platform anyway. At least the code will be portable.

> Sorry, but I'm really ignorant here. I thought that to use the DB_File
> module I had to get a system library and was not sure if it were
> available under non-*nix systems.

> Actually I was surprised to find that Mrs Activestate's ppm gently
> provided me with a ready to use dll...

> However, I can accept that the database cannot be moved to another
> platform (what else can I do?), but I really can't understand the
> reason why. That is, for example, jpeg is a binary format too, but it
> can be written both on unices and on Windoze and on many other OSes!

The DB library uses platform specifics in the database. Things like
the endianness of the machine, and the size of ints. That means a
database created on one machine isn't necessarily readable on another.

Quote:
> So, why isn't there a cross-platform db format? Also, what does it
> mean that "data is meaningless on another platform anyway": in my case
> I would be dealing with paths, numbers and checksums. Nothing too
> exotic or platform-dependent (directory separator apart, of course,
> but not too much also in this case for what regards perl) IMHO.

One reason *DBM isn't made cross-platform is that the performance
trade-off (no longer being able to directly copy data from the file
to memory) isn't considered worth the benefit. After all, you can
just read the data and write it out in a transport format (text, for
example) and create a *DBM database on the target machine. Well,
that's what I think anyway; not being the designer/author, I'm just guessing.

--
Sam Holden



Sun, 10 Apr 2005 09:02:37 GMT  
 Need advice on a project (wrt to tie'ing to a file and general strategy)

Quote:

>I still have some questions: as you might have read I had thought of
>using a hash whose values are array references. In any case it seems
>that I can't have a tied hash of arrays, can I?!?

Yes you can. There's a module especially for this, I forgot the name...
something with 5 letters. MLDBM? Yup, that appears to be it.

        <http://search.cpan.org/search?query=mldbm&mode=all>

The idea is that the multilevel data, like this array in the hash, is
"serialized", i.e. turned into a plain string. Modules like Storable and
Data::Dumper can be used for that. That way, the result is just a
string, which can be saved in the old, familiar way. In their bare form,
you can use these modules to save complex Perl data structures in an
ordinary file. MLDBM uses one of these behind the scenes, so you don't
have to worry about the details.

Hey, there's a thought: use Storable to save the entire (ordinary) hash
in a binary file! You need enough memory to keep all the data in memory
at once, but that may be feasible. And with the n* functions to store
the data, i.e. nstore(), you can even get portable files.
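
A rough sketch of that Storable idea (file name and data are made up;
nstore() writes in network byte order, which is what makes the file
portable):

use strict;
use warnings;
use Storable qw(nstore retrieve);

my $cache_file = 'dupcache.stor';

# Reload the cache from a previous run, if any; otherwise start empty.
my %by_size = -e $cache_file ? %{ retrieve($cache_file) } : ();

# %by_size maps a file size to an arrayref of [filename, checksum] pairs.
push @{ $by_size{12345} }, [ '/some/file', 'd41d8cd98f00b204e9800998ecf8427e' ];

# Save the whole hash back in portable (network order) form.
nstore(\%by_size, $cache_file);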

--
        Bart.



Sun, 10 Apr 2005 10:22:41 GMT  
 Need advice on a project (wrt to tie'ing to a file and general strategy)

Quote:

>> However, I can accept that the database cannot be moved to another
>> platform (what else can I do?), but I really can't understand the
>> reason why. That is, for example, jpeg is a binary format too, but it
>> can be written both on unices and on Windoze and on many other OSes!

>The DB library uses platform specifics in the database. Things like
>the endianness of the machine, and the size of ints. That means a
>database created on one machine isn't necessarily readable on another.

That's right. You can blame C, the language the database engine is
written in, for that. C tends to use pretty abstract names for its data
types, like int, which are physically stored in whatever way is most
efficient for the machine, but with no guarantees at all about
portability. So all bets on endianness, or on the number of bits per
integer, are off.

It is often possible to compile a database engine to use a portable file
format, but with a bad impact on efficiency. Speed could easily halve.

--
        Bart.



Sun, 10 Apr 2005 10:27:02 GMT  
 Need advice on a project (wrt to tie'ing to a file and general strategy)

Quote:

> On Tue, 22 Oct 2002 00:12:01 -0400, Benjamin Goldberg

> >Store the data in a database -- either one created with a tied hash,
> >using one of the modules that comes with perl (DB_File or SDBM_File)
> >or using a "real" database, with DBI, and one of access, or oracle,
> >or SQLite, etc.

> I'm currently experimenting with DB_File. This is an 'easy' enough
> solution, from a newbie's point of view.

> I still have some questions: as you might have read I had thought of
> using a hash whose values are array references. In any case it seems
> that I can't have a tied hash of arrays, can I?!?

You can and you can't.  You can't, in that if you simply store an
arrayref in a tied database, it gets stringified as something like
"ARRAY(0x12345678)", and of course you can't go from this string back to
an array (especially not between seperate executions of perl).  You can,
in that if you install a filter (see perldoc perldbmfilter), you can
have your arrayrefs automatically serialized when stored in the database
and deserialized when retrieved.

Depending on what's in those array references you mentioned, serializing
can be easy or hard.

If they're strings (or numbers and strings), and you know that there
are no "\0" characters in any of them, you could do:

(tied %hash)->filter_store_value(sub { $_ = join "\0", @$_ });
(tied %hash)->filter_fetch_value(sub { $_ = [split /\0/, $_, -1] });

If all the elements of each arrayref are numbers, you could do:

(tied %hash)->filter_store_value(sub { $_ = pack "N*", @$_ });
(tied %hash)->filter_fetch_value(sub { $_ = [unpack "N*", $_] });

If these arrays might contain more complicated structures, Storable.pm
and its freeze() and thaw() subroutines might be appropriate.
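
For the Storable case, the filters might look something like this
sketch (the database file name is illustrative; see perldoc
perldbmfilter):

use strict;
use warnings;
use DB_File;
use Fcntl;
use Storable qw(freeze thaw);

tie my %hash, 'DB_File', 'digests.db', O_RDWR | O_CREAT, 0666, $DB_HASH
    or die "Couldn't tie digests.db: $!";

# Serialize arrayref values on the way in, rebuild them on the way out.
(tied %hash)->filter_store_value(sub { $_ = freeze($_) });
(tied %hash)->filter_fetch_value(sub { $_ = thaw($_) });

$hash{1024} = [ '/some/file', 'cafebabe' ];   # stored as a frozen string
print "@{ $hash{1024} }\n";                   # comes back as an arrayref

untie %hash;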

Quote:
> In connection with this problem, would it be better to use a "real"
> database?

And how would a "real" database cope with the problem of storing arrays?

Quote:
> >> Notice that I'm aware this strategy won't catch all duplicates
> >> between one run and the other, but I'm sure it would be
> >> nevertheless an enhancement for my purposes...

> >Why wouldn't it?

> Most probably I wasn't clear. I was thinking of an algorithm taking
> checksums only of files with the same size.

Why?  It seems to me that you would have to keep a list of all the
different possible sizes, then.  Also, it sounds like you'd first need
to do one pass to find out what each of the different sizes are, and
then, for each different size, another pass to deal with (find checksums
of) all files of that particular size.  This sounds like many, many,
passes through the filesystem.

If you just make one pass through the filesystem, calculating checksums,
that should be sufficient.

Quote:
> Thus if I run it today it will catch all duplicates in one particular
> basedir; then I will "process" those files somehow, but I will keep
> track of the pairs (size, checksum) for those files for which the
> checksum was calculated. Tomorrow I will run it again and I'll catch
> tomorrow's duplicates along with files that were already duplicates
> today.

> But my algorithm won't spot that a certain file that I have today is
> the same as one that I will get tomorrow if that particular file isn't
> a duplicate today. However it suffices for me and is suitable for the
> application I need it for.

This seems silly -- my algorithm will find *all* duplicates... well,
unless files are added or removed or changed while it's running, in
which case it might get confused, but that's unavoidable.

Quote:
> >Suppose that you implement your remove-duplicates thing as follows:

> [code snipped]

> I'll have to look at the code carefully and it will take some time.
> For sure it will give some insight and some ideas.

> >There's no need to check for lengths being the same, since MD5
> >checksums also mix in the length of the data.

> My original, gross, shell script used the algorithm roughly described
> above because taking a file size is *much* faster than taking a
> checksum and two files having different sizes is a sufficient
> condition for them to be different (mathematically, it's an
> "invariant", isn't it?).

Hmm.  I think I see what you're saying -- if there's only one file of a
particular size, then you can be certain that it's not a duplicate of
any other file.

However, for every size for which there's more than one file of that
size, you have to calculate a checksum of all of those files.  So
really, the only point of keeping track of file sizes is that little "if
there's only one file of this size" optimization.

Not that it's a bad optimization, of course, but it doesn't seem like
there's any way to use this and do only one pass through the filesystem.

Ok, here's what I would add to my prior code:

my %sizes;
# First pass, group by size.
find( { no_chdir => 1, wanted => sub {
   return unless -f;
   ++ $sizes{pack "N", -s _};
} }, $path );

# Forget sizes seen only once; a file with a unique size can't be a duplicate.
while( my ($size, $count) = each %sizes ) {
   delete $sizes{$size} if $count == 1;
}

# identify duplicates.
find( { no_chdir => 1, wanted => sub {
   return unless -f;
   return unless exists $sizes{pack "N", -s _};
   .... everything else the same from here onward ...

--
my $n = 2; print +(split //, 'e,4c3H r ktulrnsJ2tPaeh'
."\n1oa! er")[map $n = ($n * 24 + 30) % 31, (42) x 26]



Mon, 11 Apr 2005 07:07:11 GMT  
 Need advice on a project (wrt to tie'ing to a file and general strategy)
On Thu, 24 Oct 2002 02:07:11 -0400, Benjamin Goldberg

Quote:

>> I still have some questions: as you might have read I had thought of
>> using a hash whose values are array references. In any case it seems
>> that I can't have a tied hash of arrays, can I?!?

>You can and you can't.  You can't, in that if you simply store an
>arrayref in a tied database, it gets stringified as something like
>"ARRAY(0x12345678)", and of course you can't go from this string back to
>an array (especially not between separate executions of perl).  You can,

This is exactly what I meant by "I can't".

Quote:
>in that if you install a filter (see perldoc perldbmfilter), you can
>have your arrayrefs automatically serialized when stored in the database
>and deserialized when retrieved.

>Depending on what's in those array references you mentioned, serializing
>can be easy or hard.

Easy, in my case.

Quote:
>> Most probably I wasn't clear. I was thinking of an algorithm taking
>> checksums only of files with the same size.

>Why?  It seems to me that you would have to keep a list of all the
>different possible sizes, then.  Also, it sounds like you'd first need
>to do one pass to find out what each of the different sizes are, and
>then, for each different size, another pass to deal with (find checksums
>of) all files of that particular size.  This sounds like many, many,
>passes through the filesystem.

>If you just make one pass through the filesystem, calculating checksums,
>that should be sufficient.

Well, my old, inefficient and error-prone script used only one pipe
and one pass through the filesystem. The pass was done once and for all
with 'find' (printing filesizes directly), then it sorted by filesize
and uniq'ed for printing duplicate entries only. Then checksums were
calculated on these entries and the stream was sorted again by the
checksums and uniq'ed for printing duplicates once again. And that's
all...

Quote:
>> My original, gross, shell script used the algorithm roughly described
>> above because taking a file size is *much* faster than taking a
>> checksum and two files having different sizes is a sufficient
>> condition for them to be different (mathematically, it's an
>> "invariant", isn't it?).

>Hmm.  I think I see what you're saying -- if there's only one file of a
>particular size, then you can be certain that it's not a duplicate of
>any other file.

>However, for every size for which there's more than one file of that
>size, you have to calculate a checksum of all of those files.  So
>really, the only point of keeping track of file sizes is that little "if
>there's only one file of this size" optimization.

Yes, that's it. And in my experience this was a good optimization,
that is, without it, the process always took much more time, even if I
never did any quantitative verification about this.

Quote:
>Not that it's a bad optimization, of course, but it doesn't seem like
>there's any way to use this and do only one pass through the filesystem.

Well, as I implicitly (through an example) said above, in a certain
sense you CAN do only one pass, at the expense of storing the
filenames (with paths).

Quote:
>Ok, here's what I would add to my prior code:

Thanks, as in the previous case, it will take some time for me to
examine the provided code: what is elementary for you may be advanced
for me. And I'm rather busy these days, D'Oh!

In the meantime I managed to give a peek into File::Find::Duplicates.
Quite surprisingly, it's not difficult at all to understand, so I have
plenty of sources to take inspiration from...

In the meanwhile I also ran the duplicate-finding function from the
above-mentioned module and, contrary to what I had expected, it
worked without going out of memory. So, after all, the tie strategy
was not needed at all for this task, but I'm happy to have begun
exploring it!

Also, I won't use File::Find::Duplicates, because
(1) It returns a hash keyed on the filesize. But each array (ref)
associated to a key might contain more than one group of duplicate
files.
(2) I have to do other, massive, "operations" on the files and I don't
want to browse through the directory tree twice. In this respect
tieing might become a strict necessity.

Michele
--
Liberta' va cercando, ch'e' si' cara,
Come sa chi per lei vita rifiuta.
           [Dante Alighieri, Purg. I, 71-72]

I am my own country - United States Confederate of Me!
           [Pennywise, "My own country"]



Tue, 12 Apr 2005 08:58:14 GMT  
 Need advice on a project (wrt to tie'ing to a file and general strategy)


Quote:

>>The DB library uses platform specifics in the database. Things like
>>the endianness of the machine, and the size of ints. That means a
>>database created on one machine isn't necessarily readable on another.

>That's right. You can blame C, which is used to compile the database
>engine for that. C tends to use pretty abstract names for its data
>types, like int(), which physically are stored in the for the machine
>most efficient way, but with no garantees at all about portability. So
>all bets on endianness, or number of bits per integer, are off.

>It is often possible to compile a database engine to use a portable file
>format, but with a bad impact on efficiency. Speed could easily halve.

Thank you both for your answers; I was making assumptions about this
that were too naive. However, it would be nice to have the option of
choosing a less efficient, though portable, database file format. For
example, under Linux I have my vfat FS's mounted and could access the
database with no conversion...

Michele
--
Liberta' va cercando, ch'e' si' cara,
Come sa chi per lei vita rifiuta.
           [Dante Alighieri, Purg. I, 71-72]

I am my own country - United States Confederate of Me!
           [Pennywise, "My own country"]



Tue, 12 Apr 2005 08:58:22 GMT  
 Need advice on a project (wrt to tie'ing to a file and general strategy)

Quote:

> On Thu, 24 Oct 2002 02:07:11 -0400, Benjamin Goldberg

> >> I still have some questions: as you might have read I had thought
> >> of using a hash whose values are array references. In any case it
> >> seems that I can't have a tied hash of arrays, can I?!?

> >You can and you can't.  You can't, in that if you simply store an
> >arrayref in a tied database, it gets stringified as something like
> >"ARRAY(0x12345678)", and of course you can't go from this string back
> >to an array (especially not between separate executions of perl).

> This is exactly what I meant by "I can't".

> >You can, in that if you install a filter (see perldoc perldbmfilter),
> >you can have your arrayrefs automatically serialized when stored in
> >the database and deserialized when retrieved.

> >Depending on what's in those array references you mentioned,
> >serializing can be easy or hard.

> Easy, in my case.

> >> Most probably I wasn't clear. I was thinking of an algorithm taking
> >> checksums only of files with the same size.

> >Why?  It seems to me that you would have to keep a list of all the
> >different possible sizes, then.  Also, it sounds like you'd first
> >need to do one pass to find out what each of the different sizes are,
> >and then, for each different size, another pass to deal with (find
> >checksums of) all files of that particular size.  This sounds like
> >many, many, passes through the filesystem.

> >If you just make one pass through the filesystem, calculating
> >checksums, that should be sufficient.

> Well, my old, inefficient and error-prone script used only one pipe
> and one pass through the filesystem. The pass was done once and for all
> with 'find' (printing filesizes directly), then it sorted by filesize
> and uniq'ed for printing duplicate entries only. Then checksums were
> calculated on these entries and the stream was sorted again by the
> checksums and uniq'ed for printing duplicates once again. And that's
> all...

Well, if you're storing *all* of the filenames, then discarding non-uniq
ones, then I suppose that works, too.

But you're still making multiple passes through the filenames, just not
through the filesystem.


Quote:
> >> My original, gross, shell script used the algorithm roughly
> >> described above because taking a file size is *much* faster than
> >> taking a checksum and two files having different sizes is a
> >> sufficient condition for them to be different (mathematically, it's
> >> an "invariant", isn't it?).

> >Hmm.  I think I see what you're saying -- if there's only one file of
> >a particular size, then you can be certain that it's not a duplicate
> >of any other file.

> >However, for every size for which there's more than one file of that
> >size, you have to calculate a checksum of all of those files.  So
> >really, the only point of keeping track of file sizes is that little
> >"if there's only one file of this size" optomization.

> Yes, that's it. And in my experience this was a good optimization,
> that is, without it, the process always took much more time, even if I
> never did any quantitative verification about this.

> >Not that it's a bad optimization, of course, but it doesn't seem like
> >there's any way to use this and do only one pass through the
> >filesystem.

> Well, as I implicitly (through an example) said above, in a certain
> sense you CAN do only one pass, at the expense of storing the
> filenames (with paths).

Many algorithms have time/memory tradeoffs... you can do it fast,
consuming lots of memory, or slower, using less memory.  Perl often
picks the "throw memory at the problem" solution.  Storing the filenames
instead of refetching them thus sounds like a rather perlish solution :)


Quote:
> >Ok, here's what I would add to my prior code:

> Thanks, as in the previous case, it will take some time for me to
> examine the provided code: what is elementary for you may be advanced
> for me. And I'm rather busy these days, D'Oh!

> In the meantime I managed to give a peek into File::Find::Duplicates.
> Quite surprisingly, it's not difficult at all to understand, so I have
> plenty of sources to take inspiration from...

> In the meanwhile I also ran the duplicate-finding function from the
> above-mentioned module and, contrary to what I had expected, it
> worked without going out of memory. So, after all, the tie strategy
> was not needed at all for this task, but I'm happy to have begun
> exploring it!

> Also, I won't use File::Find::Duplicates, because
> (1) It returns a hash keyed on the filesize. But each array (ref)
> associated to a key might contain more than one group of duplicate
> files.

With one small change to the subroutine, this can be changed.

From this:

To this:

Quote:
> (2) I have to do other, massive, "operations" on the files and I don't
> want to browse through the directory tree twice. In this respect
> tieing might become a strict necessity.

--
my $n = 2; print +(split //, 'e,4c3H r ktulrnsJ2tPaeh'
."\n1oa! er")[map $n = ($n * 24 + 30) % 31, (42) x 26]


Tue, 12 Apr 2005 19:54:04 GMT  
 