Need advice on a project (wrt tie'ing to a file and general strategy)
 Need advice on a project (wrt tie'ing to a file and general strategy)

A few years ago I wrote a shell script to find duplicates in a directory
tree and remove them (keeping only the first one in alphabetical order).

My script, though gross and primitive, has worked reliably for me for
a long time, but now I'd like to rewrite it in Perl with some
enhancements, one difference being that it should also run (and
possibly mainly run) on Windows.

Let me explain how it worked first. Necessary conditions for two
files to be identical are:
(i) they have the same size,
(ii) they have the same checksum.

I assume that (i) AND (ii) together are also a sufficient condition,
even though strictly speaking that's not true; that is, I don't mind
the extremely rare case of different files having both the same size
and the same checksum.

My script worked by calculating (how is a detail) the sizes of the
files in the directory tree, sorting by size, and comparing checksums
only for files with the same size. To be precise, for what it's worth,
I used the md5sum program to calculate the checksums.

I used the above strategy because calculating a checksum is always far
more time-consuming than reading a file's size.
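
For illustration, here is a minimal Perl sketch of that size-first
grouping (File::Find is a core module; the checksum step itself is the
subject of the question below):

    use strict;
    use warnings;
    use File::Find;

    # Group every regular file under the given directories by its size.
    my %by_size;
    find( sub {
        return unless -f $_;
        push @{ $by_size{ -s _ } }, $File::Find::name;
    }, @ARGV ? @ARGV : '.' );

    # Only sizes shared by two or more files can possibly hide duplicates,
    # so only those groups need checksums at all.
    my @candidates = grep { @{ $by_size{$_} } > 1 } keys %by_size;
    printf "%d sizes are shared by more than one file\n", scalar @candidates;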

So, the first question regards how to calculate checksums with perl:
do I have to resort to external programs (thus, possibly, having to
use a Windows port of a *nix one for compatibility) or is there a
(recommended) module suitable for this task?

Also, I know (more than) one way to write such a program, but an
important consideration to take into account is that I should now run
it on huge (for me) "data sets" (that is, on the order of 10^5 files).

I know from the FAQs, etc. that one solution is to tie to a file
(whatever structure I end up using: array, hash...), but there are many
modules designed for that. Which one do you recommend? I'd prefer a
KISS solution; I know that some of them use database libraries
transparently, but that seems like overhead to me.

In case I decide to use a DB-oriented module, are there any
portability issues wrt the abovementioned libraries (e.g. are they
available, and do they work reliably, on "all" platforms)?
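
For concreteness, something like the following is what I have in mind,
using the core SDBM_File module purely as an illustration (it ships
with Perl, Windows included; the cache file name and value layout here
are made up):

    use strict;
    use warnings;
    use Fcntl;        # O_RDWR, O_CREAT
    use SDBM_File;

    # Tie a hash to an on-disk DBM file; 'checksum_cache' is a made-up name.
    tie my %cache, 'SDBM_File', 'checksum_cache', O_RDWR | O_CREAT, 0666
        or die "Cannot tie cache file: $!";

    # SDBM values are plain strings, so pack size and checksum into one scalar.
    $cache{'/some/file'} = join ':', 12345, 'd41d8cd98f00b204e9800998ecf8427e';
    my ($size, $sum) = split /:/, $cache{'/some/file'}, 2;

    untie %cache;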

Also, I would like to cache the size/checksum data once they have been
calculated, so as to "include them in the list" on subsequent runs of
the program. Note that I'm aware this strategy won't catch all
duplicates from one run to the next, but I'm sure it would nevertheless
be an enhancement for my purposes...

Michele
--
Liberta' va cercando, ch'e' si' cara,
Come sa chi per lei vita rifiuta.
           [Dante Alighieri, Purg. I, 71-72]

I am my own country - United States Confederate of Me!
           [Pennywise, "My own country"]



Tue, 05 Apr 2005 08:14:31 GMT  
 Need advice on a project (wrt tie'ing to a file and general strategy)

...

Quote:
> So, the first question regards how to calculate checksums with perl:
> do I have to resort to external programs (thus, possibly, having to
> use a Windows port of a *nix one for compatibility) or is there a
> (recommended) module suitable for this task?

Check out Digest::MD5.

Quote:

> Also, I know (more than) one way to write such a program, but an
> important consideration to take into account is that I should now run
> it on huge (for me) "data sets" (that is, on the order of 10^5 files).

Check out the addfile method of Digest::MD5.  It is probably about as
good as you will do for efficiency.
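
A minimal sketch of the object-oriented interface with addfile(), which
reads the file in chunks rather than slurping it (md5_of_file is just
an illustrative name):

    use strict;
    use warnings;
    use Digest::MD5;

    sub md5_of_file {
        my ($path) = @_;
        open my $fh, '<', $path or die "Cannot open $path: $!";
        binmode $fh;                 # matters on Windows
        return Digest::MD5->new->addfile($fh)->hexdigest;
    }

    print md5_of_file($ARGV[0]), "\n";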

...


--
Bob Walton


Wed, 06 Apr 2005 01:03:51 GMT  
 Need advice on a project (wrt tie'ing to a file and general strategy)


Quote:
> So, the first question regards how to calculate checksums with perl:
> do I have to resort to external programs (thus, possibly, having to
> use a Windows port of a *nix one for compatibility) or is there a
> (recommended) module suitable for this task?

As pointed out in the previous post, there's Digest::MD5 on CPAN for your
needs. However, this beast is rather slow (not because of Perl; it's the
algorithm), but it is good for bringing the probability of a collision
between two distinct files close to zero.
If you don't mind the small chance of two distinct files having the same
size and checksum, and would like to speed up your checksum computation
significantly, you might consider CRC32. Correct me if I'm wrong, but the
collision probability of CRC32 is 2^(-32), which is approximately one in
4 billion. That might be good enough for your application.
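
For example, a sketch using the String::CRC32 module from CPAN
(Digest::CRC is an alternative); note that this simple version slurps
the whole file into memory:

    use strict;
    use warnings;
    use String::CRC32;

    sub crc32_of_file {
        my ($path) = @_;
        open my $fh, '<', $path or die "Cannot open $path: $!";
        binmode $fh;
        local $/;                    # slurp the whole file
        return crc32(<$fh>);         # returns a 32-bit unsigned integer
    }

    printf "%08x\n", crc32_of_file($ARGV[0]);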

Quote:

> Also, I know (more than) one way to write such a program, but an
> important consideration to take into account is that I should now run
> it on huge (for me) "data sets" (that is, on the order of 10^5 files).

If you'd like to keep things in memory, you might want to estimate how
much memory you need to hold the whole thing. This varies widely with
the average length of the filenames, the average length of the
file-size strings, and the number of files (I assume that you'd like to
keep a hash keyed on the file size, with the filenames as values), plus
a fair amount of memory for Perl itself and for the checksum
computation:

    estimate = number of files
               * (average filename length + average length of the file-size string)
               + 30% extra (Perl and checksum computation)
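
For a rough example of that formula, assuming 10^5 files with an
average path length of about 60 bytes and file-size strings of about 7
characters (both figures are only assumptions for illustration), the
raw data comes to roughly 10^5 * (60 + 7) ≈ 6.7 MB, and adding 30%
gives a ballpark of about 9 MB.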



Wed, 06 Apr 2005 06:41:39 GMT  
 Need advice on a project (wrt tie'ing to a file and general strategy)


Quote:
>As pointed out in the previous post, there's Digest::MD5 on CPAN for your
>needs. However, this beast is rather slow (not because of Perl; it's the
>algorithm), but it is good for bringing the probability of a collision
>between two distinct files close to zero.

Don't laugh, but I seem to remember (I'm not sure myself) that I
actually chose md5sum for some "ease of use", namely its fixed-length
output compared to, e.g., 'cksum'.

Quote:
>If you don't mind the small chance of two distinct files having the same
>size and checksum, and would like to speed up your checksum computation
>significantly, you might consider CRC32. Correct me if I'm wrong, but the
>collision probability of CRC32 is 2^(-32), which is approximately one in
>4 billion. That might be good enough for your application.

Indeed, it would be good enough for my application. Now, is there a
CRC32 module for Perl? (OK, I'll search CPAN...)

Quote:
>If you'd like to keep things in memory, you might want to estimate how
>much memory you need to hold the whole thing. This varies widely with
>the average length of the filenames, the average length of the
>file-size strings, and the number of files (I assume that you'd like to
>keep a hash keyed on the file size, with the filenames as values), plus
>a fair amount of memory for Perl itself and for the checksum computation.

I was rather thinking of a hash keyed on the file size, with values
being references to arrays whose first entries are the filenames and
whose second entries, if present, are the checksums.

And I would keep those entries that already have a checksum, setting
the filename to "" (multiple entries thus collapsing into a single
one), for subsequent use, as I explained in my previous post.
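
To make that concrete, here is a minimal sketch of one possible reading
of that structure, with Storable (a core module) as the cache
mechanism; the 'dupes.cache' file name is made up:

    use strict;
    use warnings;
    use Storable qw(store retrieve);

    # %files is keyed on file size; each value is a reference to an array
    # of [filename, checksum] pairs (checksum undef until computed).
    my %files;
    push @{ $files{12345} }, [ '/path/a', undef ];
    push @{ $files{12345} }, [ '/path/b', 'd41d8cd98f00b204e9800998ecf8427e' ];

    # Between runs, keep only entries whose checksum is known, dropping the
    # filename so that identical size/checksum pairs collapse into one entry.
    my %cache;
    for my $size (keys %files) {
        for my $entry (@{ $files{$size} }) {
            my ($name, $sum) = @$entry;
            $cache{"$size:$sum"} = 1 if defined $sum;
        }
    }
    store \%cache, 'dupes.cache';      # write to disk ...
    my $seen = retrieve 'dupes.cache'; # ... and read back on the next run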

Quote:
>    estimate = number of files
>               * (average filename length + average length of the file-size string)
>               + 30% extra (Perl and checksum computation)

In one particular test, I got a total of (length of filenames + length
of the file-size strings) = 6334 KB.

Michele
--
Liberta' va cercando, ch'e' si' cara,
Come sa chi per lei vita rifiuta.
           [Dante Alighieri, Purg. I, 71-72]

I am my own country - United States Confederate of Me!
           [Pennywise, "My own country"]



Thu, 07 Apr 2005 09:01:59 GMT  
 