Files and Hashes 
 Files and Hashes

Hello,

I'm looking for a way to compare files in Perl without actually comparing
them line by line.  Since I have to compare a file to about 200 others, I'm
looking for some sort of hash function that will take any file and create a
unique hash value.  Then to compare my files, I just have to see if the hash
value already exists in my database.

Does anyone know of such a hash function for files that will guarantee a
unique hash for any file?

Thank you.
Himani Naresh

_______________________________________________________________________
Himani Naresh
Computer Science Department
Columbia University

God put me on earth to accomplish a certain number of things.
Right now I am so far behind, I will never die.
-- Calvin and Hobbes



Fri, 27 Apr 2001 03:00:00 GMT  
 Files and Hashes
1998-11-10-03:47:04 Himani Naresh:

Quote:
>I'm looking for a way to compare files in Perl without actually comparing
>them line by line.  Since I have to compare a file to about 200 others, I'm
>looking for some sort of hash function that will take any file and create a
>unique hash value.  Then to compare my files, I just have to see if the hash
>value already exists in my database.

>Does anyone know of such a hash function for files that will guarantee a
>unique hash for any file?

There's such a thing as a "perfect hash function", but one has to be
constructed from a fixed, known set of keys ahead of time, so it's not
applicable here.

The closest you're gonna come is with something like MD5, a
cryptographically-strong hash function large enough that a collision is
stunningly improbable.

As long as the size of the hash function is way smaller than the size of the
files, the pigeonhole principle guarantees that hash functions _won't_ be
unique --- multiple different files will have the same hash. But if the hash
function is strong (a crypto hash should be --- the constraints on a good
crypto hash are a superset of the needs for this problem) then the odds of
_finding_ such a pair of files are one in 2**hash-size. MD5 is 128 bits.

A popular solution for people who care to worry about a one-in-2**128
chance of collision is to compare hashes, and if they match, go back and
compare the files byte by byte.

I prefer to ignore that chance; I regard it as slimmer than other failure
modes for my code, such as the boolean controlling the "did it match"
spontaneously changing from true to false due to hardware failure.
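The hash-then-verify approach can be sketched in Perl with the Digest::MD5
and File::Compare modules; the file_md5 helper and the %seen index are names
invented for this example:

```perl
use strict;
use Digest::MD5;
use File::Compare qw(compare);

# MD5 digest of a file's contents, read in binary mode.
sub file_md5 {
    my ($path) = @_;
    open my $fh, '<', $path or die "$path: $!";
    binmode $fh;
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $digest;
}

# Index files by digest; when digests collide, re-check byte for
# byte (compare() returns 0 for identical files) to cover the
# one-in-2**128 case.
my %seen;
for my $file (@ARGV) {
    my $digest = file_md5($file);
    if (exists $seen{$digest} and compare($file, $seen{$digest}) == 0) {
        print "$file matches $seen{$digest}\n";
    } else {
        $seen{$digest} = $file;
    }
}
```

The byte-for-byte recheck is strictly optional, for exactly the reason
given above.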

-Bennett



Sat, 28 Apr 2001 03:00:00 GMT  
 Files and Hashes


Quote:
>Hello,

>I'm looking for a way to compare files in Perl without actually comparing
>them line by line.  Since I have to compare a file to about 200 others, I'm
>looking for some sort of hash function that will take any file and create a
>unique hash value.  Then to compare my files, I just have to see if the hash
>value already exists in my database.

>Does anyone know of such a hash function for files that will guarantee a
>unique hash for any file?

There are programs which do this very thing.  'cksum' is one that is
found on UNIX systems.  There's not really a summing function in Perl
like what you need; however, here's one way to do it line by line...

open(F1, "file1") or die "file1: $!";
open(F2, "file2") or die "file2: $!";
my $same = 1;
while (defined(my $line1 = <F1>)) {
    my $line2 = <F2>;
    next if defined $line2 and $line1 eq $line2;
    $same = 0; last;
}
$same = 0 if $same and defined(scalar <F2>);   # file2 has extra lines
close F1;
close F2;
print "miscompare\n" unless $same;

This works, but might take a while with big files....

With cksum, you could simply do this:

print "miscompare\n"
  if ((split(/\s+/,`cksum $file1`))[0] ne
      (split(/\s+/,`cksum $file2`))[0]);

cksum prints '<cksum> <number of bytes> <file name>', and the split breaks
that line up so the two checksums can be compared.

or something similar...

Harlin!



Sat, 28 Apr 2001 03:00:00 GMT  
 Files and Hashes

Quote:

> As long as the size of the hash function is way smaller than the size of the
> files, the pigeonhole principle guarantees that hash functions _won't_ be
> unique --- multiple different files will have the same hash. But if the hash
> function is strong (a crypto hash should be --- the constraints on a good
> crypto hash are a superset of the needs for this problem) then the odds of
> _finding_ such a pair of files are one in 2**hash-size. MD5 is 128 bits.

> A popular solution for people who care to worry about a one-in-2**128
> chance of collision is to compare hashes, and if they match, go back and
> compare the files byte by byte.

Or to compute and compare several hashes/checksums.

First something really fast (like the sizes of the files...), then
something slightly slower but still fast like the usual CRC-style
checks (Perl has a built-in checksum; take a look at the "%" feature
of unpack()), then maybe MD5 (which should be pretty good), and if
even that fails maybe other checksum/hash routines like SHA or Snefru.
If your files are *big* it pays to avoid comparing the files unless
really necessary.
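That tiered scheme might look like this in Perl; probably_same and slurp
are illustrative names, and the unpack('%32C*', ...) checksum stands in for
the CRC stage:

```perl
use strict;
use Digest::MD5 qw(md5_hex);

# Read a whole file into one scalar, binary-safe.
sub slurp {
    my ($path) = @_;
    open my $fh, '<', $path or die "$path: $!";
    binmode $fh;
    local $/;               # undef the record separator: read it all
    return scalar <$fh>;
}

# Cheapest test first; fall through to MD5 only when the earlier
# stages agree.
sub probably_same {
    my ($f1, $f2) = @_;
    return 0 if -s $f1 != -s $f2;            # stage 1: file sizes
    my ($d1, $d2) = (slurp($f1), slurp($f2));
    return 0 if unpack('%32C*', $d1) != unpack('%32C*', $d2);  # stage 2
    return md5_hex($d1) eq md5_hex($d2) ? 1 : 0;               # stage 3
}
```

For truly huge files the slurp defeats the point; a streaming
Digest::MD5->addfile would replace it, but the tiering stays the same.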

Quote:
> I prefer to ignore that chance; I regard it as slimmer than other failure
> modes for my code, such as the boolean controlling the "did it match"
> spontaneously changing from true to false due to hardware failure.

True.

Quote:
> -Bennett

--
$jhi++; # http://www.iki.fi/~jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen


Sun, 29 Apr 2001 03:00:00 GMT  
 