detect similar files, via signatures, sketches, shingles etc 
Author Message
 detect similar files, via signatures, sketches, shingles etc

I am interested in detecting when two different files are similar to
one another.  There are quite a few research papers published on this
topic, and a variety of techniques.  Generally they involve creating
for each file its set (really bag) of signatures.  A signature is the
hash value for a string of characters, or sequence of words,
perhaps after stop word removal, conversion to lowercase, stemming,
stripping html tags, etc.  In some systems each successive character
or word causes a new signature; this is called shingling.  In other
systems there is no overlap.  A signature is typically 4 to 8 bytes.

The next phase involves keeping only some of the signatures,
e.g. about 800 bytes per file.  The set of kept signatures is
sometimes called a sketch.  Some systems keep all signatures that are
0 modulo some fixed parameter k, but this produces a sketch which gets
larger as the file size gets larger.  Other systems choose a fixed
number of signatures.  Some systems also keep a checksum, e.g. MD5,
for the entire file, to permit detecting exact duplicates.

Finally we compare one file to the others, or all files against each
other.  Those pairs over a certain threshold of similarity are
reported.  Some systems cluster together the set of files that are all
similar to one another.  If the earlier phases are done right,
e.g. the various randomizations are random enough, then the similarity
of two sketches is an unbiased estimate of the similarity of the two
files.  Some systems also/instead estimate containment of one document
in the other.  Some systems exclude very common signatures, as caused
by the prolog in a postscript file, or boilerplate legalese, because
in practice this improves the human perceived correctness of the

I am looking for a existing system that does this, or at least some of
the components out of which I could build this.  I am flexible about
almost all the details.  I only want to apply this to my disk and LAN,
not the web, and so certain extreme performance requirements that
apply to e.g. AltaVista are not an issue.  Some systems are concerned
with plagiarism detection, and so try hard to be secure (i.e. not
spoofed by adding/changing a few words), but that too is not an issue.


P.S. There might be a perl module named File::Signature, e.g.
describes it as "Heuristics for file recognition".
Unfortunately when I click on the link
http://www.*-*-*.com/ ::Signature
I get "Module 'File::Signature' not found"
In any case this is not what I am looking for, as per this email at

> ... nobody has yet registered the m{*filter*}
> equivalent of file(1) in Perl (/etc/magic and all that).
> Therefore and thusly, I propose
> Name            DSLI  Description                             Info
> -------------   ----  --------------------------------------- -----
> File::Signature cdpf  Heuristics for file recognition         JHI


Fidelity Investments   82 Devonshire St. R24D    Boston MA 02109
There is nothing so practical as a good theory.  Comments are by me,
not Fidelity Investments, its subsidiaries or affiliates.

Tue, 18 Feb 2003 04:08:30 GMT  
 [ 1 post ] 

 Relevant Pages 

1. Need help creating lookup

2. SQL "IN" clause in BDE 2.52

3. Detect Java, JavaScript Enabled via Perl

4. Detect Java, JavaScript Enabled via Perl

5. Change password in /etc/shadow via script

6. How do i create new users(/etc/passwd) via a form

7. MD5 signature - file timestamp?

8. A plea for less goofy signature files!

9. Signature files using perl?

10. Wanted: Pascal X-Modem source or DLL

11. Turbo Pascal 7 Exec Command Trouble


Powered by phpBB® Forum Software