Hashing Function: Anyone know a good one? 
Author Message
 Hashing Function: Anyone know a good one?

I have a Database project that I am working on.

The project parses large LOG files which contain references to other files
that contain potential Y2k problems.
(Multiple log files can be parsed n a single scan [wildcards can be used])

The LOG files can contain upto 50000 of these file references, and therefore
need to be parsed quickly.

We are using a C++ Com DLL to do the parsing, and i have added an interface
to generate the FileKey as a double

I need a hashfunction that genereated a unique number based on three pieces
of information:

1) A unique number generated for each LOG file.
2) A number representing the type of file ( 1,2 or 3)
3) the path of the file.

if a record with those exact same pieces of information is inserted at a
later date, the insert should fail (therefore, using a time stamp wont work)

Some examples for a single log file could be as follows:

3013    ,    2    ,    'd:\bc5\include\win16\colordlg.h'
3013    ,    2    ,    'd:\bc5\include\win16\error.h'
3013    ,    1    ,    'd:\bc5\include\win16\inl.h'
3013    ,    2    ,    'd:\bc5\include\win16\color15.h'
3013    ,    3    ,    'd:\bc5\include\win16\apitext.h'
3013    ,    2    ,    'd:\bc5\include\win16\values.h'

And the next log file would be something like

3014    ,    2    ,    'd:\bc5\include\win16\colordlg.h'
3014    ,    2    ,    'd:\bc5\include\win16\error.h'
3014    ,    3    ,    'd:\bc5\include\win16\inl.h'
3014    ,    1    ,    'd:\bc5\include\win16\color15.mdb'
3014    ,    3    ,    'd:\bc5\include\win16\apitext.h'
3014    ,    2    ,    'd:\bc5\include\win16\values.h'

So potentially the many of the paths could be very similar (except for the
last 10 or 20 characters.)

i realize that if the Logfile Id (the 3014) is the most significant, then it
decreases the chances of a collision, but there is still potentially
50000 -100000 files per logfile id.  Since this is a DB application though,
I dont care about even distribution, I just need a unique way of identifying
those three pieces of information.

I've tried some different methods, but none of them seemed to work!

PLEASE HELP ME!
===================

(Mail Me Your Replies if possible)
Ryan



Sun, 12 Nov 2000 03:00:00 GMT  
 Hashing Function: Anyone know a good one?

[snip]

Quote:
> I need a hashfunction that genereated a unique number based on three pieces
> of information:

> 1) A unique number generated for each LOG file.
> 2) A number representing the type of file ( 1,2 or 3)
> 3) the path of the file.

> if a record with those exact same pieces of information is inserted at a
> later date, the insert should fail (therefore, using a time stamp wont work)

At this point I begin to suspect, that you do not have the same
definition of hashing, that is commenly accosiated with the term.
What you want is one unique key for each unique object, in order
to identify that object. This is not what hashing does.

Hashing calculates a number for a given object, but this number is
not necessarily unqiue. Two disctinct objects may produce the same
hash key. That's why hashing always involves resolving clashes.
The quality of a hash function is judged by the statistical uniquenes
of the hash key.

[snip]

Quote:
> i realize that if the Logfile Id (the 3014) is the most significant, then it
> decreases the chances of a collision, but there is still potentially
> 50000 -100000 files per logfile id.  Since this is a DB application though,
> I dont care about even distribution, I just need a unique way of identifying
> those three pieces of information.

Yep, this confirms it. Hashing can be used to reduce the necessary
comparisons considerably, by narrowing the search down to maybe
some dozend entries that must be checked one by one, but hashing
will not be able to give you a unique key.

Quote:
> I've tried some different methods, but none of them seemed to work!

Getting a unique key for a given data structure is difficult. One way
is to simply put the structure through a compression algorithm and
use the resulting sequence of bytes as the key number. Because that
is exactly what you want to do: reduce the size of the data, but retain
the uniqueness. That is compression without loss of information.

Stephan
(initiator of the campaign against grumpiness in c.l.c)



Mon, 13 Nov 2000 03:00:00 GMT  
 Hashing Function: Anyone know a good one?

Quote:

>I have a Database project that I am working on.

>The project parses large LOG files which contain references to other files
>that contain potential Y2k problems.
>(Multiple log files can be parsed n a single scan [wildcards can be used])

>The LOG files can contain upto 50000 of these file references, and therefore
>need to be parsed quickly.

>We are using a C++ Com DLL to do the parsing, and i have added an interface
>to generate the FileKey as a double

>I need a hashfunction that genereated a unique number based on three pieces
>of information:

>1) A unique number generated for each LOG file.
>2) A number representing the type of file ( 1,2 or 3)
>3) the path of the file.

A hash function does noit generate a unique value. A hash value tries to
distribute keys evenly over a range of values. Do some math. Take the
minimum number of bits requires to represent each in a character in a
file name distinctly and raise that number to the power of the maximum number
of characters in a file name. You are looking at huge number of bits. If you
want a unique number you are going to have to allocate them sequentially.

John - N8086N
------------------------------------------------
The US Senate (The finest bunch of men money can
rent) passed the "American Competitiveness Act".
Once again, money wins out over nation interests.

YOUR JOB IS AT STAKE! Write your congressman to
oppose the "American Competitiveness Act" when
it comes to the House.
------------------------------------------------
EMail Address:

|c.o.l.o.s.s.e.u.m.b.u.i.l.d.e.r.s.|
|c.o.m.|



Mon, 13 Nov 2000 03:00:00 GMT  
 Hashing Function: Anyone know a good one?


Quote:

>[snip]
>> I need a hashfunction that genereated a unique number based on three pieces
>> of information:

>> 1) A unique number generated for each LOG file.
>> 2) A number representing the type of file ( 1,2 or 3)
>> 3) the path of the file.
>Getting a unique key for a given data structure is difficult. One way
>is to simply put the structure through a compression algorithm and
>use the resulting sequence of bytes as the key number. Because that
>is exactly what you want to do: reduce the size of the data, but retain
>the uniqueness. That is compression without loss of information.

However, most general-purpose compression algorithms will not be able to
compress a short string like the above very much.  Compression algorithms
look for things that appear multiple times (e.g. common letters, common
sequences, long runs of the same letter, etc.), and generate short codes
for them.

A context-dependent compression algorithm would probably be necessary if
you want something useful.  For instance, if all the paths begin with
either /usr/ or /dev/, you could abbreviate these as 1/ and 2/,
respectively.  If there are a small number of choices at each level of the
hierarchy, you could combine them into appropriately-sized bit fields of a
single word.

--

GTE Internetworking, Powered by BBN, Cambridge, MA
*** DON'T SEND TECHNICAL QUESTIONS DIRECTLY TO ME, post them to newsgroups.



Mon, 13 Nov 2000 03:00:00 GMT  
 Hashing Function: Anyone know a good one?

Try:
http://ourworld.compuserve.com/homepages/bob_jenkins/blockcip.htm

He gives a good explaination of various algorithms to produce a unique
value for any given set of data. (well... unique to a certain degree
usually something with a collision ratio of 1:2^32)

Quote:


> [snip]
> > I need a hashfunction that genereated a unique number based on three pieces
> > of information:

> > 1) A unique number generated for each LOG file.
> > 2) A number representing the type of file ( 1,2 or 3)
> > 3) the path of the file.

> > if a record with those exact same pieces of information is inserted at a
> > later date, the insert should fail (therefore, using a time stamp wont work)

> At this point I begin to suspect, that you do not have the same
> definition of hashing, that is commenly accosiated with the term.
> What you want is one unique key for each unique object, in order
> to identify that object. This is not what hashing does.

> Hashing calculates a number for a given object, but this number is
> not necessarily unqiue. Two disctinct objects may produce the same
> hash key. That's why hashing always involves resolving clashes.
> The quality of a hash function is judged by the statistical uniquenes
> of the hash key.

> [snip]
> > i realize that if the Logfile Id (the 3014) is the most significant, then it
> > decreases the chances of a collision, but there is still potentially
> > 50000 -100000 files per logfile id.  Since this is a DB application though,
> > I dont care about even distribution, I just need a unique way of identifying
> > those three pieces of information.

> Yep, this confirms it. Hashing can be used to reduce the necessary
> comparisons considerably, by narrowing the search down to maybe
> some dozend entries that must be checked one by one, but hashing
> will not be able to give you a unique key.

> > I've tried some different methods, but none of them seemed to work!

> Getting a unique key for a given data structure is difficult. One way
> is to simply put the structure through a compression algorithm and
> use the resulting sequence of bytes as the key number. Because that
> is exactly what you want to do: reduce the size of the data, but retain
> the uniqueness. That is compression without loss of information.

> Stephan
> (initiator of the campaign against grumpiness in c.l.c)



Mon, 13 Nov 2000 03:00:00 GMT  
 Hashing Function: Anyone know a good one?

Quote:

> I need a hashfunction that genereated a unique number based on three pieces
> of information:

> 1) A unique number generated for each LOG file.
> 2) A number representing the type of file ( 1,2 or 3)
> 3) the path of the file.

As mentionned, what you want is not strictly speaking a hash-function.  The
first and second field are already rather small, so there's not much point
trying to compress them.  To concisely encode the path, I'd use an associative
table (hash-table or prefix-tree) that maps already-seen-paths to small unique
integers (maintained unique by the use of a counter, for instance).

        Stefan



Mon, 13 Nov 2000 03:00:00 GMT  
 Hashing Function: Anyone know a good one?

|
| [snip]
| > I need a hashfunction that genereated a unique number based on three pieces
| > of information:
| >
| > 1) A unique number generated for each LOG file.
| > 2) A number representing the type of file ( 1,2 or 3)
| > 3) the path of the file.
| >
| > if a record with those exact same pieces of information is inserted at a
| > later date, the insert should fail (therefore, using a time stamp wont work)
|
| At this point I begin to suspect, that you do not have the same
| definition of hashing, that is commenly accosiated with the term.
| What you want is one unique key for each unique object, in order
| to identify that object. This is not what hashing does.
|
| Hashing calculates a number for a given object, but this number is
| not necessarily unqiue. Two disctinct objects may produce the same
| hash key. That's why hashing always involves resolving clashes.
| The quality of a hash function is judged by the statistical uniquenes
| of the hash key.
|
But using a cryptographic strength hash function you can push down the
probability of such a hash collision sufficiently far that you don't
need to worry about it happening in your lifetime...  (Or in this case
you'd need, say, a million million million million log files before
the chance of a collision became realistic with a 192 bit crypto hash)

Does this really need to be cross-posted to comp.lang.functional?

__Mark



Tue, 14 Nov 2000 03:00:00 GMT  
 Hashing Function: Anyone know a good one?

;>
;>[snip]
;>> I need a hashfunction that genereated a unique number based on three pieces
;>> of information:
;>>
;>> 1) A unique number generated for each LOG file.
;>> 2) A number representing the type of file ( 1,2 or 3)
;>> 3) the path of the file.
;>>
;>> if a record with those exact same pieces of information is inserted at a
;>> later date, the insert should fail (therefore, using a time stamp wont work)
;>
;>At this point I begin to suspect, that you do not have the same
;>definition of hashing, that is commenly accosiated with the term.
;>What you want is one unique key for each unique object, in order

Quote:
>to identify that object. This is not what hashing does.

;>
Perfect hashing functions do exists, and can be used (although probably not
for the purpose described by the above poster!). See "A Letter-oriented
Minimal Perfect Hashing Function" in the "COMPUTER JOURNAL" Vol. 29, No 3
(1986) PP 277-281. He may be better of finding another sceme for assigning the
numbers.

;>Hashing calculates a number for a given object, but this number is
;>not necessarily unqiue. Two disctinct objects may produce the same
;>hash key. That's why hashing always involves resolving clashes.
;>The quality of a hash function is judged by the statistical uniquenes
;>of the hash key.
Correct, this is the general case.

regards, KevinBB.



Tue, 14 Nov 2000 03:00:00 GMT  
 Hashing Function: Anyone know a good one?


Quote:


> ;>
> ;>[snip]
> ;>> I need a hashfunction that genereated a unique number based on three pieces
> ;>> of information:
> ;>>
> ;>> 1) A unique number generated for each LOG file.
> ;>> 2) A number representing the type of file ( 1,2 or 3)
> ;>> 3) the path of the file.
> ;>>

If collisions can be handled occasionally,
placing all the info in a buffer and computing a 32bit
CRC may be unique enough.
If there are a very large number of objects
using a 64 bit CRC would likely make it sparse enough.

Ken Walter

Remove .zamboni to reply
All the above is hearsay and the opinion of no one in particular



Tue, 14 Nov 2000 03:00:00 GMT  
 Hashing Function: Anyone know a good one?



Quote:

>|
>| [snip]
>| > I need a hashfunction that genereated a unique number based on three pieces
>| > of information:
>| >
>| > 1) A unique number generated for each LOG file.
>| > 2) A number representing the type of file ( 1,2 or 3)
>| > 3) the path of the file.
>| > if a record with those exact same pieces of information is inserted at a
>| > later date, the insert should fail (therefore, using a time stamp wont work)>|

Lets say you have a way to generate this unique number, how are you then
going to use it? Are you goinf to create a separate index datastructre?
if so the details of that will be significant. Either way I don't see that
generating this unique number will help you much.

How often are these collisions likely to occur?

Quote:
>| At this point I begin to suspect, that you do not have the same
>| definition of hashing, that is commenly accosiated with the term.
>| What you want is one unique key for each unique object, in order
>| to identify that object. This is not what hashing does.
>|
>| Hashing calculates a number for a given object, but this number is
>| not necessarily unqiue. Two disctinct objects may produce the same
>| hash key. That's why hashing always involves resolving clashes.
>| The quality of a hash function is judged by the statistical uniquenes
>| of the hash key.
>|
>But using a cryptographic strength hash function you can push down the
>probability of such a hash collision sufficiently far that you don't
>need to worry about it happening in your lifetime...

A cryptographic hash function is like a sledgehammer to the proverbial nut.
The main property a cryptographic hash function has is that it makes it
difficult to create a set of data that hashes to a particular value.
However that property isn't needed here. Certainly as part of that it needs
to minimise the probability of real data hashing to the same value by
chance, but there are much simpler ways to achieve that. That is the
precise property that a good checksum algorithm needs and something like
a CRC has more than proved its worth in that field.

Quote:
> (Or in this case
>you'd need, say, a million million million million log files before
>the chance of a collision became realistic with a 192 bit crypto hash)

The main thing here is the width of the resulting value rather than the
algorithm used. 192 bits is probably excessive in this case. If space is no
object then the original data might as well be stored in full. The stated
problem seemed to be one of speed. If the whole LOG file is being scanned
then the bottleneck is most likelt o be disk speed in which case faster
testing isn't going to help much, if at all.

Quote:
>Does this really need to be cross-posted to comp.lang.functional?

Probably not, I've removed that newsgroup from follow-ups.

--
-----------------------------------------


-----------------------------------------



Fri, 17 Nov 2000 03:00:00 GMT  
 
 [ 10 post ] 

 Relevant Pages 

1. anyone know good book on assembler?

2. Anyone know some good chess programming sources?

3. Does anyone know any good C programming URLs?

4. anyone know a good advanced C book?

5. Anyone Know a good site for Jobs?

6. Does anyone know about a good profiler?

7. Anyone know a good 2D Vector Libray

8. HELP: Need a good hash function

9. The good old Hashing function

10. good hashing function for similar strings

11. Good hash functions for pointers ...

12. Need good hashing functions

 

 
Powered by phpBB® Forum Software