Hashing Function: Anyone know a good one?
Author |
Message |
Ryan Gallaghe #1 / 10
|
 Hashing Function: Anyone know a good one?
I have a Database project that I am working on. The project parses large LOG files which contain references to other files that contain potential Y2k problems. (Multiple log files can be parsed n a single scan [wildcards can be used]) The LOG files can contain upto 50000 of these file references, and therefore need to be parsed quickly. We are using a C++ Com DLL to do the parsing, and i have added an interface to generate the FileKey as a double I need a hashfunction that genereated a unique number based on three pieces of information: 1) A unique number generated for each LOG file. 2) A number representing the type of file ( 1,2 or 3) 3) the path of the file. if a record with those exact same pieces of information is inserted at a later date, the insert should fail (therefore, using a time stamp wont work) Some examples for a single log file could be as follows: 3013 , 2 , 'd:\bc5\include\win16\colordlg.h' 3013 , 2 , 'd:\bc5\include\win16\error.h' 3013 , 1 , 'd:\bc5\include\win16\inl.h' 3013 , 2 , 'd:\bc5\include\win16\color15.h' 3013 , 3 , 'd:\bc5\include\win16\apitext.h' 3013 , 2 , 'd:\bc5\include\win16\values.h' And the next log file would be something like 3014 , 2 , 'd:\bc5\include\win16\colordlg.h' 3014 , 2 , 'd:\bc5\include\win16\error.h' 3014 , 3 , 'd:\bc5\include\win16\inl.h' 3014 , 1 , 'd:\bc5\include\win16\color15.mdb' 3014 , 3 , 'd:\bc5\include\win16\apitext.h' 3014 , 2 , 'd:\bc5\include\win16\values.h' So potentially the many of the paths could be very similar (except for the last 10 or 20 characters.) i realize that if the Logfile Id (the 3014) is the most significant, then it decreases the chances of a collision, but there is still potentially 50000 -100000 files per logfile id. Since this is a DB application though, I dont care about even distribution, I just need a unique way of identifying those three pieces of information. I've tried some different methods, but none of them seemed to work! PLEASE HELP ME! =================== (Mail Me Your Replies if possible) Ryan
|
Sun, 12 Nov 2000 03:00:00 GMT |
|
 |
Stephan Wilm #2 / 10
|
 Hashing Function: Anyone know a good one?
[snip] Quote: > I need a hashfunction that genereated a unique number based on three pieces > of information: > 1) A unique number generated for each LOG file. > 2) A number representing the type of file ( 1,2 or 3) > 3) the path of the file. > if a record with those exact same pieces of information is inserted at a > later date, the insert should fail (therefore, using a time stamp wont work)
At this point I begin to suspect, that you do not have the same definition of hashing, that is commenly accosiated with the term. What you want is one unique key for each unique object, in order to identify that object. This is not what hashing does. Hashing calculates a number for a given object, but this number is not necessarily unqiue. Two disctinct objects may produce the same hash key. That's why hashing always involves resolving clashes. The quality of a hash function is judged by the statistical uniquenes of the hash key. [snip] Quote: > i realize that if the Logfile Id (the 3014) is the most significant, then it > decreases the chances of a collision, but there is still potentially > 50000 -100000 files per logfile id. Since this is a DB application though, > I dont care about even distribution, I just need a unique way of identifying > those three pieces of information.
Yep, this confirms it. Hashing can be used to reduce the necessary comparisons considerably, by narrowing the search down to maybe some dozend entries that must be checked one by one, but hashing will not be able to give you a unique key. Quote: > I've tried some different methods, but none of them seemed to work!
Getting a unique key for a given data structure is difficult. One way is to simply put the structure through a compression algorithm and use the resulting sequence of bytes as the key number. Because that is exactly what you want to do: reduce the size of the data, but retain the uniqueness. That is compression without loss of information. Stephan (initiator of the campaign against grumpiness in c.l.c)
|
Mon, 13 Nov 2000 03:00:00 GMT |
|
 |
Petronious Arbit #3 / 10
|
 Hashing Function: Anyone know a good one?
Quote:
>I have a Database project that I am working on. >The project parses large LOG files which contain references to other files >that contain potential Y2k problems. >(Multiple log files can be parsed n a single scan [wildcards can be used]) >The LOG files can contain upto 50000 of these file references, and therefore >need to be parsed quickly. >We are using a C++ Com DLL to do the parsing, and i have added an interface >to generate the FileKey as a double >I need a hashfunction that genereated a unique number based on three pieces >of information: >1) A unique number generated for each LOG file. >2) A number representing the type of file ( 1,2 or 3) >3) the path of the file.
A hash function does noit generate a unique value. A hash value tries to distribute keys evenly over a range of values. Do some math. Take the minimum number of bits requires to represent each in a character in a file name distinctly and raise that number to the power of the maximum number of characters in a file name. You are looking at huge number of bits. If you want a unique number you are going to have to allocate them sequentially. John - N8086N ------------------------------------------------ The US Senate (The finest bunch of men money can rent) passed the "American Competitiveness Act". Once again, money wins out over nation interests. YOUR JOB IS AT STAKE! Write your congressman to oppose the "American Competitiveness Act" when it comes to the House. ------------------------------------------------ EMail Address:
|c.o.l.o.s.s.e.u.m.b.u.i.l.d.e.r.s.| |c.o.m.|
|
Mon, 13 Nov 2000 03:00:00 GMT |
|
 |
Barry Margoli #4 / 10
|
 Hashing Function: Anyone know a good one?
Quote:
>[snip] >> I need a hashfunction that genereated a unique number based on three pieces >> of information: >> 1) A unique number generated for each LOG file. >> 2) A number representing the type of file ( 1,2 or 3) >> 3) the path of the file. >Getting a unique key for a given data structure is difficult. One way >is to simply put the structure through a compression algorithm and >use the resulting sequence of bytes as the key number. Because that >is exactly what you want to do: reduce the size of the data, but retain >the uniqueness. That is compression without loss of information.
However, most general-purpose compression algorithms will not be able to compress a short string like the above very much. Compression algorithms look for things that appear multiple times (e.g. common letters, common sequences, long runs of the same letter, etc.), and generate short codes for them. A context-dependent compression algorithm would probably be necessary if you want something useful. For instance, if all the paths begin with either /usr/ or /dev/, you could abbreviate these as 1/ and 2/, respectively. If there are a small number of choices at each level of the hierarchy, you could combine them into appropriately-sized bit fields of a single word. --
GTE Internetworking, Powered by BBN, Cambridge, MA *** DON'T SEND TECHNICAL QUESTIONS DIRECTLY TO ME, post them to newsgroups.
|
Mon, 13 Nov 2000 03:00:00 GMT |
|
 |
Desmond Daignaul #5 / 10
|
 Hashing Function: Anyone know a good one?
Try: http://ourworld.compuserve.com/homepages/bob_jenkins/blockcip.htm He gives a good explaination of various algorithms to produce a unique value for any given set of data. (well... unique to a certain degree usually something with a collision ratio of 1:2^32) Quote:
> [snip] > > I need a hashfunction that genereated a unique number based on three pieces > > of information: > > 1) A unique number generated for each LOG file. > > 2) A number representing the type of file ( 1,2 or 3) > > 3) the path of the file. > > if a record with those exact same pieces of information is inserted at a > > later date, the insert should fail (therefore, using a time stamp wont work) > At this point I begin to suspect, that you do not have the same > definition of hashing, that is commenly accosiated with the term. > What you want is one unique key for each unique object, in order > to identify that object. This is not what hashing does. > Hashing calculates a number for a given object, but this number is > not necessarily unqiue. Two disctinct objects may produce the same > hash key. That's why hashing always involves resolving clashes. > The quality of a hash function is judged by the statistical uniquenes > of the hash key. > [snip] > > i realize that if the Logfile Id (the 3014) is the most significant, then it > > decreases the chances of a collision, but there is still potentially > > 50000 -100000 files per logfile id. Since this is a DB application though, > > I dont care about even distribution, I just need a unique way of identifying > > those three pieces of information. > Yep, this confirms it. Hashing can be used to reduce the necessary > comparisons considerably, by narrowing the search down to maybe > some dozend entries that must be checked one by one, but hashing > will not be able to give you a unique key. > > I've tried some different methods, but none of them seemed to work! > Getting a unique key for a given data structure is difficult. One way > is to simply put the structure through a compression algorithm and > use the resulting sequence of bytes as the key number. Because that > is exactly what you want to do: reduce the size of the data, but retain > the uniqueness. That is compression without loss of information. > Stephan > (initiator of the campaign against grumpiness in c.l.c)
|
Mon, 13 Nov 2000 03:00:00 GMT |
|
 |
Stefan Monnie #6 / 10
|
 Hashing Function: Anyone know a good one?
Quote:
> I need a hashfunction that genereated a unique number based on three pieces > of information: > 1) A unique number generated for each LOG file. > 2) A number representing the type of file ( 1,2 or 3) > 3) the path of the file.
As mentionned, what you want is not strictly speaking a hash-function. The first and second field are already rather small, so there's not much point trying to compress them. To concisely encode the path, I'd use an associative table (hash-table or prefix-tree) that maps already-seen-paths to small unique integers (maintained unique by the use of a counter, for instance). Stefan
|
Mon, 13 Nov 2000 03:00:00 GMT |
|
 |
Mark Tillotso #7 / 10
|
 Hashing Function: Anyone know a good one?
| | [snip] | > I need a hashfunction that genereated a unique number based on three pieces | > of information: | > | > 1) A unique number generated for each LOG file. | > 2) A number representing the type of file ( 1,2 or 3) | > 3) the path of the file. | > | > if a record with those exact same pieces of information is inserted at a | > later date, the insert should fail (therefore, using a time stamp wont work) | | At this point I begin to suspect, that you do not have the same | definition of hashing, that is commenly accosiated with the term. | What you want is one unique key for each unique object, in order | to identify that object. This is not what hashing does. | | Hashing calculates a number for a given object, but this number is | not necessarily unqiue. Two disctinct objects may produce the same | hash key. That's why hashing always involves resolving clashes. | The quality of a hash function is judged by the statistical uniquenes | of the hash key. | But using a cryptographic strength hash function you can push down the probability of such a hash collision sufficiently far that you don't need to worry about it happening in your lifetime... (Or in this case you'd need, say, a million million million million log files before the chance of a collision became realistic with a 192 bit crypto hash) Does this really need to be cross-posted to comp.lang.functional? __Mark
|
Tue, 14 Nov 2000 03:00:00 GMT |
|
 |
Kevin B Bla #8 / 10
|
 Hashing Function: Anyone know a good one?
;> ;>[snip] ;>> I need a hashfunction that genereated a unique number based on three pieces ;>> of information: ;>> ;>> 1) A unique number generated for each LOG file. ;>> 2) A number representing the type of file ( 1,2 or 3) ;>> 3) the path of the file. ;>> ;>> if a record with those exact same pieces of information is inserted at a ;>> later date, the insert should fail (therefore, using a time stamp wont work) ;> ;>At this point I begin to suspect, that you do not have the same ;>definition of hashing, that is commenly accosiated with the term. ;>What you want is one unique key for each unique object, in order Quote: >to identify that object. This is not what hashing does.
;> Perfect hashing functions do exists, and can be used (although probably not for the purpose described by the above poster!). See "A Letter-oriented Minimal Perfect Hashing Function" in the "COMPUTER JOURNAL" Vol. 29, No 3 (1986) PP 277-281. He may be better of finding another sceme for assigning the numbers. ;>Hashing calculates a number for a given object, but this number is ;>not necessarily unqiue. Two disctinct objects may produce the same ;>hash key. That's why hashing always involves resolving clashes. ;>The quality of a hash function is judged by the statistical uniquenes ;>of the hash key. Correct, this is the general case. regards, KevinBB.
|
Tue, 14 Nov 2000 03:00:00 GMT |
|
 |
Ken Walt #9 / 10
|
 Hashing Function: Anyone know a good one?
Quote:
> ;> > ;>[snip] > ;>> I need a hashfunction that genereated a unique number based on three pieces > ;>> of information: > ;>> > ;>> 1) A unique number generated for each LOG file. > ;>> 2) A number representing the type of file ( 1,2 or 3) > ;>> 3) the path of the file. > ;>>
If collisions can be handled occasionally, placing all the info in a buffer and computing a 32bit CRC may be unique enough. If there are a very large number of objects using a 64 bit CRC would likely make it sparse enough. Ken Walter Remove .zamboni to reply All the above is hearsay and the opinion of no one in particular
|
Tue, 14 Nov 2000 03:00:00 GMT |
|
 |
Lawrence Kir #10 / 10
|
 Hashing Function: Anyone know a good one?
Quote:
>| >| [snip] >| > I need a hashfunction that genereated a unique number based on three pieces >| > of information: >| > >| > 1) A unique number generated for each LOG file. >| > 2) A number representing the type of file ( 1,2 or 3) >| > 3) the path of the file. >| > if a record with those exact same pieces of information is inserted at a >| > later date, the insert should fail (therefore, using a time stamp wont work)>|
Lets say you have a way to generate this unique number, how are you then going to use it? Are you goinf to create a separate index datastructre? if so the details of that will be significant. Either way I don't see that generating this unique number will help you much. How often are these collisions likely to occur? Quote: >| At this point I begin to suspect, that you do not have the same >| definition of hashing, that is commenly accosiated with the term. >| What you want is one unique key for each unique object, in order >| to identify that object. This is not what hashing does. >| >| Hashing calculates a number for a given object, but this number is >| not necessarily unqiue. Two disctinct objects may produce the same >| hash key. That's why hashing always involves resolving clashes. >| The quality of a hash function is judged by the statistical uniquenes >| of the hash key. >| >But using a cryptographic strength hash function you can push down the >probability of such a hash collision sufficiently far that you don't >need to worry about it happening in your lifetime...
A cryptographic hash function is like a sledgehammer to the proverbial nut. The main property a cryptographic hash function has is that it makes it difficult to create a set of data that hashes to a particular value. However that property isn't needed here. Certainly as part of that it needs to minimise the probability of real data hashing to the same value by chance, but there are much simpler ways to achieve that. That is the precise property that a good checksum algorithm needs and something like a CRC has more than proved its worth in that field. Quote: > (Or in this case >you'd need, say, a million million million million log files before >the chance of a collision became realistic with a 192 bit crypto hash)
The main thing here is the width of the resulting value rather than the algorithm used. 192 bits is probably excessive in this case. If space is no object then the original data might as well be stored in full. The stated problem seemed to be one of speed. If the whole LOG file is being scanned then the bottleneck is most likelt o be disk speed in which case faster testing isn't going to help much, if at all. Quote: >Does this really need to be cross-posted to comp.lang.functional?
Probably not, I've removed that newsgroup from follow-ups. -- -----------------------------------------
-----------------------------------------
|
Fri, 17 Nov 2000 03:00:00 GMT |
|
|
|