Hashing long input strings into short output strings 
Author Message
 Hashing long input strings into short output strings

I need a way to hash long input strings, up to 1600 chars or so, into
short output strings, around 40 chars or so.

Is there a good hash technique for doing this in C?

Thanx,
Brandon
--



Sat, 06 Dec 2003 09:16:24 GMT  
 Hashing long input strings into short output strings

Quote:

> I need a way to hash long input strings, up to 1600 chars or so, into
> short output strings, around 40 chars or so.

> Is there a good hash technique for doing this in C?

How fast must it be?  How important is it to avoid collisions?
If the answers are "not very", and "very much", use an md5sum
(google for that if you don't know what one is).  If the answers
are "very" and "not so much", take a look at this page:
        http://burtleburtle.net/bob/hash/doobs.html
--
"Some people *are* arrogant, and others read the FAQ."
--Chris Dollin
--



Sat, 06 Dec 2003 12:11:51 GMT  
 Hashing long input strings into short output strings

Quote:
> I need a way to hash long input strings, up to 1600 chars or so, into
> short output strings, around 40 chars or so.

> Is there a good hash technique for doing this in C?

Yes.
--
C-FAQ: http://www.eskimo.com/~scs/C-faq/top.html
 "The C-FAQ Book" ISBN 0-201-84519-9
C.A.P. FAQ: ftp://cap.connx.com/pub/Chess%20Analysis%20Project%20FAQ.htm
--



Sat, 06 Dec 2003 12:11:59 GMT  
 Hashing long input strings into short output strings


Quote:
> I need a way to hash long input strings, up to 1600 chars or so, into
> short output strings, around 40 chars or so.

> Is there a good hash technique for doing this in C?

Many. To many. Use a search engine to search for more.

Note that xoring is generally not a good way (character sets don't have an
uniform distribution, and languages are worse), nor is just adding.

A better way would be:
- Use srand() with a known value, or use some other method to get a pseudo
random stream. This needs not be perfect, but should be some good. standard
C/C++ rand/srand should do.
- Xor the string with this stream. This mixes up the bits a bit, and makes
the distribution more uniform.
- divide into 40 byte blocks, use some padding if needed. Multiply those
blocks and use the lower 40 bytes of the result. (extract the lower 40
bytes after each multiplication).

Note that I just thought this up, so it's bound to be not very good, but
it'll be better than simplistic methods.

Other than that, ask in a cryptography newsgroup, read "Applied
Cryptography" by Bruce Schneier. There you'll find how to really do this
kind of stuff.

HTH,
M4
--
alt.comp.lang.learn.c-c++ FAQ:
http://snurse-l.org/acllc-c++/faq
--



Sat, 06 Dec 2003 22:59:46 GMT  
 Hashing long input strings into short output strings

Quote:


> > I need a way to hash long input strings, up to 1600 chars or so, into
> > short output strings, around 40 chars or so.

> > Is there a good hash technique for doing this in C?

> Many. To many. Use a search engine to search for more.

> Note that xoring is generally not a good way (character sets don't have an
> uniform distribution, and languages are worse), nor is just adding.

> A better way would be:
> - Use srand() with a known value, or use some other method to get a pseudo
> random stream. This needs not be perfect, but should be some good.
standard
> C/C++ rand/srand should do.
> - Xor the string with this stream. This mixes up the bits a bit, and makes
> the distribution more uniform.
> - divide into 40 byte blocks, use some padding if needed. Multiply those
> blocks and use the lower 40 bytes of the result. (extract the lower 40
> bytes after each multiplication).

> Note that I just thought this up, so it's bound to be not very good, but
> it'll be better than simplistic methods.

> Other than that, ask in a cryptography newsgroup, read "Applied
> Cryptography" by Bruce Schneier. There you'll find how to really do this
> kind of stuff.

I like the methods you came up with.  But my question, on reading this post
is what is the OP trying to do?  Clearly 40 bytes is much too big for a
hashed address, even in this day and age.  So, given the hashed result, what
does he plan on doing with it?  I hate to ask (and hate even more to be
asked) questions like "Why do you want to do that?" but I can't resist this
time.
--



Sun, 07 Dec 2003 04:27:17 GMT  
 Hashing long input strings into short output strings

Quote:

> I need a way to hash long input strings, up to 1600 chars or so, into
> short output strings, around 40 chars or so.

> Is there a good hash technique for doing this in C?

> Thanx,
> Brandon
> --


To be more precise,

I have net names (signal names) in a schematic design. These are
generated name so some of them can get out of hand, i.e. 1600 characters
long. I need these to be under 40 characters or so. Avoiding collisions
are more important than speed. So given a string, I need to convert it
to something under 40 char. Placing '\0' in the 40th position won't work
because net names may only differ by a few characters or so. The first
40 chars could be the same, but its the last 40 chars that differ.

Brandon
--



Mon, 08 Dec 2003 00:07:24 GMT  
 Hashing long input strings into short output strings

Quote:

>  I hate to ask ... questions like "Why do you want to do that?"
> but I can't resist this time.

Who is John Galt?
--



Mon, 08 Dec 2003 00:08:32 GMT  
 Hashing long input strings into short output strings

Quote:

> > I need a way to hash long input strings, up to 1600 chars or so, into
> > short output strings, around 40 chars or so.

> I have net names (signal names) in a schematic design. These are
> generated name so some of them can get out of hand, i.e. 1600 characters
> long. I need these to be under 40 characters or so. Avoiding collisions
> are more important than speed. So given a string, I need to convert it
> to something under 40 char. Placing '\0' in the 40th position won't work
> because net names may only differ by a few characters or so. The first
> 40 chars could be the same, but its the last 40 chars that differ.

Brandon...

I infer that your signals are on fairly busy silicon. You may be able to
translate the characters in the signal names to a 6-bit encoding, then
compress by discarding the hi-order two bits of each character. Note that
this provides only a 25% reduction, so now you'll have to deal with
1200-character names. That's still a long way from 40-character signal
names!

It looks rather like you might be better off to build a symbol table and
substitute the table entry number for the signal name before processing, and
do the reverse when humans need to examine the results.

HTH
--
Morris Dovey
West Des Moines, Iowa USA

--



Mon, 08 Dec 2003 23:17:32 GMT  
 Hashing long input strings into short output strings

Quote:


> > I need a way to hash long input strings, up to 1600 chars or so, into
> > short output strings, around 40 chars or so.

> > Is there a good hash technique for doing this in C?

> How fast must it be?  How important is it to avoid collisions?
> If the answers are "not very", and "very much", use an md5sum
> (google for that if you don't know what one is).  If the answers
> are "very" and "not so much", take a look at this page:
>         http://burtleburtle.net/bob/hash/doobs.html

I found GNU version of md5sum. This looks like it will work very well.
Unlike other versions, it allows input from a string instead of a file,
with the --string=string option. The output is 32 chars long no matter
the size of the input string.

Thanks for all the input.

I apologize to all if I had posted this in the wrong group to start.

Brandon
--



Mon, 08 Dec 2003 23:18:00 GMT  
 Hashing long input strings into short output strings


Quote:

> >  I hate to ask ... questions like "Why do you want to do that?"
> > but I can't resist this time.

> Who is John Galt?

John Galt is a riddle wrapped in a mystery inside an enigma.

------------
"Elegance is for tailors"  - A. Einstein
--



Mon, 08 Dec 2003 23:18:14 GMT  
 Hashing long input strings into short output strings

Quote:

> I have net names (signal names) in a schematic design. These are
> generated name so some of them can get out of hand, i.e. 1600 characters
> long. I need these to be under 40 characters or so. Avoiding collisions
> are more important than speed. So given a string, I need to convert it
> to something under 40 char. Placing '\0' in the 40th position won't work
> because net names may only differ by a few characters or so. The first
> 40 chars could be the same, but its the last 40 chars that differ.

I guess why you want to avoid "collisions" is that you plan
to substitute the "hash" for the original name and need to
maintain uniqueness.  The way we dealt with similar needs in
MUVES was to invent a "name pool" package that mapped
arbitrary strings to integer indices.  (Much nicer than
40-character strings.)  Here is the interface documentation:

[ Nm.pkg 3K ]
        <Nm.h> -- MUVES "Nm" (name management) package definitions

        The Nm package supports multiple simultaneous "name pools",
        where a name pool consists of a set of character strings and
        associated small non-negative integral index numbers.  A name
        may be referred to by either the index number or the string,
        whichever best fits the application.  Each separate name pool
        is accessed via a "handle" of type NmPool provided by the
        application.  A valid empty name pool can be established
        simply by default C initialization of a static NmPool variable;
        otherwise, invoke NmInit() to initially establish the pool as
        empty.  NmClear() should be invoked to deallocate internal pool
        storage when it is no longer needed.

        Nm functions that return an error indication first register an
        appropriate error index via ErSet().  Note that validity
        checks are not performed unless DEBUG is in effect.

        void    NmInit( NmPool *pool )

        NmInit() initializes a name pool.  Nothing is assumed about the
        validity of its previous contents.

        A statically-allocated NmPool is guaranteed to be already
        properly initialized by the C implementation, so NmInit() is not
        necessary in such a case (although it does no damage).  NmInit()
        is intended for initialization of dynamically-allocated NmPools.

        Once the name pool has been properly initialized, NmClear()
        should be used whenever it is desired to re-initialize it, in
        order to ensure that storage dynamically allocated by the "Nm"
        package for the name pool is properly freed.

        void    NmClear( NmPool *pool )

        NmClear() empties an already-valid name pool.  Associated
        dynamically-allocated internal storage is freed.  Upon return,
        the name pool has been freshly initialized.

        If the name pool has not been previously initialized, NmInit()
        should be used instead.

        bool    NmDup( const NmPool *old, NmPool *new )

        NmDup() duplicates an existing name pool.  The new name pool,
        which must also already exist, should be empty before the call
        to NmDup(); otherwise, memory previously allocated for it will
        become forever inaccessible.  NmDup() returns true if the
        contents of the old name pool are successfully copied into the
        new one; if insufficient memory is available for the new name
        pool or if the old name pool is found to be corrupted, NmDup()
        returns false.

        int     NmIndex( const char *name, NmPool *pool, bool enter )

        NmIndex() finds a name in the specified name pool and returns
        its index number.  If the "enter" argument is true, NmIndex()
        will register the name if it is not already in the pool;
        otherwise, an unregistered name is considered an error and -1 is
        returned.  -1 is also returned if memory cannot be allocated.

        The returned index number remains valid until the pool is
        cleared or re-initialized.

        const char      *NmName( int index, const NmPool *pool )

        NmName() finds in the specified name pool the name string that
        corresponds to the given index (which must have been obtained
        from a previous call to NmIndex()) and returns a pointer to it.

        int     NmCount( const NmPool *pool )

        NmCount() returns the number of entries in the specified name
        pool.

        An empty (e.g. newly initialized) pool has 0 entries.



Mon, 08 Dec 2003 23:21:39 GMT  
 Hashing long input strings into short output strings
Hi,

Maybe a silly suggestion, but I figured I might go ahead anyway.  Sometimes
the "upper" part of data is more random than the "lower" part, so you might
suggest taking the last 40 characters.  This will work for anything under 40
characters, and will catch things like "sequals".

LecturaX

Quote:
> > I need a way to hash long input strings, up to 1600 chars or so, into
> > short output strings, around 40 chars or so.

> > Is there a good hash technique for doing this in C?

> > Thanx,
> > Brandon
> > --

> To be more precise,

> I have net names (signal names) in a schematic design. These are
> generated name so some of them can get out of hand, i.e. 1600 characters
> long. I need these to be under 40 characters or so. Avoiding collisions
> are more important than speed. So given a string, I need to convert it
> to something under 40 char. Placing '\0' in the 40th position won't work
> because net names may only differ by a few characters or so. The first
> 40 chars could be the same, but its the last 40 chars that differ.

--



Fri, 12 Dec 2003 14:02:24 GMT  
 Hashing long input strings into short output strings
Hi,

Maybe a silly suggestion, but I figured I might go ahead anyway.  Sometimes
the "upper" part of data is more random than the "lower" part, so you might
suggest taking the last 40 characters.  This will work for anything under 40
characters, and will catch things like "sequals".

LecturaX

Quote:
> > I need a way to hash long input strings, up to 1600 chars or so, into
> > short output strings, around 40 chars or so.

> > Is there a good hash technique for doing this in C?

> > Thanx,
> > Brandon
> > --

> To be more precise,

> I have net names (signal names) in a schematic design. These are
> generated name so some of them can get out of hand, i.e. 1600 characters
> long. I need these to be under 40 characters or so. Avoiding collisions
> are more important than speed. So given a string, I need to convert it
> to something under 40 char. Placing '\0' in the 40th position won't work
> because net names may only differ by a few characters or so. The first
> 40 chars could be the same, but its the last 40 chars that differ.

--



Fri, 12 Dec 2003 14:11:06 GMT  
 
 [ 13 post ] 

 Relevant Pages 

1. Hashing long input strings into short output strings

2. convert string to long, and long to string

3. template and CMap Trying to make a Hash table for Hashing a String to an Int

4. short/long/long long Formatting questions

5. input, output, input/output parameters?????

6. format string for long long

7. 32-bit shorts (was a long long title)

8. newbie needs help to strings, strings, strings

9. string hash values

10. efficient, clean string hashing in c++...

11. Cryptography, Hashing and Strings

12. String hash function

 

 
Powered by phpBB® Forum Software