MMS-HASH Coding 
Author Message
 MMS-HASH Coding

Earlier this year there was a request for Tom Dowling's Hash function
as used by MMSFORTH in 1981.  In doing spring house-cleaning I came
across the Forml Proceedings in which it appeared.

It is short enough to transcribe and post as the next article in this
thread.

After that I will go into what I did to understand and appreciate it.
--
Procedamus in pace, Wil



Sun, 12 Oct 1997 03:00:00 GMT  
 MMS-HASH Coding

1981 FORML PROCEEDINGS, Volume 1, p. 127f.

             HASH-ENCODED FORTH NAME FIELDS: ONE YEAR OF USE
                Tom Dowling, Miller Microcomputer Services
           61 Lake Shor Road, Natick, MA 01760 (617/653-6136)

        A hash-encoded name field algorithm (1) was presented by the
author at the ROCHESTER FORTH STANDARDS CONFERENCE, May 12-15, 1981.
After a year of use in several projects involving greater then [sic]
100 screens of source code, no major problems have be encountered.

(1) "Hash-Encoded FORTH Name Fields" by Tom Dowling,
    1981 ROCHESTER FORTH STANDARDS CONFERENCE, MAY 12-15, 1981,
    pp. 229-230.

        In the thousands of words defined in these projects the only
collision has been between words of the same size ending in the
suffixes BUFF and SIZE ( RCV-SIZE RCV-BUFF );  I now use the suffix
BUFR instead of BUFF.  This problem was not of major consequence
because warnings of "worm words" are displayed.  The advantages of
this algorithm, no time or space penalties for long names, have
encouraged the use of long descriptive names in all projects.

        The paper in the ROCHESTER FORTH STANDARDS CONFERENCE PROCEEDINGS
was printed with errors.  Therefor I will reiterate the method and
present the algorithm in high level code.  The MMSFORTH Version 2.0
system uses this algorithm for its name field encoding and its
effect on compilation speed is minimal.

        This algorithm was developed to offer 31-character name
uniqueness while still maintaining the space and compilation speed
advantages of fixed size name fields.  The algorithm also offers the
advantage of full 4-character names if the fourth character is from
the first 64 ASCII printable characters.  The algorithm also
guarantees uniqueness if any single character of the name field is
different and from the first 64 printable ASCII characters;  e.g.,
VAR-X1DCMPLX and VAR-X2-CMPLX are unique.

        The algorithm generates a 4-byte name field from names consisting
of 1 to 31 characters from the 96 printable [sic] ASCII codes.  The name
field consists of the name length, the first three characters, and a
hash character derived from all additional characters in the name.
Three ASCII characters treated as base-96 digits require 20 bits;
the name length field, limited to 31 characters, requires 5 bits;
one bit is required to indicate immediate words; leaving 6 bits for
the hash character.  This hash character will be chosen from the
first 64 printable ASCII codes.  This system guarantees that all
names which were unique using the older length and first 3
characters will also be unique in this implementation.  Duplicate
names will be flagged with a warning.  A printout of the algorithm
follows:

( d -> 96*d )
: 96* 2DUP 2DUP D+ D+ 2DUP D+ 2DUP D+ 2DUP D+ 2DUP D+ 2DUP D+ ;
( adr of string -> double-word-encoded-representation )



            DO  2*  DUP 63 > IF  1+  THEN

            LOOP
      ELSE  0

: XDECODE   ( Encoded-rep -> ) 3 SPACES 1024 /MOD 4 .R SPACE
  16 /MOD >R 96 U/MOD 96 /MOD 32 + EMIT 32 + EMIT 32 + EMIT
  R> SPACE 32 + EMIT SPACE  ;
( use as XX word )
: XX   BL WORD XENCODE XDECODE ;



Sun, 12 Oct 1997 03:00:00 GMT  
 MMS-HASH Coding

(     MMS-HASH After No-Brain Re-Phrasing

In trying to understand this algorithm my first step was to re-phrase
it.  FigForth or Forth-79 `U/MOD' was changed to `UM/MOD'.  My
convention for stack-state comments was introduced.  A minimum number
of stack-comments were added.  It is still write-only but I can begin
to see the forest and not a pile of trees.

The inscrutability of the code is not a criticism.  The original was
written when all source code was on blocks, and had to fit in the
16 by 64 bed of Procrustes.

)

: 96*                                     ( d . -- 96*d . )
      2DUP  2DUP D+  D+  2DUP D+  2DUP D+  2DUP D+  2DUP D+  2DUP D+
;

: XENCODE         ( adr-of-string -- double-word-encoded-representation )
       >R                                       ( )

            DO                                  ( d .)

            LOOP

            IF

                  DO                            ( d . hash)
                        2*  DUP  63 >  IF    1+    THEN

                  LOOP
            ELSE                                ( d .)
                  0
            THEN                                ( d . hash)

;

: XDECODE                                 ( Encoded-rep -- )
      3 SPACES  1024 /MOD  4 .R    SPACE
      16 /MOD  >R                              
            96 UM/MOD  96 /MOD  32 +  EMIT  32 +  EMIT 32 +  EMIT ( )
      R>                                        ( hash)
      SPACE    32 +  EMIT     SPACE             ( )
;

: XX                                      ( Use as `XX word'. )
      BL WORD  XENCODE  XDECODE
;



Sun, 12 Oct 1997 03:00:00 GMT  
 MMS-HASH Coding

Quote:
>( d -> 96*d )
>: 96* 2DUP 2DUP D+ D+ 2DUP D+ 2DUP D+ 2DUP D+ 2DUP D+ 2DUP D+ ;
>( adr of string -> double-word-encoded-representation )



>            DO  2*  DUP 63 > IF  1+  THEN

>            LOOP
>      ELSE  0

>: XDECODE   ( Encoded-rep -> ) 3 SPACES 1024 /MOD 4 .R SPACE
>  16 /MOD >R 96 U/MOD 96 /MOD 32 + EMIT 32 + EMIT 32 + EMIT
>  R> SPACE 32 + EMIT SPACE  ;
>( use as XX word )
>: XX   BL WORD XENCODE XDECODE ;

What a beautiful example!  Thanks!

I noticed how I studied this code.  First, I look at 96* .  It looks
like it's supposed to multiply a double number by 96.  First it does
2DUP 2DUP D+ D+ which clearly multiples by 3.  Then it does 2DUP D+ 5
times, to multiply by 2 5 times.  Looks good.  I quick write

: 96* D2* D2* D2* D2* D2* 2DUP D2* D+ ;

which should do the same thing.  Shorter code, it may run faster or
slower depending.  D2* is in the double wordset so it's less portable.

Quote:
>( adr of string -> double-word-encoded-representation )



>            DO  2*  DUP 63 > IF  1+  THEN

>            LOOP
>      ELSE  0


Most of the verbiage here comes from a complicated process of setting DO
LOOP indices.  The contents don't look that bad, the first loop converts
at most 3 characters to a base-96 double number.  If there are more than
3 characters, the 2nd loop hashes them into 6 bits.  Finally it gets the
string count and puts it in high bits in the double number.

: XENCODE ( ca -- d )
                 >R




( d 0 n )          2* DUP 63 > IF 1+ THEN


This isn't much better.  Maybe worse.  The loop indices are still
complicated.  Different, but not much easier to read.  It's usually
wasted effort for me to rewrite other people's code to make it "better".
"If it ain't broke, don't fix it."  The other guy's code is very well
tested, mine is not, yet.  In rewriting I might make changes that give
subtly different performance in a few cases.  If some people use my
version and others use his we could get a maintenance nightmare.  But
the exercise has one great virtue -- after I rewrite his code and test
that the results are the same in every case I can find out about, I have
a much deeper understanding of his code.

Quote:
>: XDECODE   ( Encoded-rep -> ) 3 SPACES 1024 /MOD 4 .R SPACE
>  16 /MOD >R 96 U/MOD 96 /MOD 32 + EMIT 32 + EMIT 32 + EMIT
>  R> SPACE 32 + EMIT SPACE  ;

: XDECODER ( d -- ca 4 u )
( d )      <# 1024 /MOD >R
( d )      16 /MOD 32+ HOLD
( d )      96 UM/MOD SWAP 32 + HOLD
( q )      96 /MOD SWAP 32 + HOLD
( q )      32 + HOLD 0 0 #>
( ca 4 )   R> ;

: XDECODE ( d -- ) XDECODER >R TYPE SPACE R> 4 .R ;

I might want to do something else with the decoded name than print it
immediately.  But where should it be stored?  My first thought was the
output buffer, but that has to be used pretty quick.  I had to print it
before printing the count.  I could allot a string and use that, which
is not allowed by the standard while a definition is in progress.  ( Are
there systems where it absolutely fails, even if you -ALLOT the space
later?)  Use the PAD which isn't re-entrant and isn't there on some
systems? Nothing looked just right.

: XDECODER ( d -- 4chars u )
( d )                    1024 /MOD >R
( d )                    16 /MOD 32 + ROT ROT
( char4 d )              96 UM/MOD 32 +UNDER
( char4 char3 q )        96 /MOD 32 +UNDER
( char4 char3 char2 q2 ) 32 +
( c4 c3 c2 c1 )          R> ;

: XDECODE ( d -- ) XDECODER 4 .R SPACE EMIT EMIT EMIT EMIT SPACE ;

Leaving it on the data stack doesn't look great either.  5 items.  But
you can print them or compare them.  I don't know what's best.  Maybe
just print them out and worry about something else when you need it.

In math class they said you don't even have an illusion of communication
unless you can say it back in your own words and the other guy agrees
you got it right.  If I can say it to the computer in my own words and
the computer says it matches, that's similar.

Maybe I don't need that much understanding.  If I can follow the stack
diagram and the docs, I can use it without understanding it in detail.
But as soon as I use it for something the other guy didn't intend, it's
my lookout whether it works; he could have taken shortcuts that don't
work for me.



Mon, 13 Oct 1997 03:00:00 GMT  
 MMS-HASH Coding

Quote:

>                                                   I use two
>off-the-shelf definitions from my personal library.
>)

>: BOUNDS        OVER +  SWAP                    ( a n -- a+n a )

>: ??    BL WORD COUNT S" IF ~ THEN" EVALUATED ; IMMEDIATE
>( See the current issue of Forth Dimensions for this. )

The ~ word looks very interesting. I won't be able to get
the current FD (I subscribed a week ago), so maybe you can
give me a quick hint?




Mon, 13 Oct 1997 03:00:00 GMT  
 MMS-HASH Coding

: The ~ word looks very interesting. I won't be able to get
: the current FD (I subscribed a week ago), so maybe you can
: give me a quick hint?

From Forth Dimensions.

Macro Processing for Forth
Wil Baden
Costa Mesa, California

The most popular programming language automatically pre-
processes source through a macro processor before compiling. This
augments the power of the language tremendously. For a long time I
wanted macro processing for Forth. The problem for Forth is
aggravated because it has an additional complication that macros
must work when interpreting as well as compiling.

Like so many things in life that you keep putting off, when you
finally get around to doing it, it's easy.  All that is needed is a
variation of EVALUATE that will make substitutions for a parameter
flag before evaluating.  I call this `EVALUATED'.  Given `EVALUATE',
EVALUATED is easy to write.  EVALUATED makes substitutions in a
string, and then uses `EVALUATE'.  See Listing One.

I use the swung dash `~', often miscalled `tilde', as the parameter
flag. As such I call it `twiddle' or `parameter'. This seems the least
likely character to conflict with existing Forth words.  If a swung
dash is needed in a macro, `S" ~"' can be used as a parameter.

The actual parameters for a macro are selected by looking ahead in
the input source.  Words like `PARSE-WORD', defined

        : PARSE-WORD ( -- string . ) BL WORD COUNT ;

are used to pick up an actual parameter.

Within a macro-template, twiddle `~' is used as a place-holder for
a parameter.  The actual parameters are character-string (c-addr len)
pairs on the stack.
--
Procedamus in pace, Wil



Mon, 13 Oct 1997 03:00:00 GMT  
 MMS-HASH Coding
I shouldn't be doing this since I use Pygmy, but:

: 3HASH  ( 0. a u - d)  3 MIN  OVER + SWAP

: ?HASH  ( 0. a u - 0 n)  DUP  3 >
   IF  OVER + SWAP  3 +

   ELSE 2DROP THEN ;
: UHASH  ( u - n)  64 * ;
: XENCODE  ( $ - d)  COUNT 2>R


        R> UHASH +  16 *  D+
        R> DROP ;

Again, I consider this an exercise.  It is a strength of
Forth that it allows many ways of talking to computers.

-

  New York State Department of Civil Service
  Albany, NY  12239



Tue, 14 Oct 1997 03:00:00 GMT  
 MMS-HASH Coding
If we start

: XENCODE  ( $ - d)  COUNT 2>

I think the rest can be laid out quite nicely and factored to taste.

As an aside, XENCODE might be a good opportunity for Sergeant and
Charlton to demonstrate the charms of their preferred loops.

-

  New York State Department of Civil Service
  Albany, NY  12239



Tue, 14 Oct 1997 03:00:00 GMT  
 MMS-HASH Coding
Of course I meant

: XENCODE  ( $ - d)  COUNT 2>R

TGIF.
-

  New York State Department of Civil Service
  Albany, NY  12239



Tue, 14 Oct 1997 03:00:00 GMT  
 MMS-HASH Coding
: Of course I meant

: : XENCODE  ( $ - d)  COUNT 2>R

Leo reminds me that the next step is to factor.  Such
polishing is supererogatory, but may help in the future.

What impresses me now is how unacceptable this code would have
been when the paper was printed.

: hash-first-3-characters                 ( string . -- n )
      0 ROT ROT                                 ( n string .)
      3 MIN  BOUNDS  DO                         ( n)
            96 *

      1 CHARS +LOOP
;

: hash-remaining-characters               ( string . -- hash )
      DUP  3 >  NOT
            IF    2DROP    0  EXIT    THEN
      0 ROT ROT                                 ( hash string .)
      BOUNDS  3 +  DO                           ( hash)
            2*
            DUP  63 >  ?? 1+

            63 AND
      1 CHARS +LOOP
;

: XENCODE         ( adr-of-string -- 31-bit-encoded-representation )
      DUP COUNT  hash-first-3-characters        ( adr n)
      OVER COUNT  hash-remaining-characters     ( adr n hash)

      20 LSHIFT
      +                                         ( hash)
;



Tue, 14 Oct 1997 03:00:00 GMT  
 MMS-HASH Coding

writes:

Quote:
>( d -> 96*d )
>: 96* 2DUP 2DUP D+ D+ 2DUP D+ 2DUP D+ 2DUP D+ 2DUP D+ 2DUP D+ ;
>( adr of string -> double-word-encoded-representation )



>            DO  2*  DUP 63 > IF  1+  THEN

>            LOOP
>      ELSE  0

>: XDECODE   ( Encoded-rep -> ) 3 SPACES 1024 /MOD 4 .R SPACE
>  16 /MOD >R 96 U/MOD 96 /MOD 32 + EMIT 32 + EMIT 32 + EMIT
>  R> SPACE 32 + EMIT SPACE  ;
>( use as XX word )
>: XX   BL WORD XENCODE XDECODE ;

Here's how I would write this little thing in my forth
in a production environment:

--------------------------------------------------
Multiply double by 96
--------------------------------------------------
: 96d* ( d -- 96*d )
      2dup
         2dup d+ ( *2)
      d+      ( *3)
      2dup d+ ( *6)
      2dup d+ ( *12)
      2dup d+ ( *24)
      2dup d+ ( *48)
      2dup d+ ( *96) ;

-----------------------------------------------------------------
Encode dictionaly headers to a double word hash code
where hash code is 32 bits:
      cccccc                              -- string count 1..31
            hhhhhh                        -- upper x-3 chars hashed
                  xxxx xxxxxxxxxxxxxxxx   -- lower 3 chars encoded
-----------------------------------------------------------------
: XENCODE  ( adr-of-string -- double-word-encoded-representation )
      toint str
      --------------------------------------------------
      encode first 3 chars
            a+96*b+96*96*c = d8000h max ( 20 bits )
                  xxxx xxxxxxxxxxxxxxxx   -- lower 3 chars encoded
      --------------------------------------------------
      0 0

         96d*

      loop

      --------------------------------------------------
      encode rest of chars in 6bit hash code
            hhhhhh                        -- upper x-3 chars hashed
      --------------------------------------------------
      0 0

         2*
         dup 3fh > if 1+ then

         3fh and
      loop

      --------------------------------------------------
      put it all together
      cccccc                              -- count 1..31
            hhhhhh                        -- upper x-3 chars hashed
                  xxxx xxxxxxxxxxxxxxxx   -- lower 3 chars encoded
      --------------------------------------------------

When I adopted local variables (TOINT <varname>), I also re-wrote the DO.

Instead of "<END> <START> DO" I adopted the concept of

   <start> <count> DOWITH <index> <body> LOOP

   -- also known to forthers as BOUNDS DO --

A new local variable named <index> is created with the initial
value of <start>. LOOP will increment <index> by 1 and +LOOP can
increment (of decrement) by other values. The loop is executed
exactly <count> times. The advandate is that +LOOP modifies
the index, but has no side effect on the number of times the
loop executes. This is predetermined by <count>. You can also
change the value of <index> inside the loop with TO. The advantage
of a named variable as the loop index is that it's name does not
change in nested loop. ( I,J,K etc.). The name is taken out of
context when the LOOP finishes.

If you don't need an index, I also implement <count> TIMES <body> LOOP.

The DOWITH ALWAYS check for negative or zero counts,
and will not execute the loop. I've experiments with things
like ?DOWITH, but i found myself assuming i was using ?DOWITH
when I typed DOWITH. Since there is no harm in letting DOWITH
check for count <= 0 it does it by default.

The long horizontal bars start and end comments. This way comments
are nicely separated from code. One bar start a comment and the next bar
stops it.
(don't add bars in your comment!). This is an alternative to adding
comments
at the end of lines which always get messed up after editing the code.

I also use -- as an end-of-line comment (stolen from hypercard).

Other words that might be non-standard:

bl    ( -- 32 )             -- ASCII for space
3fh   ( -- 63 )             -- all numbers ending in h are hex (no need

s+    ( a l o -- a+o l-l )  -- add an offset to a start-len pair on the
stack


Digital Matrix Systems, Inc. Richardson,TX 1-800-DMS-1800



Tue, 14 Oct 1997 03:00:00 GMT  
 MMS-HASH Coding

Quote:
>Here's how I would write this little thing in my forth
>in a production environment:
>--------------------------------------------------
>Multiply double by 96
>--------------------------------------------------
>: 96d* ( d -- 96*d )
>      2dup
>         2dup d+ ( *2)
>      d+      ( *3)
>      2dup d+ ( *6)
>      2dup d+ ( *12)
>      2dup d+ ( *24)
>      2dup d+ ( *48)
>      2dup d+ ( *96) ;

< snip>

Your documentation is very impressive!  I found that _very_ readable.

Here is a minor thing that touches on matters we've been discussing.
What are the advantages and disadvantages of writing

: 96d* ( d -- 96*d ) 96 0 M*/ ;

or if you have it ( not standard)

: 96d* ( d -- 96*d ) 96 UMD* ;

The old way, we're making 14 calls and have something that needs to
be explained.  The 1st of these other ways is much shorter (but still
may need explanation for people who don't keep track of arcane double
operations).  A second problem is that this does a triple
multiplication followed by a triple division.  It would be slow on
some systems.  On other systems 11 extra calls might be slower.

And if speed is critical, why not write UMD* in assembler rather than
do a complicated stack-dance?  Or even write 96d* in assembler?  The
original version looks like the sort of micro-optimisation that
people say isn't worth doing.  Or maybe sometimes it is worth doing.

No criticism intended of anybody, I just want to consider the question.

Quote:
>Other words that might be non-standard:
>bl    ( -- 32 )             -- ASCII for space
>3fh   ( -- 63 )             -- all numbers ending in h are hex (no
>need for BASE)

>s+    ( a l o -- a+o l-l )  -- add an offset to a start-len pair on
>the stack

Uppercase BL is standard.  Lowercase bl has an environmental

name is changed.  s+ is /STRING

One other difference -- your "shift" is in the standard as RSHIFT and
LSHIFT .  Your way is portable, if I guess right what your way is:

: shift ( x n -- x' ) DUP 0< IF RSHIFT ELSE LSHIFT THEN ;

You are not writing for portability to other people's compilers, but
I'll also point out that you have assumed that a character is one
address unit (which seems to be the case on most systems).  Also
this is inherently a 16-bit algorithm.  It gets a 32-bit result in
whatever size double numbers are.  It should run correctly on a
32-bit system, giving a 64-bit result with the top item 0 .  Since a
32-bit system would require some change to use it anyway, it might
make sense to document an environmental dependency on 16-bit.  Then
on a larger system you'd have the choice of either using it as is and
throwing away the top, or rewriting it.

You could rewrite it as single-precision, or give the hash another 16
or 32 bits.  Or if you have some number of bits you want to use for
something else, give the hash less than 16 bits extra.

Thanks for posting!  I was surprised how much those
------------------------------------------
lines made the comments stand out.  I notice you have 2 lengths, and
it looks like both of them are longer than 31 characters.  Do you
have a special way to do that?



Wed, 15 Oct 1997 03:00:00 GMT  
 MMS-HASH Coding
W.Baden:

                HASH-ENCODED FORTH NAME FIELDS: ONE YEAR OF USE
                   Tom Dowling, Miller Microcomputer Services
...
   ( d -> 96*d )
   : 96* 2DUP 2DUP D+ D+ 2DUP D+ 2DUP D+ 2DUP D+ 2DUP D+ 2DUP D+ ;

Hmm...

Personally, I tend to prefer 31* when doing hashing.

Perhaps

: D2* 2DUP D+ ;                                 \ d -> 2*d
: D31* 2DUP D2* D2* D2* D2* D2* 2SWAP D2- ;     \ d -> 31*d

Many people probably already have D2*.

Raul D. Miller



Wed, 15 Oct 1997 03:00:00 GMT  
 MMS-HASH Coding

Here is my 32-bit version, currently used im my attempt writing a OS/2-
Forth:

\ hash encoding
: XENCODE  ( c-addr u -- 31-bit-encoded-representation )




        ?DO  2*  DUP  63 >  -                \ e.g. IF  1+  THEN

        LOOP
    THEN  2R> NIP  6 LSHIFT  +  20 LSHIFT  + ;

\ hash decoding ( 3 chars + hash char + ___ )
: XDECODE  ( 31-bit-encoded-rep -- )
    [ HEX  ]  7FFFFFFF   AND                    \ clear IMMEDIATE bit
    4000000  /MOD  >R                           \ save count
    100000   /MOD  SWAP
    [ DECIMAL ]  96 /MOD  96 /MOD
    32 +  EMIT  32 +  EMIT  32 +  EMIT          \ first 3 characters
    32 +  EMIT                                  \ hash character
    R>  4 MAX  4 ?DO  [CHAR] _  EMIT  LOOP ;    \ fill trailing chars with _

\ print the name pointed by nfa

There is a significant difference in XENCODE: It needs a String/Count  
argument.

Greetings Bernd
## CrossPoint v3.02 ##



Sun, 26 Oct 1997 03:00:00 GMT  
 
 [ 14 post ] 

 Relevant Pages 

1. Hash.new(Hash.new) doesn't use Hash.new as default value

2. Documentation for MMS Forth for TRS-80 Model One or Three

3. MMS Forth

4. FORTHWRITE by MMS

5. MMS Forth

6. Hash coding

7. Tutotial on Hash codes

8. hash table code in modula/oberon

9. Sample Hashing Code

10. hash coding

11. obj.hash now == obj.hash after?

12. Hash compression (Hash 'consing') circa 1957

 

 
Powered by phpBB® Forum Software