String HashTables 
Author Message
 String HashTables

Hi All:

    I am currently interested in the problem of a very large dynamic
string hashtable.  That is, a data object that performs storage,
retrieval, removal, and hashing lookup on a large, variable number of
variable length strings (together with garbage collection, etc.).  I've
been looking around for code to handle this sort of thing, but have been
unsuccessful.  I guess one of the issues is that the primary use for
this sort of code is in high-performance compilers, servers, and
databases, and as such constitutes IP which people are a little hesitant
to part with, understandable.  Most of the free distribution code I've
looked at uses fixed sizes and numbers of strings (that's not bad, just
not what I'm looking for).
    My first question is where I could find some similar code, even if
it's a little basic.  I do have some prototype code, but it's pretty
bad.  Even code in C might be helpful (maybe in the distribution for
GCC?), if someone could give me a link to look at that would help.
    My second question is on the memory allocation issue.  One approach
is to let the OS handle it (specifically, I'm using StonyBrook on
WIN32), which would reduce the code considerably, as well as simplifying
the hashtable to a fixed array of pointers and values.  My only
hesitation is with respect to the performance issue; is going through
Windows for all memory allocation going to be a serious performance
hit?  Is handling memory "locally" going to be worth the effort of
writing the code?
    Anyway, any comments would be helpful, if I come up with anything
interesting I'll post it (this NG needs a little more traffic anyway).

                    -Z. Johnson



Tue, 04 Sep 2001 03:00:00 GMT  
 String HashTables

Quote:

>     I am currently interested in the problem of a very large dynamic
> string hashtable.

Perhabs following module that supports arbitrary long strings
could be something to start with:

| DEFINITION MODULE IdentSys;             (* Martin Hasch, Jan 1989 *)
|
|    TYPE
|       Identifier; (* as unique as the original identifier *)
|
|    (* storage of identifiers *)
|
|    PROCEDURE PutIdentChar(ch: CHAR);
|
|    PROCEDURE PutIdent(VAR id: Identifier);
|
|    (* retrieval of identifiers *)
|
|    PROCEDURE GetIdent(id: Identifier; VAR ident: ARRAY OF CHAR);
|
|    PROCEDURE GetIdentChar(id: Identifier; VAR ch: CHAR);
|       (* last value of ch: 0C *)
|
|    PROCEDURE GetLength(id: Identifier; VAR length: CARDINAL);
|       (* return length of identifier *)
|
| END IdentSys.

Note that this variant assures that two strings are equal iff
their corresponding references are equal. For this reason, it
does not offer an operation for disposal. It has an internal
hash organisation and stores the strings together in large
tables (each 512 bytes long) without fragmentation:

           +-----+                       +-----+
   table-->|  o------------------------->| NIL |
           +---+-+-+---+---+---+---+     +---+-+-+---+---+---+---+---+
           | s | t | r | 1 | s | t |     | r | 2 | s | 3 |   |   |   |
           +---+---+---+---+---+---+     +---+---+---+---+---+---+---+
             ^               ^                     ^       ^
          +--|--+         +--|--+         +-----+  |       |
    id1-->|  o  |   id2-->|  o  |   id3-->|  o-----+       |
       +->+-----+      +->+-----+    +--->+-----+      endoftable
       |  |len=4|      |  |  4  |    |    |  2  |
       |  +-----+      |  +-----+    |    +-----+
       |  |  o---------+  | NIL |    |    | NIL |
       |  +-----+         +-----+    |    +-----+
       |                             |
       +-----------+                 |
   hash-  +-----+--|--+-----+-----+--|--+-----+-----+-----+-----+
   table  | NIL |  o  | NIL | NIL |  o  | NIL | NIL | NIL | NIL |
          +-----+-----+-----+-----+-----+-----+-----+-----+-----+

This module is part of our Oberon compiler (written in Modula-2)
that is distributed under the terms of the GPL. If you want to
have the corresponding module, just drop me a note.

In Oberon, we have a more general module named ConstStrings doing this
as part of our library. You'll find more about ConstStrings on
http://www.mathematik.uni-ulm.de/oberon/0.5/lib/man/ConstStrings.html
ConstStrings may be freely distributed under the terms of the LGPL.

Andreas.

--
Andreas Borchert, Universitaet Ulm, SAI, Helmholtzstr. 18, 89069 Ulm,  Germany

WWW:    http://www.mathematik.uni-ulm.de/sai/borchert/



Tue, 04 Sep 2001 03:00:00 GMT  
 String HashTables

Quote:

> Hi All:

>     I am currently interested in the problem of a very large dynamic
> string hashtable.  That is, a data object that performs storage,
> retrieval, removal, and hashing lookup on a large, variable number of
> variable length strings (together with garbage collection, etc.).  I've
> been looking around for code to handle this sort of thing, but have been
> unsuccessful.  I guess one of the issues is that the primary use for
> this sort of code is in high-performance compilers, servers, and
> databases, and as such constitutes IP which people are a little hesitant
> to part with, understandable.  Most of the free distribution code I've
> looked at uses fixed sizes and numbers of strings (that's not bad, just
> not what I'm looking for).
>     My first question is where I could find some similar code, even if
> it's a little basic.  I do have some prototype code, but it's pretty
> bad.  Even code in C might be helpful (maybe in the distribution for
> GCC?), if someone could give me a link to look at that would help.

Have you checked "The Modula-2 Software Component Library" by Charles Lins?
I think that you may find some useful code there. I don't have my copy of
the books here, so I can't check it for you. Here is a reference to the
books:

http://www.amazon.com/exec/obidos/Author=Lins%2C%20Charles/t/002-0618...

--
        -- Aron

+-------------------------------------------------------------------------+
|     _   NB: Om du vil svare med e-post,     |"Sometimes the magic works,|
| __/\_\      m? du fjerne "spam-block."      | sometimes it doesn't."    |
|/\_\/_/      fr? adressa mi.                 |                           |
|\/_/\_\  NB: If you want to reply by e-mail, |    -- Chief Dan George,   |
|   \/_/      you have to remove "spam-block."|       in "Little Big Man" |
|             from my address.                |                           |
+-------------------------------------------------------------------------+



Tue, 04 Sep 2001 03:00:00 GMT  
 String HashTables
Check Lin's books- these contain string handling modules as well
as hash table modules, and the overall library is quite powerful,
albeit a bit of a learning curve.

Also, I am not aware of these libs residing anywhere on the web,
and you may have difficulty tracking Lins down to obtain a copy
of the sources.  Further, the sources were designed for a Mac
version of M2, if memory serves.

A few years ago, I was able to track down Lins, and she was very
gracious in sending me a set of diskettes with the source code
for these libs.  The reason I'm mentioning all of this is, you
mentioned that you are using StonyBrook's compiler.  I ported the
Lins libraries to the StonyBrook compiler (16-bit version, but a
32-bit version shouldn't be too difficult to do) for my work.  I
ported all of the modules in the books.

I would be happy to make these available provided Ms. Lins is
agreeable.

-Mike



Tue, 04 Sep 2001 03:00:00 GMT  
 String HashTables

Quote:

> On Fri, 19 Mar 1999 00:03:52 +0000,

> >Hi All:

> >    I am currently interested in the problem of a very large dynamic
> >string hashtable.  That is, a data object that performs storage,
> >retrieval, removal, and hashing lookup on a large, variable number of
> >variable length strings (together with garbage collection, etc.).  >
...Stuff deleted...
> Hmm, is this something that Perl, Python, Tcl or even awk might deal with?
> All source (C code, though) is available for these languages.  See:

> http://www.perl.org, http://www.python.org, http://www.scriptics.com

> for the first three.

> --
> William Burrow  --  New Brunswick, Canada             o
> Copyright 1999 William Burrow                     ~  /\
>                                                 ~  ()>()

Other languages also deal with arbitrary length string tables indexed by
arbitrary length strings. TRAC (prehistory!) and AWK (the so-called
'arrays') are classical examples.

C sources for 'gawk' are available at GNU repositories. GNU also have a
library module called 'gdbm' that handles thad kind of textual databases
(but not necessarily as hash tables).

Regards,
--
------------------------------------------------------------------------
Manuel Collado Machuca                    | Facultad de Informatica UPM
Universidad Politecnica de Madrid         | Campus de Montegancedo
Dep. LSIIS                                | Boadilla del Monte
Tel.+34-91-336.74.57 Fax.+34-91-336.74.12 | 28660  MADRID  -  SPAIN
------------------------------------------------------------------------



Fri, 07 Sep 2001 03:00:00 GMT  
 String HashTables

Quote:

> Check Lin's books- these contain string handling modules as well
> as hash table modules, and the overall library is quite powerful,
> albeit a bit of a learning curve.

> Also, I am not aware of these libs residing anywhere on the web,
> and you may have difficulty tracking Lins down to obtain a copy
> of the sources.  Further, the sources were designed for a Mac
> version of M2, if memory serves.

> A few years ago, I was able to track down Lins, and she was very
> gracious in sending me a set of diskettes with the source code
> for these libs.  The reason I'm mentioning all of this is, you
> mentioned that you are using StonyBrook's compiler.  I ported the
> Lins libraries to the StonyBrook compiler (16-bit version, but a
> 32-bit version shouldn't be too difficult to do) for my work.  I
> ported all of the modules in the books.

> I would be happy to make these available provided Ms. Lins is
> agreeable.

> -Mike

I'm very interested in getting a copy of the C. Lins library. Some time
ago I made an attempt (by ordinary mail, and asking Springer-Verlag)
without success. So, could anybody provide a copy of the sources? Or at
least give the present address of C. Lins.

Thanks,
--
------------------------------------------------------------------------
Manuel Collado Machuca                    | Facultad de Informatica UPM
Universidad Politecnica de Madrid         | Campus de Montegancedo
Dep. LSIIS                                | Boadilla del Monte
Tel.+34-91-336.74.57 Fax.+34-91-336.74.12 | 28660  MADRID  -  SPAIN
------------------------------------------------------------------------



Fri, 07 Sep 2001 03:00:00 GMT  
 String HashTables

Quote:

>     I am currently interested in the problem of a very large dynamic
> string hashtable...
>     My first question is where I could find some similar code, even if
> it's a little basic.

Feel free to take my Strings and Tables module from

http://www.mk.dmu.ac.uk/~gperkins/Modula2/Struct/

The whole string module (def, imp, and test) could be
useful.  The table module does not use hash implementation,
but maybe the def file would help.  To convert to hash
tables, find the proc types

TYPE Compare    = PROCEDURE( Key, Key ) : BOOLEAN;
     KeyAction  = PROCEDURE( Key );
     ItemAction = PROCEDURE( Item );
     Action     = PROCEDURE( Key, Item );

and add

     CardVal = PROCEDURE( Key ) : CARDINAL;

and find the creation function

(* Creation/Destruction *)
PROCEDURE New( Equal:Compare ) : Table;
  (* post: Result is empty table *)

and add a hash function to its parameters

(* Creation/Destruction *)
PROCEDURE New( Equal:Compare; hash:CardVal ) : Table;
  (* post: Result is empty table *)

and then write all the implementation code ;-)

Quote:
> My second question is on the memory allocation issue...
> hesitation is with respect to the performance issue; is going through
> Windows for all memory allocation going to be a serious performance
> hit?  

Check out some hash table literature.  As a *very* rough guide,
initialise the table to double the size you think you need.
Whenever it reaches 70% full, double it again.  Better still,
run some experiments with representative patterns of your own
data to find best balance of speed versus space versus re-alloc
load.


Fri, 07 Sep 2001 03:00:00 GMT  
 String HashTables
Hi All:

    Thank you all for your prompt and helpful replies.  I am working on
getting a web page up with some code of various types, I'll give a pointer
as soon as it's ready.

                    -Z. Johnson



Sat, 08 Sep 2001 03:00:00 GMT  
 String HashTables

Quote:

> Hi All:

>     I am currently interested in the problem of a very large dynamic
> string hashtable.  That is, a data object that performs storage,
> retrieval, removal, and hashing lookup on a large, variable number of
> variable length strings (together with garbage collection, etc.).  I've
> been looking around for code to handle this sort of thing, but have been
> unsuccessful.
> ...
>     My second question is on the memory allocation issue.  One approach
> is to let the OS handle it (specifically, I'm using StonyBrook on
> WIN32),
> ...
> Is handling memory "locally" going to be worth the effort of
> writing the code?
>                     -Z. Johnson

Hi,

I had started developing a generic (but not, unfortunately, ISO-generic)
hash table module
using StonyBrook about a year ago. I have not been able to spend the
time on it I wanted,
though I have been using it sucessfully. It may be too "C"y for people
on this list,
since abandons static checking, using ADDRESS to pass items for the hash
table. It can be
pretty efficient, however.

It does define some standard element types that don't require the user
to supply
either a Hash or a Compare function. These include null-delimited
strings and
"dynamic" null-delimited strings. Or you can use it with your own
RECORDs and
Hash and Compare functions.

For StonyBrook, the run-time support handles suballocation from bigger
heaps
requested from the OS, so I would think that would be ok. Check out the
DEF
and MOD for the SB module ExStorage - it gives you a good deal of
control
there.

You are welcome to the implementation code to use or just peruse.
I'd welcome feedback.

Tom Breeden

-------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------------

(*#######################*)
 DEFINITION MODULE HashT;   (* vx.x-082698 *)
(*#######################*)

FROM SYSTEM IMPORT ADDRESS, CAST;

TYPE  HashTable;

TYPE  HashFunc = PROCEDURE(ADDRESS, CARDINAL):CARDINAL;
TYPE  CmpFunc  = PROCEDURE(ADDRESS, ADDRESS):INTEGER;

CONST NilHashFunc = CAST(HashFunc, NIL);
      NilCmpFunc = CAST(CmpFunc, NIL);

TYPE  HashElemTypes = (Str0HT, IntHT, CardHT, LongIntHT, LongCardHT,
FloatHT,
                       DynStrHT, PasStrHT, OtherHT);
    (* "Card" here is 16 bits, "LongCard" is 32 *)

 PROCEDURE CreateHT(size:CARDINAL; elemtype:HashElemTypes;
                    hfunc:HashFunc; cfunc:CmpFunc; VAR
err:BOOLEAN):HashTable;
      (* size of 0 means use default size (eg, 1787) *)
      (* hfunc/cfunc of NilXXXFunc means use default function *)
      (* hfunc/cfunc cannot be default if elemtype is OtherHT *)

 PROCEDURE DestroyHT(VAR ht:HashTable);
      (* does not destroy data in the table *)

 PROCEDURE ClearHT(ht:HashTable);
      (* simply dangles any data in the table *)

 PROCEDURE AddHT(ht:HashTable; elem:ADDRESS):BOOLEAN;
 PROCEDURE RemoveHT(ht:HashTable; elem:ADDRESS):ADDRESS;
       (* if found (and removed), returns pointer to your record,
otherwise NIL *)

 PROCEDURE FindHT(ht:HashTable; SearchFor:ADDRESS):ADDRESS;

 PROCEDURE ProbeHT(ht:HashTable; SearchFor:ADDRESS):ADDRESS;
    (* If not found, add and return NIL, else return the hash table
element's addr *)

 PROCEDURE UpdateHT(ht:HashTable; SearchFor:ADDRESS):BOOLEAN;
    (* if found, replace the data by copying and return TRUE *)

 PROCEDURE ResetReadHT(ht:HashTable; ordered:BOOLEAN);
    (* must call at least once before sequential reading *)
 PROCEDURE ReadSeqHT(ht:HashTable):ADDRESS;
    (* at end, returns NIL *)
 PROCEDURE RefSeqHT(ht:HashTable):ADDRESS;

 (* for your convenience in constructing your own hash funcs on records
*)
 PROCEDURE Str0HashFunc(a:ADDRESS; siz:CARDINAL):CARDINAL;
 PROCEDURE DynStrHashFunc(a:ADDRESS; siz:CARDINAL):CARDINAL;
 PROCEDURE CardHashFunc(a:ADDRESS; siz:CARDINAL):CARDINAL;
 PROCEDURE LongCardHashFunc(a:ADDRESS; siz:CARDINAL):CARDINAL;

(*
NOTE:

   1. All searches are done via the key value in the record referenced
by
      the SearchFor pointer, NOT by the pointer itself. Operations done
as
      a result of the search hit are done on the record earlier inserted
into
      the hash table, NOT on the record passed via the SearchFor param.

      No moves or copies of your data are done. Thus, you can use the
supplied
      "standard" hash and compare functions for your RECORDs, so long as
the first
      field in the record is one of the "standard" types.

   2. Remove() and Destroy() do not deallocate your data, only the hash
table
      references and overhead,

VERSION HISTORY

   vx.x-070797   -first
            -082698   -Added HashTableStats()
*)

 PROCEDURE HashTableStats(ht:HashTable; VAR NumSlots, NumElems,
NumCollisions,
                          MaxCollisionLen:CARDINAL);

END HashT.

-------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------------

(*########################*)
 DEFINITION MODULE DynStr0;         (* v1.0-072096 for Benchmark
AmigaDOS *)
(*########################*)
(*
        Dynamically allocated string processing: Basic procedures.
*)

CONST MaxSiz = 32765;

TYPE DummyStr = POINTER TO ARRAY [0..MaxSiz] OF CHAR;   (* dummy index
*)

     DynStr = RECORD
                siz   :CARDINAL;
                len   :CARDINAL;
                str   :DummyStr;
              END;

PROCEDURE dStrCreate(VAR s:DynStr; size:CARDINAL);
   (* Create a new dynamic string *)

PROCEDURE dStrCreateCheck(VAR s:DynStr; size:CARDINAL; VAR
Success:BOOLEAN);
   (* Create a new dynamic string, checking for available memory first
*)

PROCEDURE dStrDispose(VAR s:DynStr);
   (* Dispose of dynamic string *)

PROCEDURE dStrAsgE(s1:DynStr; VAR s2:DynStr);
   (* Assign to an existing dynamic string *)

PROCEDURE dStrAsgC(s1:DynStr; VAR s2:DynStr);
   (* Create a dyn string and assign other string to it *)

PROCEDURE dStrInE(s1:ARRAY OF CHAR; VAR s2:DynStr);
   (* Assign to an existing DynStr from a "Str0" format string (literal)
*)

PROCEDURE dStrInC(s1:ARRAY OF CHAR; VAR s2:DynStr);
   (* Create a dyn string and assign a "Str0" string (or literal) to it
*)

PROCEDURE dStrOut(s1:DynStr; VAR s2:ARRAY OF CHAR);
   (* Assign to a "Str0" format string (literal) from a dynamic string
*)

PROCEDURE dStrLen(s:DynStr):CARDINAL;
   (* For compatibility with Str0 *)

(*
NOTE:
        0.The programmer is responsible for proper creation and disposal
of
          dynamic strings (via calls to dStrCreate and dStrDispose).

        1.The creation procs dStrAsgC and dStrInC always create the
smallest
          size dyn string that will hold the chars to be assigned to it.

          An existing string will not have its size increased if you
attempt
          to assign more chars to it than its size holds. The extra
chars
          are simply dropped.

          Use dStrCreate to make sure the dyn string has a specific
size.

        2.The compiler does not know the real size of a referenced
DynStr.
          ie, do not use SIZE(x.str^) or HIGH(.str^).

        3.The DynStr type is explicitly given here. You can reference
the
          .siz and .len fields as read-only please.

        4.For dStrCreate, if not enuf heap memory remains to create a
new
          dynamic string, then SystemTypes.StorageError will be raised
by
          the Heap module.

          For dStrCreateCheck, the Success parameter will be false.

        5.The .str field is guarenteed to be null delimited (if these
are
          used correctly). So it can be used as a Str0 string for
reference
          only. BE VERY CAREFUL ABOUT CALLING A NON-VAR DYNAMIC STRING
          PARAMETER WITH x.str^, SINCE COPYING 32767 BYTES ON THE STACK
          CAN BE FATAL.

          StonyBrook M2 will not copy non-var dynamic arrays if they
          are not "threatenen" with change. Check the warninging issued.
*)

END DynStr0.



Tue, 11 Sep 2001 03:00:00 GMT  
 
 [ 11 post ] 

 Relevant Pages 

1. string or int hashtables

2. another problem with hashtables

3. Hashtables in Lisp

4. weak hashtables in LW 4.2

5. Ada String Issue: String within Strings

6. string = string(i:j) // string(k:n)

7. BUG: Scientific Number to String/String to Number

8. Picture to String/String to Picture

9. string variable and string field

10. How to convert binary string to char string ?

11. CW strings (was CW1501 strings)

 

 
Powered by phpBB® Forum Software