Choosing a good hash function is of the utmost importance. An uniform hash function
is one that equally distributes data items over the whole hash table
data structure. If the hash function is poorly chosen data items may
tend to clump
in one area of the hash table and many collisions will ensue. A
non-uniform dispersal pattern and a high collision rate cause an
overall data structure performance degradation. There are several
strategies for maximizing the uniformity of the hash function and
thereby maximizing the efficiency of the hash table.
One method, called the division method
, operates by dividing a data item's key value by the total size of
the hash table and using the remainder of the division as the hash
function return value. This method has the advantage of being very
simple to compute and very easy to understand.
Selecting an appropriate hash table size is an important factor in
determining the efficiency of the division method. If you choose to
use this method, avoid hash table sizes that simply return a subset of
the data item's key as the hash value. For instance, a table
one-hundred items large will result put key value 12345 at location
forty-five, which is undesirable. Further, an even data item key
should not always map to an even hash value (and, likewise, odd key
values should not always produce odd hash values). A good rule of
thumb in selecting your hash table size for use with a division method
hash function is to pick a prime number that is not close to any power
of two (2, 4, 8, 16, 32...).
int hash_function(data_item item)
{
return item.key % hash_table_size;
}
Sometimes it is inconvenient to have the hash table size be prime. In
certain cases only a hash table size which is a power of two will work.
A simple way of dealing with table sizes which are powers of two is
to use the following formula to computer a key:
k = (x mod p) mod m.
In the above expression x is the data item key, p is a prime
number, and m is the hash table size. Choosing p to be much
larger than m improves the uniformity of this key selection process.
Yet another hash function computation method, called the multiplication method, can be used with hash tables with a size that
is a power of two. The data item's key is multiplied by a constant,
k and then bit-shifted to compute the hash function return value.
A good choice for the constant, k is
N * (sqrt(5) - 1) / 2 where
N is the size of the hash table.
The product key * k is then bitwise shifted right to determine the
final hash value. The number of right shifts should be equal to the
log2 N subtracted from the number of bits in a data item key. For
instance, for a 1024 position table (or 210) and a 16-bit data
item key, you should shift the product key * k right six (or 16 -
10) places.
int hash_function(data_item item)
{
extern int constant;
extern int shifts;
return (int)((constant * item.key) >> shifts);
}
Note that the above method is only effective when all data item keys
are of the same, fixed size (in bits). To hash non-fixed length data
item keys another method is variable string addition
so named because it is often used to hash variable length strings. A
table size of 256 is used. The hash function works by first summing
the ASCII value of each character in the variable length strings.
Next, to determine the hash value of a given string, this sum is
divided by 256. The remainder of this division will be in the range
of 0 to 255 and becomes the item's hash value.
int hash_function (char *str)
{
int total = 0;
while (*str) {
total += *str++;
}
return (total % 256);
}
Yet another method for hashing non fixed-length data is called compression function and discussed in the one-way hashing section.
|