0.2.7.1 Selecting a Hash Function

Search
Next: 0.2.7.2 Dealing with Collisions Up: 0.2.7 Hashing Previous: 0.2.7 Hashing

0.2.7.1 Selecting a Hash Function

Choosing a good hash function is of the utmost importance. An uniform hash function is one that equally distributes data items over the whole hash table data structure. If the hash function is poorly chosen data items may tend to clump in one area of the hash table and many collisions will ensue. A non-uniform dispersal pattern and a high collision rate cause an overall data structure performance degradation. There are several strategies for maximizing the uniformity of the hash function and thereby maximizing the efficiency of the hash table.

One method, called the division method , operates by dividing a data item's key value by the total size of the hash table and using the remainder of the division as the hash function return value. This method has the advantage of being very simple to compute and very easy to understand.

Selecting an appropriate hash table size is an important factor in determining the efficiency of the division method. If you choose to use this method, avoid hash table sizes that simply return a subset of the data item's key as the hash value. For instance, a table one-hundred items large will result put key value 12345 at location forty-five, which is undesirable. Further, an even data item key should not always map to an even hash value (and, likewise, odd key values should not always produce odd hash values). A good rule of thumb in selecting your hash table size for use with a division method hash function is to pick a prime number that is not close to any power of two (2, 4, 8, 16, 32...).

int hash_function(data_item item) 
{
  return item.key % hash_table_size;
}

Sometimes it is inconvenient to have the hash table size be prime. In certain cases only a hash table size which is a power of two will work. A simple way of dealing with table sizes which are powers of two is to use the following formula to computer a key: k = (x mod p) mod m. In the above expression x is the data item key, p is a prime number, and m is the hash table size. Choosing p to be much larger than m improves the uniformity of this key selection process.

Yet another hash function computation method, called the multiplication method, can be used with hash tables with a size that is a power of two. The data item's key is multiplied by a constant, k and then bit-shifted to compute the hash function return value.

A good choice for the constant, k is N * (sqrt(5) - 1) / 2 where N is the size of the hash table.

The product key * k is then bitwise shifted right to determine the final hash value. The number of right shifts should be equal to the log₂ N subtracted from the number of bits in a data item key. For instance, for a 1024 position table (or 2¹⁰) and a 16-bit data item key, you should shift the product key * k right six (or 16 - 10) places.

int hash_function(data_item item) 
{
  extern int constant;
  extern int shifts;

  return (int)((constant * item.key) >> shifts);
}

Note that the above method is only effective when all data item keys are of the same, fixed size (in bits). To hash non-fixed length data item keys another method is variable string addition so named because it is often used to hash variable length strings. A table size of 256 is used. The hash function works by first summing the ASCII value of each character in the variable length strings. Next, to determine the hash value of a given string, this sum is divided by 256. The remainder of this division will be in the range of 0 to 255 and becomes the item's hash value.

int hash_function (char *str) 
{
  int total = 0;

  while (*str) {
    total += *str++;
  }
  return (total % 256);
}

Yet another method for hashing non fixed-length data is called compression function and discussed in the one-way hashing section.

Scott Gasch
1999-07-09