No matter how good your hash function is, or how carefully you choose
the hash table size, sometimes data collisions are bound to occur.
Recall that a collision is when two distinct data items produce the
same hash value and, thus, want to be stored in the same table
location. For instance, suppose you have an item, A, already in your
hash table at location 144. A's hash function value, therefore, is
144. Another item, B, with a totally different key than A, is to be
added to the table. Suppose, however, that B's hash value is 144also. B cannot be stored at table location 144 because A is already
there... or can it?
One way of dealing with this situation is to make each location in the
hash table the head of a linked-list
data structure. If you find a collision has occurred, traverse down
the linked list at the hash value and add the new data item to the
list's tail (or head). Of course, when you search for an item in a
table using this insertion scheme you must be mindful of the fact that
such an item may not be at the head of the linked list residing in the
hash table at the hash location. This method is sometimes known as
open hashing
because multiple data items sharing the same hash value are stored
``outside'' the hash table.
The following methods all store collided data inside the hash table at
different locations than their computed hash value. This practice is
known as closed hashing.
The classic way to deal with collisions is to simply increment an
item's hash value by one until a finding an unoccupied hash table
location then store the item a location or two away from it's computed
hash value. This method is known as linear probing.
In querying a table employing such an insertion scheme you have to not
only check at the hash location of the item for which you are looking,
but, if it contains some item other than the one you seek, continue to
search in adjacent hash table locations until you either find your
goal item or encounter an empty table location. Linear probing, while
simple to understand and implement, leads to data clumping and is not
the ideal way of handling collisions.
Different spins on the concept of probing are known as quadratic
probing
and random probing.
In quadratic probing instead of simply moving one address down in the
hash table the number of spaces moved is somehow dependent on the
number of times that we have moved so far. For example, consider the
following hash value offset function:
o(i) = 2i - 1
The first collision adds one address to the base hash address, the
second three, the third seven, and so on. When quadratic probing is
employed, collisions tend not to produce the clusters
of full areas in the hash table which are linear probing's drawback.
Random probing uses the address of the collision as the seed for a
pseudo-random number generator and computes the next address to try
with a function which takes a random element into account. Both
quadratic and random probing are usually slower than linear probing
but produce more uniform hash data distributions.
One final way of dealing with collisions is called rehashing or
dualhashing.
If the hash address produced is a collision, the address is reused as
input to the hash function in order to compute a new address. This
process, as you might expect, complicates lookups and deletes.
|