0.10.3 Huffman Compression
Huffman compression was first proposed in a paper by D. A. Huffman in
1952. Huffman coding is used in the pack compression program, and a
variant called dynamic Huffman coding is used (along with LZ) in
freeze.
Before compressing data, a Huffman encoder first analyzes the input
stream, tallying character frequency data. That is, for each distinct
character in the input (up to 256 character byte values) a count is
stored.
void count_bytes(FILE *pfInput, unsigned long *puCounts)
{
    long lInputMarker;
    int c;
    //
    // Preconditions
    //
    ASSERT(pfInput);
    ASSERT(puCounts);
    //
    // Remember where the file pointer is now.
    //
    lInputMarker = ftell(pfInput);
    ASSERT(lInputMarker >= 0);
    //
    // Tally the characters from here to EOF.
    //
    while ((c = getc(pfInput)) != EOF)
        puCounts[c]++;
    //
    // Put the pointer back where it was.
    //
    fseek(pfInput, lInputMarker, SEEK_SET);
    ASSERT(ftell(pfInput) == lInputMarker);
}
In an effort to keep counts to a reasonable size, the character
frequency values are often scaled. For instance, it is often
desirable to have counts fit in one byte of data. So, by searching
for the highest count value h, assigning it the value 0xFF, and
scaling the rest of the counts down by the same factor of h / 0xFF,
all count values fit in one byte. In the source code I refer to
scaled counts as "weights."
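A scaling pass along these lines might look like the following
sketch. The scale_counts name and the fixed 256-entry tables are
illustrative assumptions rather than the section's actual routine;
dividing each count by (h / 0xFF) + 1, instead of multiplying, keeps
the arithmetic safe from overflow for very large counts.
void scale_counts(unsigned long *puCounts, unsigned char *puWeights)
{
    unsigned long ulMax, ulScale;
    int i;
    ASSERT(puCounts);
    ASSERT(puWeights);
    //
    // Find the highest count value, h.
    //
    ulMax = 0;
    for (i = 0; i < 256; i++)
        if (puCounts[i] > ulMax)
            ulMax = puCounts[i];
    //
    // Divide every count by (h / 0xFF) + 1; the +1 guarantees the
    // result fits in one byte with no risk of overflow.
    //
    ulScale = ulMax / 0xFF + 1;
    for (i = 0; i < 256; i++)
    {
        puWeights[i] = (unsigned char)(puCounts[i] / ulScale);
        //
        // A character that appears at all must keep a nonzero
        // weight, or it would vanish from the Huffman tree.
        //
        if (puWeights[i] == 0 && puCounts[i] != 0)
            puWeights[i] = 1;
    }
}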
Next, the Huffman algorithm builds a full binary tree which will be
used to determine the codewords used to compress frequently
encountered characters. To construct the tree, the routine first
creates n small, one-node trees (where n is the number of distinct
characters in the input stream). Each of these n trees represents a
distinct input character and has a weight corresponding to its count
tallied in the analysis step.
The tree building process begins by selecting two nodes from the field
of candidates. The two nodes selected are the ones with the lowest
weights. These nodes are then joined into one tree. Each node
becomes a leaf off of a newly created root node. The letter values
and weights of the two nodes remain the same. The weight of the root
node is set to the sum of the weights of its two leaves. This weight
is the new weight of this three-node tree.
This process continues: the tree-building loop selects the two trees
(each with anywhere from 1 to (n - 1) nodes) with the lowest weights,
joins them under a new root node, sets the root node's weight, and
places the new tree back into the pool. The process repeats until one
tree encompassing all the input weights has been constructed. If at
any point there is more than one way to choose the two trees of
smallest weight, the algorithm chooses arbitrarily. This final tree
is called a Huffman tree.
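A minimal sketch of this loop follows. The NODE layout, the
bHasParent flag, and the SENTINEL guard entry are assumptions made for
the sketch (the leaf-index convention, with character leaves at
indices 0 through END_OF_STREAM, matches the convert_tree_to_code
routine shown later); a straightforward linear search finds the two
lightest trees.
//
// Illustrative declarations; the real source presumably defines
// these elsewhere. Leaves occupy indices 0 through END_OF_STREAM,
// internal nodes are allocated above them, and SENTINEL is a guard
// entry carrying an impossibly large weight.
//
#define END_OF_STREAM 256
#define SENTINEL      513

typedef struct
{
    unsigned int uWeight;     // weight of the tree rooted here
    BOOL bHasParent;          // already joined into a larger tree?
    int iLeftChild;           // the "0" branch
    int iRightChild;          // the "1" branch
} NODE;

int build_tree(NODE *pnNodes)
{
    int i, iNext, iMin1, iMin2;
    ASSERT(pnNodes);
    pnNodes[SENTINEL].uWeight = 0xFFFF;
    for (iNext = END_OF_STREAM + 1; ; iNext++)
    {
        //
        // Select the two active trees with the lowest weights. A
        // tree is active if it has a nonzero weight and no parent.
        //
        iMin1 = SENTINEL;
        iMin2 = SENTINEL;
        for (i = 0; i < iNext; i++)
            if (pnNodes[i].uWeight != 0 && !pnNodes[i].bHasParent)
            {
                if (pnNodes[i].uWeight < pnNodes[iMin1].uWeight)
                {
                    iMin2 = iMin1;
                    iMin1 = i;
                }
                else if (pnNodes[i].uWeight < pnNodes[iMin2].uWeight)
                    iMin2 = i;
            }
        //
        // Only one active tree left: it is the Huffman tree's root.
        //
        if (iMin2 == SENTINEL)
            return iMin1;
        //
        // Join the two lightest trees under a new root whose weight
        // is the sum of its children's weights.
        //
        pnNodes[iNext].uWeight = pnNodes[iMin1].uWeight
                               + pnNodes[iMin2].uWeight;
        pnNodes[iNext].iLeftChild = iMin1;
        pnNodes[iNext].iRightChild = iMin2;
        pnNodes[iMin1].bHasParent = TRUE;
        pnNodes[iMin2].bHasParent = TRUE;
    }
}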
Once the Huffman tree has been constructed each letter can be assigned
a codeword. The unique codeword for a given letter is computed by
traversing the Huffman tree from the root node down to the leaf
containing the letter and its weight. Because the Huffman tree is a
binary tree, each node has, at most, two children. In the path from
the root to a given leaf node, anytime the algorithm traverses to a
left child a "0" is added to the codeword while the choice of a right
child appends a "1" to the codeword. Because letters occupy only the
leaves of the Huffman tree, no codeword produced will ever be the
prefix of another. Also, letters appearing frequently in the input
stream will be represented by short codewords such as 01, 11, or 001,
while those that appear infrequently will be represented by longer
codewords.
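As a small illustration (not taken from the program's data), suppose
the input contains only the letters A, B, C, and D with weights 3, 2,
1, and 1. The algorithm first joins C and D (combined weight 2), then
joins that tree with B (combined weight 4), and finally joins the
result with A. Reading the root-to-leaf paths, one possible
assignment is A = 0, B = 10, C = 110, and D = 111. Encoding the whole
input then takes 13 bits instead of the 14 a fixed two-bit code would
need, and no codeword is a prefix of another.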
Once a mapping has been created the encoding process can begin.
During this operation the input stream is again scanned character by
character. For each character read, the equivalent codeword is
computed and appended to the output stream. In practice, a mapping
between all characters and their codewords is usually computed before
any encoding takes place. This can be accomplished recursively:
void convert_tree_to_code(NODE *pnNodes, CODE *pcCodes,
                          unsigned int iCodeSoFar, int iBits, int iNode)
{
    ASSERT(pnNodes);
    ASSERT(pcCodes);
    //
    // If this is a leaf node we are done recursing; assign the code
    // and pop the stack.
    //
    if (iNode <= END_OF_STREAM)
    {
        ASSERT(iBits > 0);
        //
        // Code
        //
        pcCodes[iNode].uCode = iCodeSoFar;
        //
        // Length of code
        //
        pcCodes[iNode].iCodeBits = iBits;
        return;
    }
    //
    // Otherwise we are on an internal node and need to keep going.
    //
    iCodeSoFar <<= 1;
    ASSERT((iCodeSoFar & 1) == 0);
    //
    // One more bit is about to be added.
    //
    iBits++;
    //
    // When going left, append a zero to the code so far..
    //
    convert_tree_to_code(pnNodes, pcCodes, iCodeSoFar, iBits,
                         pnNodes[iNode].iLeftChild);
    //
    // When going right, append a one..
    //
    convert_tree_to_code(pnNodes, pcCodes, iCodeSoFar | 1, iBits,
                         pnNodes[iNode].iRightChild);
}
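The recursion would typically be kicked off at the root with a call
such as convert_tree_to_code(anNodes, acCodes, 0, 0, iRoot), where
iRoot is the root index returned by the tree-building step; these
names are illustrative.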
Because computing the codeword for each input character by walking
the tree is slow, pre-computing the table speeds up the algorithm.
Since frequently used characters tend to be represented in fewer than
8 bits, the compressed output file tends to be smaller than the input.
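With the codes table filled in, the encoding pass reduces to a table
lookup per input byte. Here is a minimal sketch; the BIT_FILE type
and the OutputBits routine stand in for whatever bit-packing layer
the real program uses, since codewords are not byte-aligned and must
be accumulated into whole output bytes.
void compress_data(FILE *pfInput, BIT_FILE *pbfOutput, CODE *pcCodes)
{
    int c;
    ASSERT(pfInput);
    ASSERT(pbfOutput);
    ASSERT(pcCodes);
    //
    // Emit each input character's codeword, most significant bit
    // first.
    //
    while ((c = getc(pfInput)) != EOF)
        OutputBits(pbfOutput, pcCodes[c].uCode, pcCodes[c].iCodeBits);
    //
    // Mark the end of the compressed stream so the decoder knows
    // when to stop.
    //
    OutputBits(pbfOutput, pcCodes[END_OF_STREAM].uCode,
               pcCodes[END_OF_STREAM].iCodeBits);
}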
In order to decode a Huffman compressed message the Huffman tree used
to generate the codewords must be available to the decoding process.
For this reason the encoding process often includes a representation
of the Huffman tree in its output. Because the message is usually long,
such overhead may be acceptable. However, an alternative is for
encoder and decoder to agree on a preset code based on the average
frequencies of the material to be transmitted. This will almost
certainly lead to a less than optimal compression rate.
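Once the decoder has the tree in hand, decoding is a walk down that
tree, one bit at a time. A minimal sketch follows, assuming the same
node-array convention as above and a hypothetical InputBit routine
that returns the next single bit of compressed input: starting at the
root, each bit selects a child until a leaf is reached, at which
point the leaf's character is emitted and the walk restarts.
void expand_data(BIT_FILE *pbfInput, FILE *pfOutput,
                 NODE *pnNodes, int iRoot)
{
    int iNode;
    ASSERT(pbfInput);
    ASSERT(pfOutput);
    ASSERT(pnNodes);
    for (;;)
    {
        //
        // Walk from the root, following one input bit per level:
        // 0 selects the left child, 1 the right child.
        //
        iNode = iRoot;
        while (iNode > END_OF_STREAM)
            iNode = InputBit(pbfInput) ? pnNodes[iNode].iRightChild
                                       : pnNodes[iNode].iLeftChild;
        if (iNode == END_OF_STREAM)
            break;
        putc(iNode, pfOutput);
    }
}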
Because the Huffman algorithm must make two passes over the input
stream, it is usually slower than alternative algorithms. Scanning
the input stream is an O(n) operation (where n is the input size).
This occurs twice: once to create the frequency tables and once to
encode the data.
The process of constructing a Huffman tree (which occurs between the
two passes listed above) has a complexity based on the number of
distinct characters in the input. To build one tree from c one-node
trees, the selection process must execute (c - 1) times. A naive
selection must search the entire pool for the two lightest trees,
which is an O(c) operation. However, pre-sorting the pool and
inserting aggregate trees at the right place in the pool speeds up
the selection process, as it need only take the first two trees in
the list: an O(1) process.
Each time two trees are joined, the node-creation routine must
execute in order to create the root node joining the two trees.
Therefore, it also executes (c - 1) times. Since a Huffman tree is a
full binary tree with c leaves, it contains exactly 2c - 1 nodes.
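For example, for a byte-oriented compressor in which all 256 byte
values appear in the input (plus an end-of-stream marker, so
c = 257), that means 256 join steps and a final tree of
2c - 1 = 513 nodes.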
In practice, Huffman encoding yields good compression results.
However, it does not do as good a job as sliding window routines
(based on LZ algorithms) for average input data. It also does not
execute as quickly as LZ routines.