0.10.5 Sliding Window Compression
Recall that RLE (run-length encoding) is a compression algorithm that
represents long runs of a single character more efficiently. Sliding
window compression is an algorithm that represents frequently
recurring groups of characters more efficiently. For instance,
``can count countville count the countably infinite count'' repeats
the string ``count'' five times. RLE would not compress it at all,
since the text contains no runs of repeated characters. But with
sliding window compression, each repetition of ``count'' would be
represented with a small code.
Sliding window compression is a dictionary-based compression
algorithm based on the work of Lempel and Ziv in 1977.
Dictionary-based compression algorithms maintain a group of strings
from the input stream as the encoding process executes. This group of
strings is called a dictionary and can be used to abbreviate recurring
patterns in the text. If the algorithm spots a string in the input
stream that it has seen already and has stored as part of the
dictionary, the string can be represented in a more efficient manner.
Usually a dictionary reference is denoted with a two-byte code that
encodes the entry's number in the dictionary and the length of the
match. During the decompression process the routine creates and
maintains a dictionary with the same rules as were used in the
compression process, so entry n in the dictionary is the same string
at compression and decompression time.
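As a minimal sketch of this invariant (the function and variable names
here are illustrative, not from any real codec), both sides grow their
dictionaries under identical rules, so entry n always agrees:

```python
# Sketch: compressor and decompressor append phrases under the same
# rule, so any entry number refers to the same string on both sides.

def add_phrase(dictionary, phrase, max_entries=4096):
    """Add a phrase if the dictionary is not yet full."""
    if len(dictionary) < max_entries:
        dictionary.append(phrase)

enc_dict, dec_dict = [], []
for phrase in ["count", "invit", "eyeball"]:
    add_phrase(enc_dict, phrase)   # during compression
    add_phrase(dec_dict, phrase)   # during decompression

assert enc_dict == dec_dict        # entry n means the same on both sides
```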
Before implementing a sliding window compression scheme the size
(number of possible entries) of the dictionary must be fixed. More
dictionary entries mean a greater chance that a repeated string will
be found in the dictionary text and therefore compressed. However,
more dictionary entries also lead to longer dictionary codes. Typical
sliding window algorithms use a dictionary size of
2^12 = 4096 entries. The dictionary code is, therefore, always a 12-bit number.
Since some words found in the input stream will have no exact matches
in the dictionary but will be prefixes of dictionary words, most
sliding window algorithms choose to add a length code to the
dictionary code. For instance, imagine ``eyeball'' is in the
dictionary at position 255. The word ``eye'' is not in the dictionary
at all. The word ``eye'' now appears in the input text. If we use
twelve bits to represent the dictionary entry for ``eyeball''
(000011111111) and use four bits to say ``use only the first three
letters of entry 255'' (0011) then we can represent ``eye'' as two
bytes instead of three (00001111 11110011). The choice of twelve-bit
entry codes and four-bit length codes makes the amount of data needed
to represent any dictionary reference two bytes. These choices also limit
the number of dictionary entries to 2^12 = 4096 and the maximum length
of a dictionary string to 2^4 = 16 bytes. Since, however, we will
never encode a one- or two-byte sequence, the sixteen length values
can be remapped to cover lengths 3 through 18, which kicks the size
limit of dictionary strings up to 16 + 2 = 18 bytes.
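The two-byte layout above can be sketched with a little bit-packing.
This is a hedged illustration (the function names are made up for this
example), placing the 12-bit entry number in the high bits and the
4-bit length in the low bits:

```python
# Pack a 12-bit dictionary entry number and a 4-bit length into the
# two-byte code described in the text.

def pack_code(entry, length):
    """entry in 0..4095, length in 0..15 -> two bytes."""
    assert 0 <= entry < 4096 and 0 <= length < 16
    value = (entry << 4) | length        # high 12 bits: entry, low 4: length
    return bytes([value >> 8, value & 0xFF])

def unpack_code(two_bytes):
    """Two bytes -> (entry, length)."""
    value = (two_bytes[0] << 8) | two_bytes[1]
    return value >> 4, value & 0xF

# "use only the first three letters of entry 255":
code = pack_code(255, 3)
print(format(code[0], "08b"), format(code[1], "08b"))  # → 00001111 11110011
assert unpack_code(code) == (255, 3)
```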
You may be wondering how we can distinguish encoded dictionary
pointers from raw (unencoded) text. The answer is to use flag
bytes. Before every sequence of eight bytes in the compressed text we
write a flag byte. This byte contains no data from the input
stream - rather it is used to tell the decompressor which bytes in
the following eight are raw text and which contain dictionary
compression codes. For instance, imagine we have the string ``invite
him!'' in the input stream. The ``invit'' is encoded using a
dictionary code of 00001111 11110101 - use the first five bytes of
entry 255, which was ``invitation''. The rest of the text cannot be
found in the dictionary. It is added to the dictionary for future
processing but is encoded as raw ``e him!''. We would write the
following eight bytes to the output stream:
(00001111 11110101) = dictionary code for ``invit'', (01100101) = ``e'',
(00100000) = space, (01101000) = ``h'', (01101001) = ``i'',
(01101101) = ``m'', (00100001) = ``!''. However, before these eight
bytes we would write the flag byte 00111111. The 0's tell the
decoder that the first two of the next eight bytes are dictionary
code bytes whereas the next six are raw text. The use of flag bytes, of
course, decreases the efficiency of the compression.
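The flag-byte bookkeeping can be sketched as follows. This is an
illustration under the per-byte scheme described above (the function
name is hypothetical): each bit of the flag byte marks whether the
corresponding compressed byte is raw text (1) or part of a dictionary
code (0):

```python
# Write one flag byte followed by up to eight compressed bytes.

def flag_group(marked_bytes):
    """marked_bytes: list of (byte, is_raw) pairs, at most eight."""
    assert len(marked_bytes) <= 8
    flag = 0
    body = bytearray()
    for i, (b, is_raw) in enumerate(marked_bytes):
        if is_raw:
            flag |= 1 << (7 - i)   # 1 bit = raw byte, 0 bit = code byte
        body.append(b)
    return bytes([flag]) + bytes(body)

# the two code bytes for "invit", then the six raw bytes of "e him!":
group = [(0x0F, False), (0xF5, False)] + [(b, True) for b in b"e him!"]
out = flag_group(group)
print(format(out[0], "08b"))       # → 00111111
```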
Given this information we can determine the best and worst case
compression rates of sliding window compression. Since we are using
flag bytes, if no compression is possible we will encode nine bytes in
the output for every eight bytes of input. Therefore, the worst case
for this algorithm increases the file size to 112.5 percent of the
original. If every character sequence is represented with a
dictionary entry of maximum length (18 bytes) then each 18-byte
sequence will be represented by 16 bits plus its 2 bits in the flag
byte (one flag bit per compressed byte) - 18 bits in all. That means
the best possible compression will see an output file that is 18/144,
or 12.5 percent, the size of the original.
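A quick arithmetic check of these bounds; note that the best-case
figure here assumes the per-byte flag scheme described earlier, where
an 18-byte match costs 16 code bits plus two flag bits:

```python
# Worst case: one flag byte per eight literal bytes.
worst = 9 / 8
print(f"worst case: {worst:.1%}")   # → worst case: 112.5%

# Best case: every 18-byte run becomes a 16-bit code plus two flag
# bits, i.e. 18 bits in place of 18 * 8 = 144 bits.
best = 18 / (18 * 8)
print(f"best case: {best:.1%}")     # → best case: 12.5%
```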
It is inefficient to search the whole previous window of text for
matches. The way most implementations of sliding window algorithms
handle this is by keeping the dictionary in a data structure such as a
binary search tree. That way it is easy to search for newly read text
in the dictionary - just traverse the binary tree. Remember that new
text might only partially match text in the dictionary, though. We
want the node on the root-to-leaf traversal that best matches the new
text. It is also possible that the new text will exactly match a node
in the tree. In this case the node must be replaced with the new data
and its offset. Likewise, as text falls out of the window it is
important to delete the corresponding node. Deletion in a binary
search tree is a little tricky, so check the BST section for a full
explanation. Studies have shown that it is not a good idea to use a
more complicated data structure such as a red-black tree or an AVL
tree. While these data structures will be better balanced than a BST,
in most cases, the extra time spent keeping them balanced slows down
the compression. Due to the dynamic nature of the tree, caused by
phrases being continually added to and deleted from the dictionary,
highly unbalanced trees should quickly correct themselves.
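For contrast with the tree-based approach, here is a deliberately
simple brute-force longest-match search over the window (all names
are illustrative). Real implementations use the BST described above
precisely to avoid this linear scan:

```python
# Find the longest match for the lookahead buffer anywhere in the
# window, respecting the minimum (3) and maximum (18) match lengths
# from the text. Returns (position, length) or None.

def longest_match(window, lookahead, min_len=3, max_len=18):
    best_pos, best_len = -1, 0
    limit = min(max_len, len(lookahead))
    for pos in range(len(window)):
        length = 0
        while (length < limit and pos + length < len(window)
               and window[pos + length] == lookahead[length]):
            length += 1
        if length > best_len:
            best_pos, best_len = pos, length
    if best_len >= min_len:
        return best_pos, best_len
    return None

print(longest_match(b"the invitation arrived", b"invite him!"))  # → (4, 5)
```

This scan is O(window size × match length) per input position, which
is exactly the cost the binary search tree brings down to roughly the
tree's height per lookup.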