0.10.5 Sliding Window Compression
Recall that RLE (run-length encoding) is a compression algorithm that
represents long runs of a single character more efficiently. Sliding
window compression is an algorithm that represents frequently
recurring groups of characters more efficiently. For instance,
``can count countville count the countably infinite count'' repeats
the string ``count'' five times. RLE would not compress it at all,
since the text contains no runs of repeated characters. But with
sliding window compression, each repetition of ``count'' would be
represented with a small code.
Sliding window compression is a dictionary-based compression
algorithm based on the work of Lempel and Ziv in 1977.
Dictionary-based compression algorithms maintain a group of strings
from the input stream as the encoding process executes. This group of
strings is called a dictionary and can be used to abbreviate recurring
patterns in the text. If the algorithm spots a string in the input
stream that it has seen already and has stored as part of the
dictionary, the string can be represented in a more efficient manner.
Usually a dictionary reference is denoted with a two-byte code that
encodes the entry's number in the dictionary and the length of the
match. During the decompression process the routine creates and
maintains a dictionary with the same rules as were used in the
compression process, so entry n in the dictionary is the same string
at compression and decompression time.
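As a minimal sketch of this invariant (the function and variable names
here are illustrative, not from any real codec), both sides grow their
dictionaries under identical rules, so entry n always agrees:

```python
# Sketch: compressor and decompressor append phrases under the same
# rule, so any entry number refers to the same string on both sides.

def add_phrase(dictionary, phrase, max_entries=4096):
    """Add a phrase if the dictionary is not yet full."""
    if len(dictionary) < max_entries:
        dictionary.append(phrase)

enc_dict, dec_dict = [], []
for phrase in ["count", "invit", "eyeball"]:
    add_phrase(enc_dict, phrase)   # during compression
    add_phrase(dec_dict, phrase)   # during decompression

assert enc_dict == dec_dict        # entry n means the same on both sides
```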
Before implementing a sliding window compression scheme the size
(number of possible entries) of the dictionary must be fixed. More
dictionary entries mean a greater chance that a repeated string will
be found in the dictionary text and therefore compressed. However,
more dictionary entries also lead to longer dictionary codes. Typical
sliding window algorithms use a dictionary size of
2^12 = 4096 entries. The dictionary code is, therefore, always a 12-bit number.
Since some words found in the input stream will have no exact matches
in the dictionary but will be prefixes of dictionary words, most
sliding window algorithms choose to add a length code to the
dictionary code. For instance, imagine ``eyeball'' is in the
dictionary at position 255. The word ``eye'' is not in the dictionary
at all. The word ``eye'' now appears in the input text. If we use
twelve bits to represent the dictionary entry for ``eyeball''
(000011111111) and use four bits to say ``use only the first three
letters of entry 255'' (0011) then we can represent ``eye'' as two
bytes instead of three (00001111 11110011). The choice of twelve-bit
entry codes and four-bit length codes makes the amount of data needed
to represent any dictionary reference two bytes. These choices also limit
the number of dictionary entries to 2^12 = 4096 and the maximum length
of a dictionary string to 2^4 = 16 bytes. Since, however, we will
never encode a one- or two-byte sequence, the sixteen length values
can be remapped to cover lengths 3 through 18, which kicks the size
limit of dictionary strings up to 16 + 2 = 18 bytes.
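The two-byte layout above can be sketched with a little bit-packing.
This is a hedged illustration (the function names are made up for this
example), placing the 12-bit entry number in the high bits and the
4-bit length in the low bits:

```python
# Pack a 12-bit dictionary entry number and a 4-bit length into the
# two-byte code described in the text.

def pack_code(entry, length):
    """entry in 0..4095, length in 0..15 -> two bytes."""
    assert 0 <= entry < 4096 and 0 <= length < 16
    value = (entry << 4) | length        # high 12 bits: entry, low 4: length
    return bytes([value >> 8, value & 0xFF])

def unpack_code(two_bytes):
    """Two bytes -> (entry, length)."""
    value = (two_bytes[0] << 8) | two_bytes[1]
    return value >> 4, value & 0xF

# "use only the first three letters of entry 255":
code = pack_code(255, 3)
print(format(code[0], "08b"), format(code[1], "08b"))  # → 00001111 11110011
assert unpack_code(code) == (255, 3)
```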
You may be wondering how we can distinguish encoded dictionary
pointers from raw (unencoded) text. The answer is to use flag
bytes. Before every sequence of eight bytes in the compressed text we
write a flag byte. This byte contains no data from the input
stream - rather it is used to tell the decompressor which bytes in
the following eight are raw text and which contain dictionary
compression codes. For instance, imagine we have the string ``invite
him!'' in the input stream. The ``invit'' is encoded using a
dictionary code of 00001111 11110101 - use the first five bytes of
entry 255, which was ``invitation''. The rest of the text cannot be
found in the dictionary. It is added to the dictionary for future
processing but is encoded as raw ``e him!''. We would write the
following eight bytes to the output stream:
(00001111 11110101) = dictionary code for ``invit'', (01100101) = ``e'',
(00100000) = space, (01101000) = ``h'', (01101001) = ``i'',
(01101101) = ``m'', (00100001) = ``!''. However, before these eight
bytes we would write the flag byte 00111111. The 0's tell the
decoder that the first two of the next eight bytes are dictionary
code bytes whereas the next six are raw text. The use of flag bytes, of
course, decreases the efficiency of the compression.
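The flag-byte bookkeeping can be sketched as follows. This is an
illustration under the per-byte scheme described above (the function
name is hypothetical): each bit of the flag byte marks whether the
corresponding compressed byte is raw text (1) or part of a dictionary
code (0):

```python
# Write one flag byte followed by up to eight compressed bytes.

def flag_group(marked_bytes):
    """marked_bytes: list of (byte, is_raw) pairs, at most eight."""
    assert len(marked_bytes) <= 8
    flag = 0
    body = bytearray()
    for i, (b, is_raw) in enumerate(marked_bytes):
        if is_raw:
            flag |= 1 << (7 - i)   # 1 bit = raw byte, 0 bit = code byte
        body.append(b)
    return bytes([flag]) + bytes(body)

# the two code bytes for "invit", then the six raw bytes of "e him!":
group = [(0x0F, False), (0xF5, False)] + [(b, True) for b in b"e him!"]
out = flag_group(group)
print(format(out[0], "08b"))       # → 00111111
```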
Given this information we can determine the best and worst case
compression rates of sliding window compression. Since we are using
flag bytes, if no compression is possible we will encode nine bytes in
the output for every eight bytes of input. Therefore, the worst case
for this algorithm increases the file size to 112.5 percent of the
original. If every character sequence is represented with a
dictionary entry of maximum length (18 bytes) then each 18-byte
sequence will be represented by 16 bits plus its 2 bits in the flag
byte (one flag bit per compressed byte) - 18 bits in all. That means
the best possible compression will see an output file that is 18/144,
or 12.5 percent, the size of the original.
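A quick arithmetic check of these bounds; note that the best-case
figure here assumes the per-byte flag scheme described earlier, where
an 18-byte match costs 16 code bits plus two flag bits:

```python
# Worst case: one flag byte per eight literal bytes.
worst = 9 / 8
print(f"worst case: {worst:.1%}")   # → worst case: 112.5%

# Best case: every 18-byte run becomes a 16-bit code plus two flag
# bits, i.e. 18 bits in place of 18 * 8 = 144 bits.
best = 18 / (18 * 8)
print(f"best case: {best:.1%}")     # → best case: 12.5%
```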
It is inefficient to search the whole previous window of text for
matches. The way most implementations of sliding window algorithms
handle this is by keeping the dictionary in a data structure such as a
binary search tree. That way it is easy to search for newly read text
in the dictionary - just traverse the binary tree. Remember that new
text might only partially match text in the dictionary, though. We
want the node on the root-to-leaf traversal that best matches the new
text. It is also possible that the new text will exactly match a node
in the tree. In this case the node must be replaced with the new data
and its offset. Likewise, as text falls out of the window it is
important to delete the corresponding node. Deletion in a binary
search tree is a little tricky, so check the BST section for a full
explanation. Studies have shown that it is not a good idea to use a
more complicated data structure such as a red-black tree or an AVL
tree. While these data structures will be better balanced than a BST,
in most cases, the extra time spent keeping them balanced slows down
the compression. Due to the dynamic nature of the tree, caused by
phrases being continually added to and deleted from the dictionary,
highly unbalanced trees should quickly correct themselves.
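For contrast with the tree-based approach, here is a deliberately
simple brute-force longest-match search over the window (all names
are illustrative). Real implementations use the BST described above
precisely to avoid this linear scan:

```python
# Find the longest match for the lookahead buffer anywhere in the
# window, respecting the minimum (3) and maximum (18) match lengths
# from the text. Returns (position, length) or None.

def longest_match(window, lookahead, min_len=3, max_len=18):
    best_pos, best_len = -1, 0
    limit = min(max_len, len(lookahead))
    for pos in range(len(window)):
        length = 0
        while (length < limit and pos + length < len(window)
               and window[pos + length] == lookahead[length]):
            length += 1
        if length > best_len:
            best_pos, best_len = pos, length
    if best_len >= min_len:
        return best_pos, best_len
    return None

print(longest_match(b"the invitation arrived", b"invite him!"))  # → (4, 5)
```

This scan is O(window size × match length) per input position, which
is exactly the cost the binary search tree brings down to roughly the
tree's height per lookup.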