Sometimes called recurrence coding, This is one of the simplest
data compression algorithms but is effective for data sets which are
comprised of long sequences of a single repeated character. For
instance, text files with large runs of spaces or tabs may compress
well with this algorithm. Old versions of the arc compression
program used run-length encoding.
The way RLE works is by finding runs of repeated characters in the
input stream and replacing them with a three-byte code. The code
consists of a flag character, a count byte, and the repeated
character. For instance, the string ``AAAAAABBBBCCCCC'' could be more
efficiently represented as ``*A6*B4*C5''. That saves us six bytes.
Of course, since it does not make sense to represent runs less than
three characters in length with a code, none is used. Thus
``AAAAAABBCCCDDDD'' might be represented as ``*A6BBCCC*D4''. The flag
byte is called a sentinel byte.
One problem with this approach lies in the selection of the sentinel
value. This flag signals to the decompression code that an encoded
sequence follows. Ideally we would like to select a byte value that
does not occur in the input stream. However, running through the
input stream looking for an absent value slows down this algorithm.
Since the main advantage of this algorithm is its speed, slowing it
down for only slightly better compression results is not usually
considered. Often an arbitrary, non-letter ASCII value is chosen as
the sentinel. When compressing a stream with natural occurrences of
the sentinel flag,
the flag byte must be represented. Sometimes a three-byte code is
used with a count of less than three. Another aproach is to use a
doubled flag byte represent one naturally occurring flag byte in the
input. Thus, ``AAAAAA*BBBBCC*D'' would encode to ``*A6***B4CC**D''.
The two stars after ``A6'' and the final two stars denote a natural
star in the input stream.
It is wasteful to only encode run counts with ordinal numbers from
zero to nine. Instead we can use the value of the byte to denote the
number of characters in the run. In this way we can get a range of
zero to 255. However, since we are never encoding any run with less
than four characters, the range becomes four to 259. Runs comprised
of more than 259 characters in them will be broken up into two or more
three-byte flag-character-count sequences.
RLE is used as one of the steps in the JPEG image compression process.
Basically the JPEG algorithm breaks an image up into a series of 8x8
matrixes and proceeds to run a ``DCT'' transformation on each matrix.
This transformation isolates the important image components in the
upper left portion of the matrix. The lossy step of the JPEG process
happens when values far from the upper-left corner are rounded.
Unless these values are of very high magnitude they tend to become
zeros. However, the further away from the upper left corner of the
matrix, the less likely bytes are to have a high magniture so there
ends up being a lot of zeros in the transformed matrix. To maximize
the length of zero runs, JPEG scans the matrix in a diagonal pattern
and then uses RLE to encode multiple zeros in a row efficiently. This
is a terribly high-level explaination of how JPEG works and I would
recommend reading Nelson and Gailly's chapter on lossy graphic
compression for more information.
|