0.10.1 Run-Length Encoding

Search
Next: 0.10.1.1 Source Code Up: 0.10 Data Compression Algorithms Previous: 0.10 Data Compression Algorithms

0.10.1 Run-Length Encoding

Sometimes called recurrence coding, This is one of the simplest data compression algorithms but is effective for data sets which are comprised of long sequences of a single repeated character. For instance, text files with large runs of spaces or tabs may compress well with this algorithm. Old versions of the arc compression program used run-length encoding.

The way RLE works is by finding runs of repeated characters in the input stream and replacing them with a three-byte code. The code consists of a flag character, a count byte, and the repeated character. For instance, the string ``AAAAAABBBBCCCCC'' could be more efficiently represented as ``*A6*B4*C5''. That saves us six bytes. Of course, since it does not make sense to represent runs less than three characters in length with a code, none is used. Thus ``AAAAAABBCCCDDDD'' might be represented as ``*A6BBCCC*D4''. The flag byte is called a sentinel byte.

One problem with this approach lies in the selection of the sentinel value. This flag signals to the decompression code that an encoded sequence follows. Ideally we would like to select a byte value that does not occur in the input stream. However, running through the input stream looking for an absent value slows down this algorithm. Since the main advantage of this algorithm is its speed, slowing it down for only slightly better compression results is not usually considered. Often an arbitrary, non-letter ASCII value is chosen as the sentinel. When compressing a stream with natural occurrences of the sentinel flag, the flag byte must be represented. Sometimes a three-byte code is used with a count of less than three. Another aproach is to use a doubled flag byte represent one naturally occurring flag byte in the input. Thus, ``AAAAAA*BBBBCC*D'' would encode to ``*A6***B4CC**D''. The two stars after ``A6'' and the final two stars denote a natural star in the input stream.

It is wasteful to only encode run counts with ordinal numbers from zero to nine. Instead we can use the value of the byte to denote the number of characters in the run. In this way we can get a range of zero to 255. However, since we are never encoding any run with less than four characters, the range becomes four to 259. Runs comprised of more than 259 characters in them will be broken up into two or more three-byte flag-character-count sequences.

RLE is used as one of the steps in the JPEG image compression process. Basically the JPEG algorithm breaks an image up into a series of 8x8 matrixes and proceeds to run a ``DCT'' transformation on each matrix. This transformation isolates the important image components in the upper left portion of the matrix. The lossy step of the JPEG process happens when values far from the upper-left corner are rounded. Unless these values are of very high magnitude they tend to become zeros. However, the further away from the upper left corner of the matrix, the less likely bytes are to have a high magniture so there ends up being a lot of zeros in the transformed matrix. To maximize the length of zero runs, JPEG scans the matrix in a diagonal pattern and then uses RLE to encode multiple zeros in a row efficiently. This is a terribly high-level explaination of how JPEG works and I would recommend reading Nelson and Gailly's chapter on lossy graphic compression for more information.

Scott Gasch
1999-07-09