Here is the second part of Joa's essay.
See also the first part (Introduction).
Joa's work is extremely clear and will guide you to understand these matters,
which for obvious reasons no real reverser should ever underestimate.
My advice: read, re-read and then - once you have understood the basics -
search the web and deepen your knowledge (and then contribute :-)
Little essay about the various methods and viewpoints of crunching.

Part II: Probability and Crunching (Huffman isn't the only way)

by Joa Koester (JoKo2000@HotMail.Com)

This essay is free to be copied, read and spread as long as you are decent enough to leave my name in it. If you have corrections, comments, better examples than the ones used in this essay, etc. - drop me a line.

OK, let's dive into the world of crunchers and munchers...

In the Intro I told you that crunching is a process of reducing the SPACE data needs to be stored in, but not its MEANING (the information transmitted in the symbols). The measure 'entropy' is used here much as in physics. The more entropy a symbol has, the LESS information is transmitted through it - less information compared to the whole of all symbols, that is. This of course means there is a lot of redundancy in it, which for us means a big 'crunch me, crunch me'. So we should concentrate on the symbols that have a very high entropy. How we calculate an entropy for our purposes depends on the data we want to crunch and on the algorithm we want to crunch it with.

I also told you that a certain student called Huffman once came up with an algorithm to compress data. His idea is based on the fact that in data (especially speech) some symbols occur more often than others. Here is one overstated example:

aaaaaaaaaabbbbbcccdd

What we have here is an algorithm based on probability. That means that some symbols are not transmitted the way they used to be, because the algorithm creates new symbols according to the probabilities of the old ones. Some of these new symbols will be shorter, and the less frequently appearing symbols will be transformed into longer symbols:

Before:
a = 61h = %0110 0001
b = 62h = %0110 0010
c = 63h = %0110 0011
d = 64h = %0110 0100

and after: ????

How do we compute such a new symbol table? OK, before we explore this algorithm let me change the data from ASCII to the lower nibble. That should be enough to explain the algorithm:

Before:
a = %0001
b = %0010
c = %0011
d = %0100

and we have the probabilities (counted occurrences):

a = 10
b =  5
c =  3
d =  2
    ====
     20

How do we get from these counters to new symbols? We build a tree out of them: take the two symbols with the lowest counts, hang them under a new node (a 'knot') whose count is the sum of the two, put that node back among the others and repeat until only one node - the root - is left. The path from the root down to a symbol (zero for one branch, one for the other) is the new code of that symbol. For our little example this yields:

a = %0
b = %10
c = %110
d = %111

(or with c and d swapped - both trees are equally good). The frequent 'a' now costs one bit instead of four, while the rare 'd' costs three - all in all 10*1 + 5*2 + 3*3 + 2*3 = 35 bits instead of 20*4 = 80 bits.

Decoding is simply walking down this tree bit by bit until we reach a leaf:

while Not_End_Of_Data()
    knot = root
    symbol_found = FALSE
    while symbol_found == FALSE
        bit = Get_Next_Bit()
        if ((bit == 0) and (knot->zero != NULL))
            knot = knot->zero
        else if (knot->one != NULL)
            knot = knot->one
        end if
        if (knot->one == NULL)   //is enough. knot->zero is NULL, too
            //symbol found. great. To the output with it
            symbol = knot->symbol
            symbol_found = TRUE
        end if
    end while
    Do_Output_Char(symbol)
end while

Done.
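If you want to see the whole round trip in real code, here is a small self-contained C sketch - not Joa's code, just my own illustration, with invented helper names like build_tree and print_codes - that builds exactly this tree for the counts a=10, b=5, c=3, d=2 and then decodes a few bits the same way the pseudocode above walks the tree:

/* A minimal Huffman sketch in C (not Joa's code): build the tree for the
   counts a=10, b=5, c=3, d=2, print the resulting bit codes and decode a
   short bit string by walking the tree, just like the pseudocode above. */

#include <stdio.h>
#include <stdlib.h>

struct Knot {                     /* same name as in the pseudocode */
    int count;
    int symbol;                   /* only meaningful for leaves */
    struct Knot *zero, *one;      /* children; both NULL for a leaf */
};

static struct Knot *new_knot(int count, int symbol,
                             struct Knot *zero, struct Knot *one)
{
    struct Knot *k = malloc(sizeof *k);   /* error handling omitted */
    k->count = count; k->symbol = symbol; k->zero = zero; k->one = one;
    return k;
}

/* Repeatedly merge the two nodes with the lowest counts until one root is left. */
static struct Knot *build_tree(struct Knot **pool, int n)
{
    while (n > 1) {
        int lo1 = 0, lo2 = 1, i;
        if (pool[lo2]->count < pool[lo1]->count) { lo1 = 1; lo2 = 0; }
        for (i = 2; i < n; i++) {
            if (pool[i]->count < pool[lo1]->count)      { lo2 = lo1; lo1 = i; }
            else if (pool[i]->count < pool[lo2]->count) { lo2 = i; }
        }
        pool[lo1] = new_knot(pool[lo1]->count + pool[lo2]->count, 0,
                             pool[lo1], pool[lo2]);   /* merged node */
        pool[lo2] = pool[n - 1];                      /* drop the second one */
        n--;
    }
    return pool[0];
}

/* Walk the tree and print the code of every leaf. */
static void print_codes(struct Knot *k, char *code, int depth)
{
    if (k->zero == NULL) {        /* is enough - zero and one are both NULL */
        code[depth] = '\0';
        printf("%c = %%%s\n", k->symbol, code);
        return;
    }
    code[depth] = '0'; print_codes(k->zero, code, depth + 1);
    code[depth] = '1'; print_codes(k->one,  code, depth + 1);
}

int main(void)
{
    struct Knot *pool[4];
    char code[16];
    const char *bits = "11100";   /* some encoded bits to decode below */
    const char *p = bits;

    pool[0] = new_knot(10, 'a', NULL, NULL);   /* the counts from above */
    pool[1] = new_knot( 5, 'b', NULL, NULL);
    pool[2] = new_knot( 3, 'c', NULL, NULL);
    pool[3] = new_knot( 2, 'd', NULL, NULL);

    struct Knot *root = build_tree(pool, 4);
    print_codes(root, code, 0);

    while (*p) {                                /* decode: walk root -> leaf */
        struct Knot *knot = root;
        while (knot->zero != NULL)
            knot = (*p++ == '0') ? knot->zero : knot->one;
        printf("decoded: %c\n", knot->symbol);
    }
    return 0;
}

Running it prints the code table (a gets 1 bit, b gets 2, c and d get 3 bits each) and then decodes the bit string "11100" into three characters by walking from the root down to a leaf for every symbol.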
Hm, I have another little problem...

What is it, Watson?

I did as you told me. I took a table of 256 longs, counted the 256 chars, sorted them and let the computer calculate an optimized Huffman tree. But all I got was a tree in which every leaf has a length of 8 - all chars are encoded with 8 bits and there is no crunching in that. What happened?

Ah, Dr. Watson, you did not remember my sentence about entropy. The file you examined had all byte values spread nearly equally over the whole length of the file. So - statistically speaking - the file is equalized. No surprises here. With a Huffman tree that considers the whole file you won't get any crunching here, I fear.

But what do I do now?

Well, there are three scenarios that will satisfy your desire:
- We use a different algorithm
- We use a brutal tree
- We use an adapting technique

??? Tell me about the brutal tree.

OK, I got this one from the 1001 Packer on the C64 (anyone ringing a bell? The 1001 Crew? No? Where have the old days gone...). They used a tree I can only label a 'brutal tree'. Here's the idea: when we have data that is so evenly distributed that no Huffman tree will help us, we help ourselves.

We take all the byte counters and sort them. Then we create a table of the 256 chars ordered by those sorted counters, so that the most frequent chars appear at the top. Now we have the 256 possible bytes sorted. So far so good.

Let's think about a byte. How do we read a byte? Isn't it:

#33 = $21 = %0010 0001 ?

What if we changed our view of the binary to a 2 + 6 split:

#33 = $21 = %00 100001

Our 256 chars would now be represented as 4 blocks (00, 01, 10, 11) of 64 chars each, addressed with 6 bits. What if we now 'huffman' those first two bits to:

0
10
110
111

so that we would send %0 100001 instead of %00 100001? Every byte that falls into the first block of 64 is coded with 7 bits, while the others are coded with either 8 or 9 bits. We can only hope that there are enough bytes from the first block in the file to outweigh the chars from the two blocks that now cost 9 bits. But it is amazing how well this little trick works. And the speed! It is so easy to implement and it is so fast. Even on the C64 it took only a fraction of a second to decode a program as big as the main memory area (other packers took several seconds for the same size).
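Again, this is not the 1001 Crew's original code - just a little C sketch of my own (the toy string and every helper name in it are made up) that counts the bytes, sorts the 256 byte values by frequency and adds up how many bits the 2 + 6 trick would need compared to plain 8-bit storage:

/* The 'brutal tree' as a C sketch (my own reconstruction, not the 1001 code):
   count the bytes, sort the 256 values by frequency, then add up the cost of
   a 0 / 10 / 110 / 111 prefix for each byte's 64-block plus 6 plain bits. */

#include <stdio.h>
#include <string.h>

static unsigned long count[256];
static unsigned char rank_of[256];   /* byte value -> position in sorted table */

static void build_rank_table(void)
{
    unsigned char order[256];
    int i, j;
    for (i = 0; i < 256; i++) order[i] = (unsigned char)i;
    /* simple selection sort, most frequent byte values first */
    for (i = 0; i < 256; i++)
        for (j = i + 1; j < 256; j++)
            if (count[order[j]] > count[order[i]]) {
                unsigned char t = order[i]; order[i] = order[j]; order[j] = t;
            }
    for (i = 0; i < 256; i++) rank_of[order[i]] = (unsigned char)i;
}

/* cost in output bits for one byte: 1, 2 or 3 prefix bits plus 6 plain bits */
static int bits_for(unsigned char c)
{
    static const int prefix_len[4] = { 1, 2, 3, 3 };   /* 0, 10, 110, 111 */
    return prefix_len[rank_of[c] >> 6] + 6;
}

int main(void)
{
    const unsigned char *text =
        (const unsigned char *)"aaaaaaaaaabbbbbcccdd";  /* toy data from above */
    size_t i, len = strlen((const char *)text);
    unsigned long total_bits = 0;

    for (i = 0; i < len; i++) count[text[i]]++;
    build_rank_table();
    for (i = 0; i < len; i++) total_bits += bits_for(text[i]);

    printf("%lu bytes -> %lu bits (plain storage: %lu bits)\n",
           (unsigned long)len, total_bits, (unsigned long)len * 8);
    return 0;
}

On the toy string every used byte lands in the first block, so each one costs 7 bits (140 instead of 160); on a real, evenly distributed file the byte values spread over all four blocks and the gain comes from the first block outweighing the two 9-bit ones. Decoding is just as mechanical: read one, two or three prefix bits to find the block, read six plain bits, look the result up in the sorted table - which is exactly why the scheme is so fast.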
Wow, cool. But what about your different algorithms and this 'adapting technique'?

Well, this is another story and will be told in the next chapter... I hope you enjoyed this chapter. There will be more - I promise. Just have a little patience.

Greetings from a programmer

Joa

In the meantime, after having read Joa's work above, you may enjoy a look at a huffer and an unhuffer, both written in C some time ago and donated to the world by Shaun Case... Once you are done studying them, maybe you'll want to have a look at sixpack as well, a more advanced (adaptive Huffman encoding) compressor, also written in C, by Philip Gage, for a data compression competition some time ago... Since this code includes SIX different algorithms, I believe it could be useful to readers as a 'passage' to the next part of Joa's essay (if Joa doesn't agree with me I'll kill these links of course).