Well, Joa continues his fundamental paper on crunching, this is part III, enjoy!
Little essay about the various methods and viewpoints of crunching.

Part III: Adapting to the sources (Huffman revisited)

By Joa Koester (JoKo2000@HotMail.Com)

This essay is free to be copied and read and spread as long as you are decent enough to leave my name in it. If you have corrections, comments, better examples than the ones used in this essay, etc. - drop me a line.

OK, let's dive into the world of crunchers and munchers...

In the last chapter Dr. Watson stumbled over the problem of big files with so many characters in them that the frequencies of all 256 possible chars are - statistically speaking - nearly the same. Running a Huffman algorithm on such files leads to trees with an 8-bit encoding for each char - thus giving us no crunching efficiency. But on the other hand, everyone with a hex editor in his toolbox has noticed that files are somehow separated (chunked). Some areas seem to hold code, others appear as vast deserts of zeros, and yet others have up- and down-counting values. On the PC you have, for example, code and resource areas, which are created separately and just linked together (hence tools like the Borland Resource Workshop 4.5).

But how do we take advantage of this circumstance? Well, the key to our problem is the tree. Once it is created, we are doomed with it, unable to change it, only able to encode our data with this static, inflexible tree, right? Not quite! If we could adjust our tree to the actual data somehow, we would be able to generate several trees at runtime, each tree nearly optimized for the data currently coming up.

But how do we adjust our tree? And what about the decoder? And how do we build up the first tree? Yes, the first tree. What about it? We are about to be flexible, right? Could we take the pain of the first data not being perfectly encoded? What if we could? What if we built up a tree with all frequencies set to one? Each symbol would be written with 8 bits. But the more we read from the file, the more we would adapt to the data read and the better we could cope with sudden changes in the data flow. So the basic idea would be:

Cruncher:
    initialize_tree_with_all_frequencies_to_1
    while not eof
    {
        read char c
        encode c, Output
        frequencyarray[c]++
        adapt tree, c
    }

Decruncher:
    initialize_tree_with_all_frequencies_to_1
    Mychar = 0
    while Mychar != EOF
    {
        Mychar = decode(Input)   // read bits until the next symbol is complete
        put Mychar, Output
        frequencyarray[Mychar]++
        adapt tree, Mychar
    }

This would work and would be stable. Unfortunately, even programmed in Assembler we would have to wait too many hours to crunch a simple file. The resorting of the tree after each byte read/written eats up too many memory reads/writes. Imagine crunching a file like Winword '95 (Ver 7.0) with its 3.845.120 bytes. Imagine a tree struct taking about 20 bytes in memory. Multiply that by 257 = 5140 bytes, not necessarily stored close together or array-like (so read-ahead memory chips can't play out their abilities). Imagine these 5140 bytes getting sorted 3.845.120 times = 1.97639168e10 bytes shifted thru your memory just for Micro$oft (assuming you would have only one sorting step while building the tree)...

There must be a better way, right? Yes, Watson, there is a better way. And it's not so difficult to implement it for our purposes. We have to implement a watcher! This watcher can be an intelligent algorithm or a not-so-intelligent counter with a conditional rebuild.
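To see the basic idea in C, here is a minimal sketch of the naive adaptive encoder loop. The helper names build_tree_from_frequencies and huffman_encode_symbol are my own placeholders for the tree handling discussed in the earlier parts, not something prescribed by this essay; the decruncher would mirror this loop with decode/put instead of read/encode.

    /* Minimal sketch (my own, not from the essay) of the naive adaptive
     * encoder loop. The two helpers below are hypothetical placeholders
     * standing in for the Huffman tree handling of the earlier parts. */
    #include <stdio.h>

    #define SYMBOLS 256

    void build_tree_from_frequencies(const unsigned long freq[SYMBOLS]); /* hypothetical */
    void huffman_encode_symbol(int symbol, FILE *out);                   /* hypothetical */

    void crunch_adaptive_naive(FILE *in, FILE *out)
    {
        unsigned long freq[SYMBOLS];
        int c;

        for (int i = 0; i < SYMBOLS; i++)
            freq[i] = 1;                        /* all frequencies start at one  */
        build_tree_from_frequencies(freq);      /* first tree: 8 bits per symbol */

        while ((c = fgetc(in)) != EOF) {
            huffman_encode_symbol(c, out);      /* encode with the current tree  */
            freq[c]++;                          /* update the statistic          */
            build_tree_from_frequencies(freq);  /* adapt after EVERY symbol -    */
        }                                       /* this is what makes it so slow */
    }

The point to notice is the last call inside the loop: the tree gets rebuilt for every single byte, which is exactly the horror scenario calculated above.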
Let's examine the latter method first: As long as normal files are not bigger than 4 GByte, we can use a simple 32-bit integer for our counter. Let's name it "Adjustcounter". This counter will be incremented and tested afterwards:

Cruncher:
    initialize_tree_with_all_frequencies_to_1
    Adjustcounter = 0
    while not eof
    {
        read char c
        encode c, Output
        frequencyarray[c]++
        Adjustcounter = Adjustcounter + 1
        if Adjustcounter > 4096
            adapt tree
            set all of frequencyarray[] to 1
            Adjustcounter = 0
        end if
    }

Decruncher:
    initialize_tree_with_all_frequencies_to_1
    Mychar = 0
    Adjustcounter = 0
    while Mychar != EOF
    {
        Mychar = decode(Input)   // read bits until the next symbol is complete
        put Mychar, Output
        frequencyarray[Mychar]++
        Adjustcounter = Adjustcounter + 1
        if Adjustcounter > 4096
            adapt tree
            set all of frequencyarray[] to 1
            Adjustcounter = 0
        end if
    }

This will reduce our horror scenario from 1.97639168e10 bytes down to about 4.825.175 bytes swirled thru our memory. Where I got the 4096 from, you ask? To be honest, I pulled it out of my hat. There is no special reason for it - just a page size I encountered a lot, that's all. If you experiment with this algorithm, try it with other sizes. Keep track of the kinds of data you crunch and which re-adjusting size gets you which results.

The other way would be an intelligent algorithm watching the input stream and giving an alarm when the data rendered in the tree is outdated by the incoming new data. What could such an algorithm look like? Well, there are a lot of possibilities. One basic mechanism would be:

- Have the 'best' few codes of our tree ready.
- Check the incoming byte.
- Actualize a second statistic.
- Compare it with the 'best' codes of our tree. If the second statistic has no more 'best' bytes in common with the tree's 'best' bytes, then alarm and adapt.

Let's examine the steps:

- Have the 'best' few codes of our tree ready.
That's easy: We just iterate thru our tree and keep track of the best, say, 10 codes in a little array. As we execute this routine only when we (re)build our tree, it can be written a little bit more sloppily (always remember: about 80% of the time is spent in only 20% of the code, so don't optimize where it is not appropriate).

- Check the incoming byte.
Well, that's easy, too. After we receive the byte from the I/O routine, we have it ready for dinner.

- Actualize a second statistic.
Hm, that's not so easy. I would advise a shuffle algorithm (a small C sketch of it follows at the end of this chapter). For this we initialize a 256-byte statistic array with the accompanying char values (array[0] = 0x00; array[1] = 0x01 etc.). When a byte is read, we locate it in this array WITH A LOOP!!! Then we shuffle all bytes from the found index down to index 0 one position up and insert the byte at array[0]. Here is one example (array shortened to 4 entries):

array[0] = 0x00
array[1] = 0x01
array[2] = 0x02
array[3] = 0x03

Incoming data: 0x02. Now we look for the byte in our array, fetching the INDEX. The char 0x02 is found at position 2 in the array. Now we shuffle all bytes below it one position up:

array[0] = 0x00  >- keeps its value
array[1] = 0x00  >- gets overwritten
array[2] = 0x01  >- gets overwritten
array[3] = 0x03

Then we patch array position 0 with the char we wanted to find (0x02):

array[0] = 0x02  >- patched
array[1] = 0x00
array[2] = 0x01
array[3] = 0x03

The next byte could be a 0x02 again. We would find it at position 0.
We would shuffle all bytes from position 0 till position 0 up by one (that is, in this case the loop terminates immediately, as the condition fails before the body of the loop can be executed) and we patch position 0 with the byte we wanted to find (the same byte). If you implement this algorithm, this is a point where a little if-clause could help save CPU time.

The next byte could be a 0x03. We would find it at position 3. All bytes are shuffled one position up from 3 till 0:

array[0] = 0x02  >- keeps its value
array[1] = 0x02  >- gets overwritten
array[2] = 0x00  >- gets overwritten
array[3] = 0x01  >- gets overwritten

Array position 0 is patched with the byte we wanted to find (0x03):

array[0] = 0x03  >- patched
array[1] = 0x02
array[2] = 0x00
array[3] = 0x01

The next byte could be a 0x02 again. We would find it at position 1. All bytes are shuffled one position up from 1 till 0:

array[0] = 0x03  >- keeps its value
array[1] = 0x03  >- gets overwritten
array[2] = 0x00
array[3] = 0x01

Array position 0 is patched with the byte we wanted to find (0x02):

array[0] = 0x02  >- patched
array[1] = 0x03
array[2] = 0x00
array[3] = 0x01

etc. This algorithm guarantees that the most recently appeared bytes sit in the first places of the array. With this we have our second statistic actualized. Note that there is a lot of counting and memory reading going on here - this could easily lead to a new horror scenario. An intelligent MemCpy, taking the alignment of the array into account and copying first in longs, then in bytes, would be appropriate here.

- Compare it with the 'best' codes of our tree. If the second statistic has no more 'best' bytes in common with the tree's 'best' bytes, then alarm and adapt.
Now, that's easy again, after having our statistic actualized so eloquently. We just compare each of the, say, 10 first entries of our second statistic with the best 10 tree entries. If there is no hit anymore, we rebuild the tree anew and reset the tree statistic.

If we calculate what amount of bytes we read/write:

  3.845.120 times adding one to a frequency array
      = 3.845.120
+ 3.845.120 times searching in the shuffle array, times 32 (32 because the most recent bytes WILL be in the front of the array and therefore be found faster)
      = 3.845.120 * 32 = 123.043.840
+ 3.845.120 times shuffling the shuffle array, times 32*31
      = 3.845.120 * 32*31 = 3.814.359.040
+ 3.845.120 times patching the first entry
      = 3.845.120
+ 3.845.120 times executing a 10*10 loop looking for one topmost byte in the other topmost array (statistically leaving the loop halfway through, so only 50 steps instead of 100)
      = 3.845.120 * 50 = 192.256.000
(plus the resorting of the arrays and the rebuilding of the tree; for the sake of the demonstration this is not calculated here)

Makes = 4.137.349.120

That is not quite 4 Gigabytes of memory reading and writing. Is this worth the effort? You decide. I prefer the counter method, but you'll notice that the shuffle-to-front algorithm can be useful for more than this kind of work. Indeed we'll meet it again when we come to the Burrows-Wheeler compression.

I have a question. Yes, Watson? This adaptive technique - is it only used in combination with the Huffman algorithm? No, the principle of making a statistical algorithm more flexible thru the adaptive approach is not limited to Huffman. When we come to the arithmetic crunching technique we'll come back to the adaptive approach again, believe me. But the next things we will discuss are some non-statistical ones.
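Before we part, here is the promised small C sketch of the shuffle-to-front statistic and the "no common best bytes" test described above. It is a minimal illustration under my own naming (mtf_update, tree_is_outdated, TOP); the essay does not prescribe these names, and the tree_best[] array is assumed to be filled with the 10 best symbols whenever the tree is (re)built.

    /* Sketch of the shuffle-to-front "second statistic" described above.
     * mtf[] always keeps the most recently seen byte values at the front. */
    #include <string.h>

    #define SYMBOLS 256
    #define TOP     10

    static unsigned char mtf[SYMBOLS];    /* the second statistic                */

    /* fill the array with its own indices: mtf[0]=0x00, mtf[1]=0x01, ...        */
    void mtf_init(void)
    {
        for (int i = 0; i < SYMBOLS; i++)
            mtf[i] = (unsigned char)i;
    }

    /* shuffle-to-front update for one incoming byte                             */
    void mtf_update(unsigned char c)
    {
        int pos = 0;
        while (mtf[pos] != c)             /* linear search WITH A LOOP - recent  */
            pos++;                        /* bytes sit near the front            */
        if (pos > 0) {                    /* the little if-clause that saves     */
            memmove(&mtf[1], &mtf[0], (size_t)pos);  /* work when pos == 0       */
            mtf[0] = c;                   /* patch the found byte in at front    */
        }
    }

    /* returns 1 if none of the TOP entries of the second statistic appears      */
    /* among the TOP 'best' symbols of the current tree -> alarm and adapt       */
    int tree_is_outdated(const unsigned char tree_best[TOP])
    {
        for (int i = 0; i < TOP; i++)
            for (int j = 0; j < TOP; j++)
                if (mtf[i] == tree_best[j])
                    return 0;             /* still at least one hit: keep tree   */
        return 1;                         /* no hit left: rebuild the tree       */
    }

In the main loop you would call mtf_init() once, mtf_update(c) for every byte read, and tree_is_outdated() whenever you want to check whether a rebuild is due.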
Till then... happy cracking :-)

I hope you enjoyed this chapter. There will be more - I promise. Just have a little patience.

Greetings from a programmer

Joa

P.S.