
Little essay about the various methods and viewpoints of crunching
~ version June 1998 ~
by Joa - Part III

Courtesy of fravia's page of reverse engineering

Well, Joa continues his fundamental paper on crunching, this is part III, enjoy!

05 June 98  Joa  crunchi1.htm  Little essay about the various methods and viewpoints of crunching (part I)
10 June 98  Joa  crunchi2.htm  Little essay about the various methods and viewpoints of crunching (part II)
17 June 98  Joa  crunchi3.htm  Little essay about the various methods and viewpoints of crunching (part III)
17 June 98  Joa  crunchi4.htm  Little essay about the various methods and viewpoints of crunching (part IV)


Little essay about the various methods and viewpoints of crunching.

Part III: Adapting to the sources (Huffman revisited)



By Joa Koester (JoKo2000@HotMail.Com) 
This essay is free to be copied and read and spread as long as
you are decent enough to leave my name in it. 
If you have corrections, comments, better examples than the ones used 
in this essay, etc. - drop me a line.


OK, let's dive into the world of crunchers and munchers...

In the last chapter Dr. Watson stumbled over the problem of big
files with so many characters in them that the frequencies of all
256 possible chars are, statistically speaking, nearly the same.
Running a Huffman algorithm on such files leads to trees with
8-bit encodings for each char, thus giving us no crunching
efficiency.
But on the other hand, everyone with a hex editor in his
toolbox has noticed that files are
somehow separated into chunks.
Some areas seem to hold code, others appear as vast deserts of
zeros, and yet others have up- and down-counting values.
On the PC you have, for example, code and resource areas, which are
created separately and just linked together (hence tools like the Borland
Resource Workshop 4.5).
But how do we take advantage of this circumstance?
But how do we take advantage of this circumstance?

Well, the key to our problem is the tree. Once it is created,
we are stuck with it, unable to change it, only able to
encode our data with this static, inflexible tree, right?

Not quite! If we could somehow adjust our tree to the actual data,
we would be able to generate several trees at runtime, each
tree nearly optimized for the data currently coming up. But how do
we adjust our tree? And what about the decoder? And how do we
build up the first tree?

Yes, the first tree. What about it? We are about to be flexible, right?
Could we bear the pain of the first data not being perfectly encoded?
What if we could? What if we built up a tree with all
frequencies set to one? Each symbol would be written with 8 bits.
But the more we read from the file, the more we would adapt
to the data already read and the better we could cope with sudden changes in
the data flow.

So the basic idea would be:

Cruncher:
initialize_tree_with_all_frequencies_to_1
while not eof
{
    read char c
    encode c, Output
    frequencyarray[c]++
    adapt tree, c
}

Decruncher:
initialize_tree_with_all_frequencies_to_1
Mychar = 0
while Mychar != EOF
{
    Mychar = decode(Input) // decode bits until the next complete symbol
    put Mychar, Output
    frequencyarray[Mychar]++
    adapt tree, Mychar
}

This would work and would be stable. Unfortunately, even programmed
in Assembler we would have to wait too many hours to crunch a simple
file. The resorting of the tree after each byte read/written eats up
too many memory reads/writes. Imagine crunching a file like
Winword '95 (Ver 7.0) with its 3.845.120 bytes. Imagine a tree struct
taking about 20 bytes in memory. Multiply that by 257 = 5140 bytes,
not necessarily stored close together or array-like (so read-ahead
memory chips can't play out their strengths). Imagine these
5140 bytes getting sorted 3.845.120 times = 1.97639168e10 bytes
shifted through your memory just for Micro$oft
(assuming you had only one sorting step while building the tree)...


There must be a better way, right?

Yes, Watson, there is a better way. And it's not so difficult to implement
it for our purposes. We have to implement a watcher!

This watcher can be an intelligent algorithm or a not-so-intelligent counter
with a conditional rebuild. Let's examine the latter method first:

As long as normal files are not bigger than 4 GByte, we can use a simple
32-bit integer for our counter. Let's name it "Adjustcounter".

This counter will be incremented and tested afterwards:

Cruncher:
initialize_tree_with_all_frequencies_to_1
Adjustcounter = 0
while not eof
{
    read char c
    encode c, Output

    frequencyarray[c]++

    Adjustcounter = Adjustcounter + 1
    if Adjustcounter > 4096
        adapt tree
        set all of frequencyarray[] to 1
        Adjustcounter = 0
    end if
}

Decruncher:
initialize_tree_with_all_frequencies_to_1
Mychar = 0
Adjustcounter = 0
while Mychar != EOF
{
    Mychar = decode(Input) // decode bits until the next complete symbol
    put Mychar, Output

    frequencyarray[Mychar]++

    Adjustcounter = Adjustcounter + 1
    if Adjustcounter > 4096
        adapt tree
        set all of frequencyarray[] to 1
        Adjustcounter = 0
    end if
}

This will reduce our horror scenario from 1.97639168e10 bytes
down to 4.825.175 bytes swirled through our memory. Where I got the
4096 from, you ask? To be honest, I pulled it out of my hat. There is
no special reason for it - just a page size I encountered a lot,
that's all. If you experiment with this algorithm, try it with
other sizes. Keep track of the kinds of data you crunch and
which re-adjusting size gets you which results.
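To make the counter idea a bit more concrete, here is a minimal C sketch of the crunching loop. The helpers build_tree() and encode_symbol() are hypothetical placeholders for whatever Huffman tree code you plug in; only the counter logic is the point here:

#include <stdio.h>

#define SYMBOLS      257     /* 256 byte values plus one EOF symbol        */
#define REBUILD_STEP 4096    /* re-adjust the tree every 4096 symbols      */

/* hypothetical helpers - stand-ins for your own Huffman tree code         */
extern void build_tree(const unsigned long *freq);  /* (re)build the tree  */
extern void encode_symbol(int sym, FILE *out);      /* write sym's code    */

void crunch(FILE *in, FILE *out)
{
    unsigned long freq[SYMBOLS];
    unsigned long adjustcounter = 0;
    int c, i;

    for (i = 0; i < SYMBOLS; i++)    /* initialize all frequencies to 1    */
        freq[i] = 1;
    build_tree(freq);

    while ((c = fgetc(in)) != EOF) {
        encode_symbol(c, out);       /* encode with the CURRENT tree       */
        freq[c]++;                   /* then update the statistics         */

        if (++adjustcounter > REBUILD_STEP) {
            build_tree(freq);        /* adapt the tree to the new data     */
            for (i = 0; i < SYMBOLS; i++)
                freq[i] = 1;         /* and reset the statistics           */
            adjustcounter = 0;
        }
    }
    encode_symbol(SYMBOLS - 1, out); /* symbol 256 signals EOF to the decruncher */
}

The decruncher mirrors this loop exactly: decode a symbol with the current tree, increment freq[], and rebuild/reset at the very same points, so both sides stay in sync without ever transmitting the tree.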


The other way would be an intelligent algorithm watching the
input stream and giving the alarm when the data represented in the tree
is outdated by the incoming new data. What could such an algorithm
look like? Well, there are a lot of possibilities. One basic
mechanism would be:
- Have the 'best' few codes of our tree ready.
- Check the incoming byte.
- Update a second statistic.
- Compare it with the 'best' codes of our tree:
  if the second statistic no longer has any 'best' bytes
  in common with the tree's 'best' bytes, then give the alarm and adapt.

Let's examine the steps:
- Have the 'best' few codes of our tree ready.
That's easy: we just iterate through our tree and keep track
of the best, say, 10 codes in a little array.
As we execute this routine only when we (re)build our tree,
it can be written a little more sloppily
(always remember: about 80% of the time is spent
in only 20% of the code. So don't optimize where
it is not appropriate).
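As a little C sketch of that bookkeeping (the names collect_best, freq and best are mine, and 257 covers the 256 byte values plus an EOF symbol), it could look like this:

#define BEST_N   10
#define SYMBOLS  257

/* Collect the BEST_N most frequent symbols into best[]. The most
   frequent symbols carry the shortest Huffman codes, so these are
   the 'best' codes of our tree. A naive selection is perfectly fine
   here, because this routine only runs when the tree is (re)built.   */
void collect_best(const unsigned long *freq, int best[BEST_N])
{
    unsigned char taken[SYMBOLS] = { 0 };
    int n, i;

    for (n = 0; n < BEST_N; n++) {
        int champion = -1;
        for (i = 0; i < SYMBOLS; i++)
            if (!taken[i] && (champion < 0 || freq[i] > freq[champion]))
                champion = i;
        best[n] = champion;          /* n-th best symbol    */
        taken[champion] = 1;         /* don't pick it again */
    }
}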

- Check the incoming byte
Well, that's easy, too. After we receive the byte from
the I/O routine, we have it ready for dinner.

- Update a second statistic
Hm, that's not so easy. I would advise a shuffle algorithm.
For this we initialize a 256-byte statistic array with
the corresponding char values (array[0] = 0x00; array[1] = 0x01, etc.).
When the first byte is read, we locate it
in this array WITH A LOOP!!!
Then we shuffle all bytes from the found index
down to index 0 one position up and insert the byte at array[0].
Here is an example (array shortened to 4 entries):
array[0] = 0x00
array[1] = 0x01
array[2] = 0x02
array[3] = 0x03

incoming data: 0x02. Now we look for the byte in our array,
fetching the INDEX. The char 0x02 is found at position
2 in the array. Now we shuffle all bytes below it one position
up:
array[0] = 0x00  >- keeps its value
array[1] = 0x00  >- gets overwritten
array[2] = 0x01  >- gets overwritten
array[3] = 0x03

Then we patch array-pos[0] with the char we wanted to find (0x02):
array[0] = 0x02  >- patched
array[1] = 0x00
array[2] = 0x01
array[3] = 0x03


Next byte could be a 0x02 again.
We would find it at position 0. We would shuffle all bytes from
position 0 till position 0 up by one (that is, in this case
the loop terminates immediately, as the condition is flagged
before the body of the loop could be executed) and we patch
position 0 with the byte we wanted to find (the same byte).
If you implement this algorithm, this is a point where a
little if-clause could help save CPU time.


Next byte could be a 0x03.
We would find it at position 3. All bytes are shuffled one
position up, from 3 down to 0:
array[0] = 0x02  >- keeps its value
array[1] = 0x02  >- gets overwritten
array[2] = 0x00  >- gets overwritten
array[3] = 0x01  >- gets overwritten

Array-Pos 0 is patched with the byte we wanted to find (0x03):
array[0] = 0x03  >- patched
array[1] = 0x02
array[2] = 0x00
array[3] = 0x01


Next byte could be a 0x02 again.
We would find it at position 1. All bytes are shuffled one
position up, from 1 down to 0:
array[0] = 0x03  >- keeps its value
array[1] = 0x03  >- gets overwritten
array[2] = 0x00  
array[3] = 0x01  

Array-Pos 0 is patched with the byte we wanted to find (0x02):
array[0] = 0x02  >- patched
array[1] = 0x03
array[2] = 0x00
array[3] = 0x01

etc.
This algorithm guarantees that the most recently appeared bytes
are in the first places of the array. With this we have our
second statistic updated. Note that there is a lot of counting
and memory reading going on here, which could easily lead to a
new horror scenario. An intelligent MemCpy that takes the
adjacent bytes of the array into account, copying first in
longs, then in bytes, would be appropriate here.
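Here is how that shuffle-to-front update might look in C. The standard memmove() already copies in efficient chunks on most libraries, so it can stand in for the 'intelligent MemCpy' mentioned above (the names mtf_init and mtf_update are my own):

#include <string.h>                  /* memmove              */

static unsigned char mtf[256];       /* the second statistic */

/* initialize: mtf[0] = 0x00, mtf[1] = 0x01, ..., mtf[255] = 0xFF */
void mtf_init(void)
{
    int i;
    for (i = 0; i < 256; i++)
        mtf[i] = (unsigned char)i;
}

/* Move the incoming byte c to the front of the array and return the
   position where it was found. Recently seen bytes therefore gather
   at the low indices.                                                */
int mtf_update(unsigned char c)
{
    int pos = 0;

    while (mtf[pos] != c)            /* locate the byte - with a loop!  */
        pos++;

    if (pos != 0) {                  /* the little if-clause that saves */
                                     /* CPU time when c is already in front */
        memmove(&mtf[1], &mtf[0], (size_t)pos); /* shuffle pos bytes one up */
        mtf[0] = c;                  /* and patch position 0            */
    }
    return pos;
}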


- Compare it with the 'best' codes of our tree:
  if the second statistic no longer has any 'best' bytes
  in common with the tree's 'best' bytes, then give the alarm and adapt.
Now, that's easy again, after having updated our statistic
so eloquently. We just compare the first, say, 10 entries of
our second statistic with the best 10 tree entries. If there is no
hit any more, we rebuild the tree anew and reset the tree statistic.
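A sketch of that comparison in C (again, the names are mine; mtf[] is the shuffle array from above and best[] holds the 10 best symbols of the current tree):

/* Return 1 if none of the 10 most recent bytes (the front of the
   shuffle array) appears among the tree's 10 best symbols - the tree
   is then outdated and should be rebuilt.                             */
int tree_outdated(const unsigned char *mtf, const int *best)
{
    int i, j;

    for (i = 0; i < 10; i++)
        for (j = 0; j < 10; j++)
            if (mtf[i] == (unsigned char)best[j])
                return 0;            /* still a hit - keep the tree */

    return 1;                        /* no hit at all - alarm, adapt! */
}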

If we calculate what amount of bytes we read/write:

3.845.120 times adding one to a frequency array  =     3.845.120 +

3.845.120 times searching in the
shuffle array = (3.845.120 * 32). 32 because
the most recent bytes WILL be in the front of
the array and therefore be found faster.         =   123.043.840 +

3.845.120 times shuffling the
shuffle array = (3.845.120 * 32*31).             = 3.814.359.040 +

3.845.120 times patching the first entry         =     3.845.120 +

3.845.120 times executing a 10*10 loop
looking for one topmost byte in the
other topmost array (statistically leaving the
loop halfway, so only 50 steps instead of
100).                                            =   192.256.000 +

(plus the resorting of the arrays and the
rebuilding of the tree; for the sake of the
demonstration this is not calculated here)

Makes                                            = 4.137.349.120

That is not quite 4 gigabytes of memory reading and writing here. Is this
worth the effort? You decide. I prefer the counter method, but you'll
notice that the shuffle-to-front algorithm can be useful for more than
this kind of work. Indeed, we'll meet it again when we come to the
Burrows-Wheeler compression.


I have a question.

Yes, Watson?

This adaptive technique - is it only used in combination with the
Huffman algorithm?

No, the principle of making a statistical algorithm more flexible
through the adaptive approach is not limited to Huffman. When we come
to the arithmetic crunching technique we'll come back to the
adaptive approach, believe me.



But the next things we will discuss are some non-statistical ones.

Till then...happy cracking :-)



I hope you enjoyed this chapter. 
There will be more - I promise.
Just have a little patience.

Greetings from a programmer

Joa

P.S.
fravia+ enhanced my crunchi2.htm part with some links to sources. If you have a C/C++ compiler - study them. Use the debugger wisely. Try the sources with EXTREMELY simple files. Don't use more than 5-6 chars in your test files. That's more than enough to let the algorithms do their work. Use runs of bytes ('aaaaaaaa') and sequences (abcd....abcd....abc...cd...). It's astounding how a simple file (aaaaaaaabbbbccdd) can reveal the idea of an algorithm.
I will continue without referring to those examples, but if you find some sources on the net - fetch them. Analyse them. If you can't figure out the mechanism - wait till I come to the point of explaining that algorithm :-)
It's worth noting that my implementations of those algorithms differ a lot from most 'historical'/'educational' versions downloadable from the net. There is no 'right' implementation of a 'data compressing' algorithm. If it works and you have your own ideas - go for it.


