In this chapter we will examine the Huffman coding and arithmetic coding methods of data compression and develop an implementation of the latter algorithm. The arithmetic coding algorithm allows a tradeoff between memory consumption and compression ratio; our emphasis will be on minimum memory consumption, on the assumption that the result would eventually be used as embedded code within a larger program.
Huffman Coding, Arithmetic Coding, Lookup Tables
Huffman coding is widely recognized as the most efficient method of encoding characters for data compression. This algorithm is a way of encoding different characters in different numbers of bits, with the most common characters encoded in the fewest bits. For example, suppose we have a message made up of the letters 'A', 'B', and 'C', which can occur in any combination. Figure huffman.freq shows the relative frequency of each of these letters and the Huffman code assigned to each one.
Letter frequencies and codes (Figure huffman.freq)

                     Huffman   Fixed-length
 Letter   Frequency    Code        Code
+--------+-----------+---------+--------------+
|   A    |    1/4    |   00    |      00      |
|   B    |    1/4    |   01    |      01      |
|   C    |    1/2    |   1     |      10      |
+--------+-----------+---------+--------------+
The codes are determined by the frequencies: as mentioned above, the letter with the greatest frequency, 'C', has the shortest code, of one bit. The other two letters, 'A' and 'B', have longer codes. On the other hand, the simplest code to represent any of three characters would use two bits for each character. How would the length of an encoded message be affected by using the Huffman code rather than the fixed-length one?
Let's encode the message "CABCCABC" using both codes. The results are shown in Figure huffman.fixed.
Encoding the message "CABCCABC" (Figure huffman.fixed)

          Huffman   Fixed-length
 Letter     Code        Code
+--------+---------+--------------+
|   C    |    1    |      10      |
|   A    |   00    |      00      |
|   B    |   01    |      01      |
|   C    |    1    |      10      |
|   C    |    1    |      10      |
|   A    |   00    |      00      |
|   B    |   01    |      01      |
|   C    |    1    |      10      |
+--------+---------+--------------+
 Total bits used:   12        16
Here we have saved one-fourth of the bits required to encode this message; often, the compression can be much greater. Since we ordinarily use an eight-bit ASCII code to represent characters, if one of those characters (such as carriage return or line feed) accounts for a large fraction of the characters in a file, giving it a short code of two or three bits can reduce the size of the file noticeably.
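The comparison above is easy to check mechanically. The following sketch encodes a message with the variable-length codes of Figure huffman.freq; the function name and the table representation are ours, not the book's code.

```cpp
#include <cassert>
#include <map>
#include <string>

// Encode a message using the variable-length codes of Figure huffman.freq.
std::string huffman_encode(const std::string& message) {
    const std::map<char, std::string> code = {
        {'A', "00"}, {'B', "01"}, {'C', "1"}};
    std::string out;
    for (char ch : message)
        out += code.at(ch);  // append this character's code
    return out;
}
```

Encoding "CABCCABC" this way produces 12 bits, versus 16 with the two-bit fixed-length code.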
Let's see how arithmetic coding1 would encode the first three characters of the same message, "CAB".
One-character messages (Figure aritha1)

                 Cum.                            Previous  Current  Output
 Message  Freq.  Freq.          Codes             Output   Output   So Far
+---------+-----+-----+-----------------------+---------+--------+--------+
|    A    | 16  | 16  | 000000( 0)-001111(15) |  None   |   00   |   00   |
|    B    | 16  | 32  | 010000(16)-011111(31) |  None   |   01   |   01   |
|  * C    | 32  | 64  | 100000(32)-111111(63) |  None   |   1    |   1    |
+---------+-----+-----+-----------------------+---------+--------+--------+

(The * marks the row our message "CAB" actually follows.)
Figure aritha1 is the first of several figures which contain the information needed to determine how arithmetic coding would encode messages of up to three characters from an alphabet consisting of the letters 'A', 'B', and 'C', with frequencies of 1/4, 1/4, and 1/2, respectively. The frequency of a message composed of three characters chosen independently is the product of the frequencies of those characters. Since the lowest common denominator of these three fractions is 4, the frequency of any three-character message will be a multiple of (1/4)^3, or 1/64. For example, the frequency of the message "CAB" will be (1/2)*(1/4)*(1/4), or 1/32 (=2/64). For this reason, we will express all of the frequency values in terms of 1/64ths.
Thus, the "Freq." column signifies the expected frequency of occurrence of each message, in units of 1/64; the "Cum. Freq." column accumulates the values in the first column; the "Codes" column indicates the range of codes that can represent each message2; the "Previous Output" column shows the bits that have been output before the current character was encoded; the "Current Output" column indicates what output we can produce at this point in the encoding process; and the "Output So Far" column shows the cumulative output for that message, starting with the first character encoded.
As the table indicates, since the first character happens to be a 'C', then we can output "1", because all possible messages starting with 'C' have codes starting with a "1". Let's continue with Figure aritha2 to see the encoding for a two-character message.
Two-character messages (Figure aritha2)

                 Cum.                            Previous  Current  Output
 Message  Freq.  Freq.          Codes             Output   Output   So Far
+---------+-----+-----+-----------------------+---------+--------+--------+
|   AA    |  4  |  4  | 000000(00)-000011(03) |   00    |   00   |  0000  |
|   AB    |  4  |  8  | 000100(04)-000111(07) |   00    |   01   |  0001  |
|   AC    |  8  | 16  | 001000(08)-001111(15) |   00    |   1    |  001   |
|   BA    |  4  | 20  | 010000(16)-010011(19) |   01    |   00   |  0100  |
|   BB    |  4  | 24  | 010100(20)-010111(23) |   01    |   01   |  0101  |
|   BC    |  8  | 32  | 011000(24)-011111(31) |   01    |   1    |  011   |
| * CA    |  8  | 40  | 100000(32)-100111(39) |   1     |   00   |  100   |
|   CB    |  8  | 48  | 101000(40)-101111(47) |   1     |   01   |  101   |
|   CC    | 16  | 64  | 110000(48)-111111(63) |   1     |   1    |  11    |
+---------+-----+-----+-----------------------+---------+--------+--------+
After encoding the first two characters of our message, "CA", our cumulative output is "100", since the range of codes for messages starting with "CA" is from "100000" to "100111"; all these codes start with "100". The whole three-character message is encoded as shown in Figure aritha3.
Three-character messages (Figure aritha3)

                 Cum.                            Previous  Current  Output
 Message  Freq.  Freq.          Codes             Output   Output   So Far
+---------+-----+-----+-----------------------+---------+--------+--------+
|   AAA   |  1  |  1  | 000000(00)-000000(00) |  0000   |   00   | 000000 |
|   AAB   |  1  |  2  | 000001(01)-000001(01) |  0000   |   01   | 000001 |
|   AAC   |  2  |  4  | 000010(02)-000011(03) |  0000   |   1    | 00001  |
|   ABA   |  1  |  5  | 000100(04)-000100(04) |  0001   |   00   | 000100 |
|   ABB   |  1  |  6  | 000101(05)-000101(05) |  0001   |   01   | 000101 |
|   ABC   |  2  |  8  | 000110(06)-000111(07) |  0001   |   1    | 00011  |
|   ACA   |  2  | 10  | 001000(08)-001001(09) |  001    |   00   | 00100  |
|   ACB   |  2  | 12  | 001010(10)-001011(11) |  001    |   01   | 00101  |
|   ACC   |  4  | 16  | 001100(12)-001111(15) |  001    |   1    | 0011   |
|   BAA   |  1  | 17  | 010000(16)-010000(16) |  0100   |   00   | 010000 |
|   BAB   |  1  | 18  | 010001(17)-010001(17) |  0100   |   01   | 010001 |
|   BAC   |  2  | 20  | 010010(18)-010011(19) |  0100   |   1    | 01001  |
|   BBA   |  1  | 21  | 010100(20)-010100(20) |  0101   |   00   | 010100 |
|   BBB   |  1  | 22  | 010101(21)-010101(21) |  0101   |   01   | 010101 |
|   BBC   |  2  | 24  | 010110(22)-010111(23) |  0101   |   1    | 01011  |
|   BCA   |  2  | 26  | 011000(24)-011001(25) |  011    |   00   | 01100  |
|   BCB   |  2  | 28  | 011010(26)-011011(27) |  011    |   01   | 01101  |
|   BCC   |  4  | 32  | 011100(28)-011111(31) |  011    |   1    | 0111   |
|   CAA   |  2  | 34  | 100000(32)-100001(33) |  100    |   00   | 10000  |
| * CAB   |  2  | 36  | 100010(34)-100011(35) |  100    |   01   | 10001  |
|   CAC   |  4  | 40  | 100100(36)-100111(39) |  100    |   1    | 1001   |
|   CBA   |  2  | 42  | 101000(40)-101001(41) |  101    |   00   | 10100  |
|   CBB   |  2  | 44  | 101010(42)-101011(43) |  101    |   01   | 10101  |
|   CBC   |  4  | 48  | 101100(44)-101111(47) |  101    |   1    | 1011   |
|   CCA   |  4  | 52  | 110000(48)-110011(51) |  11     |   00   | 1100   |
|   CCB   |  4  | 56  | 110100(52)-110111(55) |  11     |   01   | 1101   |
|   CCC   |  8  | 64  | 111000(56)-111111(63) |  11     |   1    | 111    |
+---------+-----+-----+-----------------------+---------+--------+--------+

We have generated exactly the same output from the same input as we did with Huffman coding. So far, this seems to be an exercise in futility; is arithmetic coding just another name for Huffman coding?
These two algorithms provide the same compression efficiency only when the frequencies of the characters to be encoded happen to be representable as integral powers of 1/2, as was the case in our examples so far; however, consider the frequency table shown in Figure huffman.poor.
Letter frequencies and codes (Figure huffman.poor)

                     Huffman   Fixed-length
 Letter   Frequency    Code        Code
+--------+-----------+---------+--------------+
|   A    |    3/4    |    0    |      0       |
|   B    |    1/4    |    1    |      1       |
+--------+-----------+---------+--------------+
Using Huffman coding, there is no way to assign theoretically optimal codes to characters with such a frequency distribution.3 Since the shortest possible Huffman code is one bit, the best we can do is to assign a one-bit code to each character, although this does not reflect the difference in their frequencies. In fact, Huffman coding can never provide any compression at all with a two-character alphabet.
However, such a situation is handled very well indeed by arithmetic compression. To see how, we will start by asking a fundamental question: what is a bit?
A bit is the amount of information required to specify an alternative that has a frequency of 50%; two bits can specify alternatives that have a frequency of 25%, and so forth. For example, the toss of a coin can result in either heads or tails, each of which can be optimally represented by a one-bit code; similarly, if a chance event has four equally likely outcomes, we can express each possible result most economically with two bits. On the other hand, as we have seen in our discussion of Huffman codes, we can have a number of alternatives that are not equally likely; in that case, we assign longer codes for those alternatives that are less likely. However, the shortest possible code in Huffman coding is one bit, which is assigned to an outcome with a frequency of one-half.
The general formula for the optimal length of a code specifying a particular outcome with frequency f is "log2(1/f)". In our previous examples, an outcome with a frequency of .5 should have a code length of log2(1/(1/2)) or 1 bit. Similarly, if an outcome has a frequency of .25 it should have a code length of log2(1/(1/4)), or two bits.
But what if one of the possible outcomes has a frequency greater than one-half? Logically, we should use less than one bit to specify it. For example, if we have a data file in which 84% of the characters are spaces, we can calculate the appropriate number of bits in the code for that character as log2(1/(.84)), or approximately .25 bits. If the remaining 255 characters all have equal frequencies, each of these frequencies is .0627%, so that our formula reduces to log2(1/(.0627)), or approximately 10.63 bits each. This would result in an average code length of (.84)*(.25) + (.16)*10.63, or 1.91 bits per character. By contrast, a Huffman code would require (.84)*1 + (.16)*9, or 2.28 bits per character. If we were compressing a 250-Kbyte file with such characteristics, using Huffman codes would produce a 71-Kbyte file, whereas arithmetic coding would result in a 60-Kbyte file, a difference of about 15% between these two approaches.4
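The formula above is worth having in executable form; the following one-liner is our own sketch (the function name is not from the book's code).

```cpp
#include <cassert>
#include <cmath>

// Optimal code length in bits for an outcome with the given frequency:
// log2(1/f), as discussed in the text.
double optimal_bits(double frequency) {
    return std::log2(1.0 / frequency);
}
```

For the space-character example, optimal_bits(0.84) comes out to roughly 0.25 bits, matching the calculation in the text.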
Of course, we can't output a code of less than one bit. However, we can use a set of one or more bits to represent more than one character in the same message. This is why the statement that Huffman coding is the most efficient way to represent characters is true, but misleading; if our messages contain more than one character each, we may be able to combine the codes for a number of characters while consuming a fractional number of bits for some characters. To make this clearer, let's go through the encoding of a three-character message, "ABA", from a two-character alphabet in which the letter 'A' accounts for three-fourths of all the characters and the letter 'B' the remaining one-fourth. The situation after we see the first character is shown in Figure arithb1.
One-character messages, two-character alphabet (Figure arithb1)

                 Cum.                            Previous  Current  Output
 Message  Freq.  Freq.          Codes             Output   Output   So Far
+---------+-----+-----+-----------------------+---------+--------+--------+
|  * A    | 48  | 48  | 000000( 0)-101111(47) |  None   |  None  |  None  |
|    B    | 16  | 64  | 110000(48)-111111(63) |  None   |   11   |   11   |
+---------+-----+-----+-----------------------+---------+--------+--------+
If the first character were a 'B', then we could output "11", because all possible messages starting with 'B' have codes starting with those two bits. However, what can we output when the first character is an 'A'?
Nothing! We don't know whether the first bit of our encoded message will be a 0 or a 1; that depends on what happens next. Remember, messages starting with the letter 'A' can have codes starting with 00 through 10, or three-fourths of all possible codes. An 'A' gives us somewhat less than 1/2 bit of information, not nearly enough to produce any output by itself. Now let's look at Figure arithb2 for the information needed to encode the next character.
Two-character messages, two-character alphabet (Figure arithb2)

                 Cum.                            Previous  Current  Output
 Message  Freq.  Freq.          Codes             Output   Output   So Far
+---------+-----+-----+-----------------------+---------+--------+--------+
|   AA    | 36  | 36  | 000000( 0)-100011(35) |  None   |  None  |  None  |
| * AB    | 12  | 48  | 100100(36)-101111(47) |  None   |   10   |   10   |
|   BA    | 12  | 60  | 110000(48)-111011(59) |   11    |  None  |   11   |
|   BB    |  4  | 64  | 111100(60)-111111(63) |   11    |   11   |  1111  |
+---------+-----+-----+-----------------------+---------+--------+--------+
We have two bits of output, since all the codes for messages starting with "AB" have the initial bits "10".5 Let's continue with Figure arithb3 for the third (and last) character of our message.
Three-character messages, two-character alphabet (Figure arithb3)

                 Cum.                            Previous  Current  Output
 Message  Freq.  Freq.          Codes             Output   Output   So Far
+---------+-----+-----+-----------------------+---------+--------+--------+
|   AAA   | 27  | 27  | 000000( 0)-011010(26) |  None   |   0    |   0    |
|   AAB   |  9  | 36  | 011011(27)-100011(35) |  None   |  None  |  None  |
| * ABA   |  9  | 45  | 100100(36)-101100(44) |   10    |  None  |   10   |
|   ABB   |  3  | 48  | 101101(45)-101111(47) |   10    |   11   |  1011  |
|   BAA   |  9  | 57  | 110000(48)-111000(56) |   11    |  None  |   11   |
|   BAB   |  3  | 60  | 111001(57)-111011(59) |   11    |   10   |  1110  |
|   BBA   |  3  | 63  | 111100(60)-111110(62) |  1111   |  None  |  1111  |
|   BBB   |  1  | 64  | 111111(63)-111111(63) |  1111   |   11   | 111111 |
+---------+-----+-----+-----------------------+---------+--------+--------+
We have still produced only two bits of output from our three-character message, "ABA". The best we could do with Huffman coding is three bits. However, this is not the extreme case; if we were encoding the most frequent message, "AAA", we would have only a "0" bit.6
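The range narrowing shown in Figures arithb1 through arithb3 can be sketched as integer arithmetic on the 64 possible codes. This is our own illustration, not the book's code; it hardcodes the two-letter alphabet with 'A' at frequency 3/4.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Narrow the code range [low, high) for a message over the alphabet of
// Figure arithb1: 'A' owns the lower three-fourths of the current range,
// 'B' the upper fourth.
std::pair<int, int> narrow(const std::string& message) {
    int low = 0, high = 64;
    for (char ch : message) {
        int width = high - low;
        if (ch == 'A')
            high = low + width * 3 / 4;  // keep the lower 3/4
        else
            low = low + width * 3 / 4;   // keep the upper 1/4
    }
    return std::make_pair(low, high);
}

// Leading bits shared by every 6-bit code in [low, high): the bits that
// can be output so far.
std::string common_prefix(int low, int high) {
    std::string bits;
    for (int b = 5; b >= 0; --b) {
        int bit = (low >> b) & 1;
        if (bit != (((high - 1) >> b) & 1)) break;
        bits += static_cast<char>('0' + bit);
    }
    return bits;
}
```

For "ABA" this yields the range 36 through 44 and the output "10", matching Figure arithb3.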
This algorithm will work nicely if we happen to know in advance what the frequencies of all the possible characters will be. But how do we acquire this information? If our program reads an input file and writes an output file, we can go through the input file once, counting the number of times a given character appears, build the table of frequencies, and then make a second pass using this table to do the actual encoding. However, this is not possible when we are compressing data as it is being generated, as there is then no input file to be analyzed. This might seem to be an insurmountable obstacle.
Luckily, it is not; calculating the character frequencies as we encode yields about the same compression efficiency as precalculation, and in some cases even better efficiency! The reason for this surprising result is that most files (or sets of data) are not uniform throughout, but rather exhibit local variations in the frequency of a given character or set of characters. Calculating the frequencies on the fly often produces a better match for these changing characteristics.
Therefore, our approach is as follows. Every character is initially assumed to have the same frequency of occurrence. After each character is read and encoded, its entry in the table of frequencies is increased to reflect our having seen it. The next time we encode it, its encoding will account for the fact that we have seen it before.
That may seem quite simple for the sender, but not for the receiver. If the encoding table is being modified as we go, how does the receiver keep in step?
The receiver has the same initial table as the transmitter; each character has the same expected frequency of occurrence. Then, as the receiver decodes each character, it updates its copy of the frequency table. This also explains why we increase a character's frequency after we encode it, rather than before; until the receiver decodes the character, it cannot update the frequency of that character's occurrence.7
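The update rule above can be made concrete with a toy model; this is our own sketch, not the book's code, but it shows why updating *after* coding keeps both ends synchronized.

```cpp
#include <array>
#include <cassert>
#include <string>

// Toy adaptive model: every character starts with the same count, and a
// character's count is bumped only after it has been encoded or decoded.
struct ToyModel {
    std::array<int, 256> freq;
    ToyModel() { freq.fill(1); }                   // all characters start equal
    void update(unsigned char ch) { ++freq[ch]; }  // adjust after coding ch
};

// Simulate encoder and decoder processing the same message in lockstep:
// the tables must agree at the moment each character is coded.
bool models_stay_in_step(const std::string& message) {
    ToyModel sender, receiver;
    for (unsigned char ch : message) {
        if (sender.freq != receiver.freq) return false;  // must agree before update
        sender.update(ch);    // sender updates after encoding the character
        receiver.update(ch);  // receiver updates after decoding it
    }
    return sender.freq == receiver.freq;
}
```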
In our implementation, we achieve a large improvement in compression efficiency by using the context in which a character occurs when estimating its frequency of occurrence. It should be apparent that the frequency of a given character appearing in a message at a given point is quite dependent on the previous context. For example, in English text, a "Q" is almost certain to be followed by a "u", at least if we exclude articles about software such as DESQview! Ideally, therefore, the amount of information needed to encode a "u" following a "q" should be very small. The same principle, of course, applies to the encoding of a line feed following a carriage return in text files where this is a virtual certainty.
On the other hand, the amount of storage required to keep track of a large amount of previous context can become excessive.8 Even one character of previous context requires the construction of 256 tables of frequencies, one for each possible previous character. A direct extension of the approach given in the reference in footnote would require over 300 KB of storage for these tables.
We will apply a number of space-saving methods to reduce the above storage requirement by about 90%, to approximately 35 KB, while still achieving good data-compression performance in most cases.
In order to achieve such a reduction in memory consumption, we must avoid storing
anything that can be recalculated, such as the cumulative frequencies of all possible
characters.9 We must also dispense with the use of a self-organizing table that
attempts to speed up encoding and decoding by moving more frequently used characters
toward the front of the table, as is done in the reference in footnote . However, we must
provide as large a dynamic range of frequencies as possible: the larger the ratio between
the highest frequency and the lowest, the greater the possible efficiency. The greatest
dynamic range is needed when one character always occurs in a particular context, such as
line feed after carriage return. Assuming that we must be able to encode any of the 256
possible one-byte values, the algorithm limits the possible dynamic range to approximately
one-fourth of the range of an
unsigned value.10 For maximum
compression efficiency we therefore need 256 tables of 256 two-byte entries each,
consuming a total of 128 KB.11 When I first implemented this algorithm, I saw
no way to reduce this significantly.
Then early one morning, I woke up with the answer. We need some very large frequency values and some very small ones, but surely not every one in between. Why not use a code to represent one of a number of frequency values? These values would be spaced out properly to get as close as possible to optimal compression efficiency, but each would be represented by a small index, rather than literally.
How small an index? First I considered using an eight-bit index. But that still would require 64 KB for a complete 256x256 table. Maybe a smaller index would help. But wouldn't that impose an unreasonable processing time penalty?
Amazingly enough, we can use a four-bit index with no time penalty: in fact, processing is actually faster than with a byte index! This seemingly magical feat is accomplished by one of my favorite time-saving methods: the lookup table.
You see, a significant amount of the CPU time needed to encode a symbol is spent in a loop that computes the necessary portion of the cumulative frequency table for the current context. If each frequency value occupied one byte, we would have to execute that loop once for every index in the table up to the ASCII value of the character whose cumulative frequency we are trying to compute. However, since we have packed two indexes into one byte, we can accumulate two indexes worth of frequency values with one addition, thus reducing the number of times the loop has to be executed by approximately 50%.
Each execution of the loop translates a pair of indexes contained in one byte into a
total frequency value that is the sum of their individual frequency values by using the
byte as an index into a 256-word table, which has one entry for each possible combination
of two indexes. This table (the
both_weights table) is calculated once at the
start of the program.
Once we have determined the frequency of occurrence assigned to the character to be encoded, we can decide how many bits we can send and what their values are.
The receiver uses almost the same code as the sender, with two main exceptions. First, the receiver reads the input one bit at a time and outputs each character as soon as it is decoded, just the reverse of the sender. Second, rather than knowing how many frequency entries we have to accumulate in advance as the sender does, we have to accumulate them until we find the one that corresponds to the range we are trying to interpret. The latter situation reduces the efficiency of our loop control, which accounts for much of the difference in speed between encoding and decoding.
Let's start at the beginning, with the main function. This function opens the input and output files and sets up buffers to speed up input and output, then calls start_model (Figure adapt.00).
start_model function (from compress\adapt.cpp) (Figure adapt.00)
This function starts out by initializing the upgrade_threshold array, which is used to determine when to promote a character to the next higher frequency index value. As noted above, these values are not consecutive, so that we can use a four-bit index rather than literal values; this means that we have to promote a character only once in a while rather than every time we see it, as we would do with literal frequency values. How do we decide when to do this?
A pseudorandom approach seems best: we can't use a genuine random number to tell us when to increment the index, because the receiver would have no way of reproducing our decisions when decoding the message. My solution is to keep a one-byte hash total of the ASCII codes for all the characters that have been sent (char_total) and to increment the index in question whenever char_total is greater than the corresponding value stored in the upgrade_threshold array. That threshold value is calculated so that the probability of incrementing the index is inversely proportional to the gap between frequency values in the translation table.12 If, for example, each frequency value in the translation table were twice the previous value, there would be a 1/2 probability of incrementing the index each time a character was encountered. After we finish initializing the upgrade_threshold array, we set char_total to 0, in preparation for accumulating our hash total.
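The promotion test can be sketched as follows. char_total is the one-byte hash total named in the text; the standalone function and the thresholds passed to it are our own illustration, not the book's code.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Count how many index promotions a message would trigger at a given
// threshold. The one-byte total wraps at 256, and the comparison is
// deterministic, so the receiver can reproduce every decision.
int promotions(const std::string& message, uint8_t threshold) {
    uint8_t char_total = 0;
    int count = 0;
    for (unsigned char ch : message) {
        char_total = static_cast<uint8_t>(char_total + ch);  // one-byte hash total
        if (char_total > threshold) ++count;                 // promote this time?
    }
    return count;
}
```

A threshold of 255 can never be exceeded by a one-byte total, so it yields a promotion probability of zero; lower thresholds promote proportionally more often, which is how the gaps in the translation table are matched.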
The next operation in start_model is to generate the both_weights table. As we discussed above, this table allows us to translate from a pair of frequency values (or weights) to the total frequency to which they correspond. We calculate it by generating every possible pair and adding them together to fill in the corresponding table entry. The values in the translate table are defined in Figure model.00, in the line that starts with #define TRANSLATE_TABLE. How did I generate these values?
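The table generation itself is a simple double loop. For brevity this sketch uses the four-entry translate values implied by the simplified sample tables later in the chapter (1, 2, 8, 52), not the real 16-entry TRANSLATE_TABLE; the function name is ours.

```cpp
#include <cassert>

// Sample translate values: frequency value for each 2-bit index (a real
// table has 16 entries, one per 4-bit index).
const int translate[4] = {1, 2, 8, 52};
int both_weights[4][4];

// For every possible pair of indexes, store the sum of the two frequency
// values the indexes represent. Done once at program start.
void build_both_weights() {
    for (int first = 0; first < 4; ++first)
        for (int second = 0; second < 4; ++second)
            both_weights[first][second] = translate[first] + translate[second];
}
```

With the real 16-value table, the same loop fills a 256-entry array indexed directly by the packed byte.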
It wasn't easy. I knew that I wanted to allow the largest possible dynamic range, which means that the lowest value has to be 1 and the highest value has to be close to the maximum that can be accommodated by the algorithm (16,128). The reason I chose a top value lower than that maximum value is that if the highest value were 16,128, then the occurrence of any character other than the preferred one would cause the total frequency to exceed the allowable maximum, with the result that the table of frequencies would be recalculated to reduce the top value to the next lower step. This would greatly reduce the efficiency of the compression for this case. That accounts for the lowest and highest values. What about the ones in between?
Initially, I decided to use a geometric progression, much like the tuning of a piano; in such a progression, each value is a fixed multiple of the one before it. However, I found that I achieved better compression on a fairly large sample of files by starting the progression with the second value at 16 and ending with the next to the last at 1024. Why is this so?
The reason for leaving a big gap between the lowest frequency and the second lowest one is that many characters never occur in a particular situation. If they occur once, they are likely to recur later. Therefore, setting the next-to-the-lowest frequency to approximately one-tenth of 1% of the maximum value improves the efficiency of the compression. I have also found through experimentation that the compression ratio is improved if the first time a character is seen, it is given a frequency of about six-tenths of 1%, which requires an initial index of 7. Lower initial frequencies retard adaptation to popular new characters, and higher ones overemphasize new characters that turn out to be unpopular in the long run.
The reason to leave a large gap between the next-to-highest frequency and the highest one is that most of the time, a very skewed distribution has exactly one extremely frequent character. It is rare to find several very high-frequency characters that have about the same frequency. Therefore, allowing the highest frequency to approach the theoretical maximum produces the best results. Of course, these are only empirical findings. If you have samples that closely resemble the data that you will be compressing, you can try modifying these frequency values to improve the compression.
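The shape of the table described above can be reconstructed approximately. This is our own illustration, not the book's actual TRANSLATE_TABLE: the text fixes entry 0 at 1, entry 1 at 16, and the next-to-last entry at 1024, with a geometric progression between them; the value we return for index 15 is invented, since the text says only that it lies somewhat below the 16,128 maximum.

```cpp
#include <cassert>
#include <cmath>

// Approximate frequency value for a 4-bit translation index, following the
// shape described in the text.
int translate_entry(int i) {
    if (i == 0) return 1;       // lowest value, for maximum dynamic range
    if (i == 15) return 15000;  // invented; the text says only "below 16,128"
    // Thirteen geometric steps carry entry 1 (16) up to entry 14 (1024).
    double ratio = std::pow(1024.0 / 16.0, 1.0 / 13.0);
    return static_cast<int>(16.0 * std::pow(ratio, i - 1) + 0.5);
}
```

With these values, index 7 lands near 0.6-0.7% of the maximum, consistent with the text's observation that a newly seen character should start at about six-tenths of 1%.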
Returning to start_model (Figure adapt.00), we initialize the 256 frequency_info tables, one for every possible character. Each of these tables will store the frequencies for characters that follow a particular character. If we start to encode the string "This is a test", the first character, 'T', will be encoded using the table for character 0 (a null); this is arbitrary, since we haven't seen any characters before this one. Then the 'h' will be encoded using the table for 'T'; the 'i' will be encoded with the table for 'h', and so forth. This approach takes advantage of the context dependence of English text; as we noted before, after we see a 'q', the next character is almost certain to be a 'u', so we should use a very short code to represent a 'u' in that position.
However, our initialization code contains no information about the characteristics of English text (or any other kind of data, for that matter). We assign the same frequency (the lowest possible) to every character in every frequency table. As discussed above, the encoding program will learn the appropriate frequencies for each character in each table as it processes the input data. At the end of the initialization for each table, we set its total_freq value to the translation of the lowest frequency, multiplied by the number of index pairs in the table. This value is needed to calculate the code that corresponds to each character, and recalculating it every time we access the table would be needlessly time-consuming.
The last operation in start_model initializes the array that is used to translate the internal representation of the characters being encoded and decoded to their ASCII codes.13 Then we return to main.
The next operation in main is to call start_outputing_bits (Figure arenc.00).
start_outputing_bits function (from compress\arenc.cpp) (Figure arenc.00)
Of course, we can't send individual bits to the output file; we have to accumulate at least one byte's worth. Whenever we have some bits to be output, we will store them in buffer; since we haven't encoded any characters yet, we set it to 0. In order to keep track of how many more bits we need before we can send buffer to the output file, we set the count of remaining bits to eight; then we return to main.
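The buffering scheme just described can be sketched as follows. "buffer" is the name used in the text; the counter name, the struct, and the most-significant-bit-first packing are our own assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Accumulate bits into a one-byte buffer; only a completed byte goes out.
struct BitWriter {
    uint8_t buffer = 0;        // partial byte being assembled
    int bits_to_go = 8;        // bits still needed before buffer can be sent
    std::vector<uint8_t> out;  // stands in for the output file

    void output_bit(int bit) {
        buffer = static_cast<uint8_t>((buffer << 1) | (bit & 1));
        if (--bits_to_go == 0) {  // byte complete: send it and reset
            out.push_back(buffer);
            buffer = 0;
            bits_to_go = 8;
        }
    }
};

// Helper: feed a string of '0'/'1' characters and return the bytes emitted.
std::vector<uint8_t> write_bits(const std::string& bits) {
    BitWriter writer;
    for (char c : bits) writer.output_bit(c - '0');
    return writer.out;
}
```

Note that seven buffered bits produce no output at all; a real implementation must flush the final partial byte when encoding ends.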
The next thing we do in main is to call start_encoding (Figure arenc.01).
start_encoding function (from compress\arenc.cpp) (Figure arenc.01)
This is a very short function, but it initializes some of the most important variables in the program. The first two of these keep track of the range of codes for the current message; at this point we know nothing about the message, so they are set to indicate the widest possible range of messages, from 0 to TOP_VALUE. In our current implementation, with 16-bit code values, TOP_VALUE evaluates to 65535.14 The third variable keeps track of the number of bits that have been deferred for later output. This is used where the range of possible codes for the current message includes some codes that start with 0 and some that start with 1; as we have seen already, in such a situation we're not ready to send out any bits yet. After initializing these variables, we return to main.
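The state this sets up can be sketched in a few lines. TOP_VALUE and the roles of the three variables come from the text; the names low, high, and bits_to_follow are ours, since the excerpt does not show them.

```cpp
#include <cassert>
#include <cstdint>

const uint32_t TOP_VALUE = 65535;  // widest range with 16-bit code values

// Encoder state as initialized by a start_encoding-style function.
struct EncoderState {
    uint32_t low = 0;           // bottom of the current code range
    uint32_t high = TOP_VALUE;  // top of the current code range
    int bits_to_follow = 0;     // bits deferred for later output
};

// True when the state describes the widest possible range with no
// deferred bits, i.e. the state before any character has been encoded.
bool is_widest_range(const EncoderState& s) {
    return s.low == 0 && s.high == TOP_VALUE && s.bits_to_follow == 0;
}
```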
Upon returning to main, we need to do one more initialization before we enter the main loop of our program, which is executed once for each character in the input file; namely, setting oldch to 0. This variable controls the context in which we will encode each character. Since we haven't seen any characters as yet, it doesn't really matter which frequency table we use, as long as the decoding program selects the same initial table.
The first operation in the main loop is to get the character to be encoded. If we have reached the end of the file, we break out of the loop. Otherwise, we call encode_symbol (Figure arenc.02) to place the representation of the current character in the output stream.
encode_symbol function (from compress\arenc.cpp) (Figure arenc.02)
This function takes two parameters: ch, the character to be encoded, and oldch, which determines the frequency table to be used for this encoding. As we have noted above, selecting a frequency table based upon the previous character encoded provides far better compression efficiency, as the frequency of occurrence of a particular character is greatly affected by the preceding character.
Although encode_symbol is a short function, it is the subtlest function in this chapter; fortunately, you can use this algorithm effectively without going through the explanation below. The version here closely follows the reference in footnote; if you are really interested in the details of operation of this function, I strongly advise you to study that reference very carefully in conjunction with the explanation here.
To clarify this algorithm, we will work through a somewhat simplified example as we examine the code. For ease in calculation, we will set TOP_VALUE to 255, or 11111111 binary, rather than 65535 as in the actual program; as a result, the upper end of the code range will start out at 255 as well. We will also use a single, constant frequency table (Figure sampfreq) containing only four entries, rather than selecting a 256-entry table according to the previous character and modifying it as we see each character, so our translate and both_weights tables (Figures samptrans and sampboth, respectively) will be adjusted correspondingly. Instead of ASCII, we will use the codes from 0 (for 'A') to 3 (for 'D') in our example message.
frequency table (Figure sampfreq)

+------------+----------------+--------------+
| Characters | Frequency byte | Index values |
+------------+----------------+--------------+
|    A, B    |    00010011    |    (1, 3)    |
|    C, D    |    00000010    |    (0, 2)    |
+------------+----------------+--------------+

translate table (Figure samptrans)

+-------+-----------+
| Index | Frequency |
+-------+-----------+
|   0   |     1     |
|   1   |     2     |
|   2   |     8     |
|   3   |    52     |
+-------+-----------+
both_weights table (Figure sampboth)

              Second Index
 First      0    1    2    3
 Index   +----+----+----+----+
    0    |  2 |  3 |  9 | 53 |
         +----+----+----+----+
    1    |  3 |  4 | 10 | 54 |
         +----+----+----+----+
    2    |  9 | 10 | 16 | 60 |
         +----+----+----+----+
    3    | 53 | 54 | 60 |104 |
         +----+----+----+----+
As we begin encode_symbol, we establish temp_freq_info as a pointer to the structure containing the current frequency table. Next, we set a local pointer to the address of the frequency table itself, and total_freq to the stored total frequency in the frequency structure; as we will see shortly, total_freq is used to determine what fraction of the frequency range is accounted for by the particular character being encoded. The final operation before entering the frequency accumulation loop is to set prev_cum to 0. This variable is used to keep track of the cumulative frequency of all characters up to but not including the one being encoded; this is used to determine the position of the character being encoded as a part of the entire range of possibilities.
Now we are ready to enter the frequency accumulation loop, which is shown in Figure arenc.03.
The reason we need this loop is that, as we saw before, we cannot afford to keep the
cumulative frequency tables in memory; they would occupy hundreds of kilobytes of memory.
Instead, we calculate the cumulative frequency for each character being encoded, as we
need it. The
total_freq variable, however, we do maintain from one encoding
to the next; recalculating it would require us to go all the way through the frequency
table for each character, even if we are encoding a character with a low ASCII code. Since
we are saving the total frequency with the table, we have to accumulate the frequencies
only up to the ASCII code of the character we are encoding.
Let's see how this loop accumulates the frequencies of all the characters up to the character we want to encode. The first interesting item is that the loop executes only half as many times as the value of the character being encoded, since we are packing two 4-bit indexes into each byte of the frequency table. So the first statement in the loop retrieves one of these pairs of indexes from the frequency table and increments the pointer to point to the next pair. Then we index into the both_weights table with the index pair we just retrieved and set total_pair_weight to that entry in the table. The both_weights table is the key to translating two 4-bit indexes into a total frequency. Each entry in the table is the sum of the frequency values that correspond to the two indexes that make up the byte we use to index into the table. Finally, we add total_pair_weight to prev_cum, which is accumulating all of the frequencies.
In our example, the first letter of our message is 'D', which has a symbol value of 3. Using the frequency table in Figure sampfreq, we execute the statements in the loop once. First, we set current_pair to the first entry in the frequency table, 00010011, which indicates that the frequency code for 'A' is 1 (0001 binary) and the frequency code for 'B' is 3 (0011 binary). Then we set total_pair_weight to entry (1,3) from the both_weights table, the sum of the frequencies to which this pair of indexes corresponds; its value is 54. The last statement in the loop adds this value to prev_cum, which was set to 0 before the loop was started.
The next section of code, shown in Figure arenc.04, finishes the accumulation of the cumulative frequencies of the character to be encoded and the previous character, for both of the possible alignments of the character we are encoding and the previous character.
If the target character has an even ASCII code, we already have the correct value for prev_cum; however, to calculate cum, the frequency accumulation for the current character, we need the first index from the next byte, which is stored in the high half of the byte. So we pick up current_pair, shift its high index down, and use it to retrieve the corresponding weight from the translate table. Then we add that frequency to prev_cum to calculate cum.
On the other hand, if the target character has an odd ASCII code, we need to update both prev_cum and cum. First, we add the total weights for the last two characters to prev_cum, which results in cum. Then we translate the high half of the current_pair and add that to prev_cum.
In our example, the code of the first character is 3, so we need to update both prev_cum and cum. The value of current_pair is (0,2). Looking in the both_weights table, we set total_pair_weight to the translation of that pair, which is 9. Then cum is calculated as the sum of prev_cum and total_pair_weight, or 63. Then we extract the high part of current_pair, 0, which translates to 1; we add this amount to prev_cum, setting it to 55. This means that the code range associated with character "D" starts at 55 and ends slightly before 63, a range of 8 positions out of the total of 63. This will allow us to send out slightly less than three bits for this character, as we will narrow the range by a factor of more than 7.
Now that we have calculated the cumulative frequencies of the current character and the previous character, we are ready to narrow the range of frequencies that correspond to our message, as shown in Figure arenc.05.
The first line of this section of code calculates the previous range of frequencies for our message. Then the other lines calculate the new range of the message.
The purpose of low and high is to delimit the frequency interval in which the current message falls; range is just the size of the frequency interval extending from the old value of low to the old value of high.
In our example, the old value of low is 0 and the old value of high is 255. Therefore, the formula for range, (long)(high-low)+1, produces 256, which is its maximum value. This makes sense, as we have not yet used the information from the first character of our message.
The new values of low and high represent the narrowing of the previous range due to the frequency of the new character. We calculate the new value of high first, which represents the high end of the new range of frequencies for our message after the most recent character is added to the message.15 Similarly, the new value of low represents the low end of the range of frequencies for our message after the most recent character is added to the message.
In our example, low is still 0, cum is 63, and total_freq is 63. The expression to calculate high is low + (range * cum) / total_freq - 1. Therefore, high is calculated as 0+(256*63)/63-1, or 255. This means that the new range of frequencies for our message ends slightly below 256. Next, we recalculate low. Its value is still 0 so far, prev_cum is 55, and total_freq is 63. The expression to calculate low is low + (range * prev_cum)/total_freq. Therefore, we calculate the new value of low as 0+(256*55)/63, or 223. This means that the new range of frequencies for our message begins at 223.
Now we are finally ready to start sending bits out. The loop shown in Figure arenc.06 extracts as many bits as possible from the encoding of the message so far, and widens the range correspondingly to allow the encoding of more characters.
In order to understand this code, we need to look at the table of possible initial bits of low and high, which is given in Figure initcode.
The entries in this table could use some explanation. The first two columns contain all the possible combinations of the first two bits of low and high. Of course, we know that low cannot be greater than high, since these two values delimit the range of codes for the message being compressed. If low ever became greater than high, it would be impossible to encode any further characters. The "Action" column indicates what, if anything, we can output now. If low and high have the same first bit, we can output that bit. The entries labeled "Next" indicate that since the separation between the values of low and high is at least one-fourth of the total range of values (0-TOP_VALUE), we can encode at least one more character now; this is the reason for the limit on the total frequency of all characters.
  Low      High     Action
+--------+---------+---------+
|   00   |   00    |    0    |
|   00   |   01    |    0    |
|   00   |   10    |  Next   |
|   00   |   11    |  Next   |
|   01   |   01    |    0    |
|   01   |   10    |  Defer  |
|   01   |   11    |  Next   |
|   10   |   10    |    1    |
|   10   |   11    |    1    |
|   11   |   11    |    1    |
+--------+---------+---------+
The entry "Defer" means that we can't do any output now; however, when we do have output, we will be able to emit at least the first two bits of the result, since we know already that these bits are either "01" or "10". As we will see shortly, this condition is indicated by a nonzero value for bits_to_follow.
The first test determines whether the highest frequency value allocated to our message
is less than one-half of the total frequency range. If it is, we know that the first
output bit is 0, so we call the
bit_plus_follow_0 macro (Figure arenc.07) to
output that bit. Let's take a look at that macro.
First we call
output_0, which adds a 0 bit to the output buffer and writes
it out if it is full. Then, in the event that
bits_to_follow is greater than
0, we call
output_1 and decrement
bits_to_follow until it
reaches 0. Why do we do this?
bit_plus_follow_0 macro (from compress\arenc.cpp) (Figure arenc.07)
The reason is that bits_to_follow indicates the number of bits that could not be output up till now because we didn't know the first bit to be produced ("deferred" bits). For example, if the range of codes for the message had been 011 through 100, we would be unable to output any bits until the first bit was decided. However, once we have enough information to decide that the first bit is 0, we can output three bits, "011". The value of bits_to_follow would be 2 in that case, since we have two deferred bits. Of course, if the first bit turns out to be 1, we would emit "100" instead. The reason that we know that the following bits must be the opposite of the initial bit is that the only time we have to defer bits is when the code range is split between codes starting with 0 and those starting with 1; if both low and high started with the same bit, we could send that bit out.
The values of HALF, FIRST_QTR, and THIRD_QTR are all based on TOP_VALUE (Figure arith.00). In our example, HALF is 128, and THIRD_QTR is 192. The current value of high is 255, which is more than HALF, so we continue with our tests.
Assuming that we haven't obtained the current output bit yet, we continue by testing the complementary condition. If low is greater than or equal to HALF, we know that the entire frequency range allocated to our message so far is in the top half of the total frequency range; therefore, the first bit of the message is 1. If this occurs, we output a 1 via bit_plus_follow_1. The next two lines reduce both low and high by HALF, since we know that both are above that value.
In our example, low is 223, which is more than HALF. Therefore, we can call bit_plus_follow_1 to output a 1 bit. Then we adjust both low and high by subtracting HALF, to account for the 1 bit we have just produced; low is now 95 and high is 127.
If we haven't passed either of the tests to output a 0 or a 1 bit, we continue by testing whether the range is small enough but in the wrong position to provide output at this time. This is the situation labeled "Defer" in Figure initcode. We know that the first two bits of the output will be either 01 or 10; we just don't know which of these two possibilities it will be. Therefore, we defer the output, incrementing bits_to_follow to indicate that we have done so; we also reduce both low and high by FIRST_QTR, since we know that both are above that value. If this seems mysterious, remember that the encoding information we want is contained in the differences between low and high, so we want to remove any redundancy in their values.16 If we get to the break statement near the end of the loop, we still don't have any idea what the next output bit(s) will be. This means that the range of frequencies corresponding to our message is still greater than 50% of the maximum possible frequency; we must have encoded an extremely frequent character, since it hasn't even contributed one bit! If this happens, we break out of the output loop; obviously, we have nothing to declare.
In any of the other three cases, we have now arrived at the statement low <<= 1; with the values of low and high guaranteed to be less than half the maximum possible frequency.17 Therefore, we shift both low and high up one bit to make room for our next pass through the loop. One more detail: we increment high after shifting it because the range represented by high actually extends almost to the next frequency value; we have to shift a 1 bit in rather than a 0 to keep this relationship.
In our example, low is 95 and high is 127, which represents a range of frequencies from 95 to slightly less than 128. The shifts give us 190 for low and 255 for high, which represents a range from 190 to slightly less than 256. If we hadn't added 1 to high, the range would have been from 190 to slightly less than 255.
Since we have been over the code in this loop once already, we can continue directly with the example. We start out with low at 190 and high at 255. Since high is not less than HALF (128), we proceed to the second test, where low turns out to be greater than HALF. So we call bit_plus_follow_1 again as on the first loop and then reduce both low and high by HALF, producing 62 for low and 127 for high. At the bottom of the loop, we shift a 0 into low and a 1 into high, resulting in 124 and 255, respectively.
On the next pass through the loop, high is not less than HALF, low isn't greater than or equal to HALF, and high isn't less than THIRD_QTR (192), so we hit the break and exit the loop. We have sent two bits to the output buffer.
We are finished with encode_symbol for this character. Now we will start processing the next character of our example message, which is 'B'. This character has a symbol value of 1, as it is the second character in our character set. First, we set prev_cum to 0. The frequency accumulation loop in Figure arenc.03 will not be executed at all, since symbol/2 evaluates to 0; we fall through to the adjustment code in Figure arenc.04 and select the odd path after the else, since the symbol code is odd. We set current_pair to (1,3), since that is the first entry in the frequency table. Then we set total_pair_weight to the corresponding entry in the both_weights table, which is 54. Next, we set cum to 0 + 54, or 54. The high part of the current pair is 1, so we look up entry 1 in the translate table, or 2; we add this to prev_cum, which becomes 2 as well.
Now we have reached the first line in Figure arenc.05. Since the current value of low is 124 and the current value of high is 255, the value of range becomes 132. Next, we recalculate high as 124 + (132*54)/63 - 1, or 236. The new value of low is 124 + (132*2)/63, or 128. We are ready to enter the output loop again.
This time, high is not less than HALF, so the first test fails. However, low is equal to HALF, so the second test succeeds. Therefore, we call bit_plus_follow_1 to output a 1 bit; it would also output any deferred bits that we might have been unable to send out before, although there aren't any at present. We also adjust low and high by subtracting HALF, to account for the bit we have just sent; their new values are 0 and 108, respectively.
Next, we proceed to the statements beginning at low <<= 1;, where we shift low and high up, injecting a 0 and a 1, respectively; the new values are 0 for low and 217 for high. On the next pass through the loop we will discover that these values are too far apart to emit any more bits, and we will break out of the loop and return to the main function.
We could continue with a longer message, but I imagine you get the idea. So let's
return to the
main function (Figure encode.00).18
The next function called is update_model (Figure adapt.01), which adjusts the frequencies in the current frequency table to account for having seen the most recent character.
update_model function (from compress\adapt.cpp) (Figure adapt.01)
The arguments to this function are symbol, the internal value of the character just encoded, and oldch, the previous character encoded, which indicates the frequency table that was used to encode that character. What is the internal value of the character? In the current version of the program, it is the same as the ASCII code of the character; however, near the end of the chapter we will employ an optimization that involves translating characters from ASCII to an internal code to speed up the frequency accumulation loop.
This function starts out by adding the character's ASCII code to char_total, a hash total which is used in the simple pseudorandom number generator that we use to help decide when to upgrade a character's frequency to the next frequency index code. We use the symbol_translation table to get the ASCII value of the character before adding it to char_total; this is present for compatibility with our final version, which employs character translation.
The next few lines initialize some variables: old_weight_code, which we set when changing a frequency index code or "weight code", so that we can update the frequency total for this frequency table; temp_freq_info, a pointer to the frequency table structure for the current character; and freq_ptr, the address of the frequency table itself.
Next, we compute the index into the frequency table for the weight code we want to examine. If this index is even, that means that this symbol is in the high part of that byte in the frequency table. In this case, we execute the code in the true branch of the if statement "if (symbol % 2 == 0)".
This starts by setting temp_freq to the high four bits of the table entry. If the result is 0, this character has the lowest possible frequency value; we assume that this is because it has never been encountered before and set its frequency index code to 1. Then we update the total_freq element in the frequency table.
If temp_freq is not 0, we have to decide whether to upgrade this character's frequency index code to the next level. The probability of this upgrade is inversely proportional to the ratio of the current frequency to the next frequency; the larger the gap between two frequency code values, the less the probability of the upgrade. So we compare char_total to the corresponding entry in the upgrade_threshold array; if char_total is greater, we want to do the upgrade, so we record the previous frequency code in old_weight_code and add HIGH_INCREMENT to the byte containing the frequency index for the current character. We have to use HIGH_INCREMENT rather than 1 to adjust the frequency index, since the frequency code for the current character occupies the high four bits of its byte.
Of course, the character is just as likely to be in the low part of its byte; in that case, we execute the code in the false branch of that if statement, which corresponds exactly to the above code. In either case, we follow up with an if statement whose condition is "(old_weight_code != -1)", which tests whether a frequency index code was incremented. If it was, we add the difference between the new code and the old one to the total_freq entry in the current frequency table, unless the character previously had a frequency index code of 0; in that case, we have already adjusted the total_freq entry.
The last operation in update_model is to make sure that the total of all frequencies in the frequency table does not exceed the limit of MAX_FREQUENCY; if that were to happen, more than one character might map into the same value between low and high, so that unambiguous decoding would become impossible. Therefore, if the total exceeds MAX_FREQUENCY, we have to reduce the frequency indexes until this is no longer the case. The while loop whose continuation expression is "(total_freq > MAX_FREQUENCY)" takes care of this problem in the following way.
First, we initialize temp_total_freq to 0, as we will use it to accumulate the frequencies as we modify them. Then we set freq_ptr to the address of the first entry in the frequency table to be modified. Now we are ready to step through all the bytes in the frequency table; for each one, we test whether both indexes in the current byte are 0. If so, we can't reduce them, so we just add the frequency value corresponding to the translation of two 0 indexes (the both_weights entry for the pair (0,0)) to temp_total_freq. Otherwise, we copy the current index pair into freq. If the high index is nonzero, we decrement it. Similarly, if the low index is nonzero, we decrement it. After handling either or both of these cases, we add the translation of the new index pair to temp_total_freq.
After we have processed all of the index values in this way, we retest the while condition, and when temp_total_freq is no longer out of range, we store it back into the frequency table and return to the main program.
Finally, we have returned to the main function (Figure encode.00), where we store the character just encoded in oldch, so that the current character will be used to select the frequency table for the next character to be encoded; then we continue in the main loop until all characters have been processed.
When we reach EOF in the input file, the main loop terminates; we then encode EOF_SYMBOL, which tells the receiver to stop decoding. Then we call done_encoding (Figure arenc.08) and done_outputing_bits (Figure arenc.09) to flush any remaining information to the output file. Finally, we close the input and output files and exit.
done_encoding function (from compress\arenc.cpp) (Figure arenc.08)
done_outputing_bits function (from compress\arenc.cpp) (Figure arenc.09)
Now that we have developed a working version of our program, let's see how much more performance we can get by judicious optimization in the right places. The first step is to find out where those places are. Therefore, I ran the program under Microsoft's profiler on a file of approximately 11 KB, compressing it to a little less than 5 KB (a compression ratio of 2.3 to 1). This didn't take very long, as you can see from the profile in Figure encode.pro1.
Total time: 167.100 milliseconds
Time % Function
80.010 49.0 encode_symbol(unsigned int,unsigned int) (arenc.obj)
29.139 17.9 _main (encode.obj)
18.717 11.5 output_1(void) (arenc.obj)
17.141 10.5 output_0(void) (arenc.obj)
15.957 9.8 update_model(int,unsigned char) (adapt.obj)
2.234 1.4 start_model(void) (adapt.obj)
The majority of the CPU time is spent in the
encode_symbol function, so
that's where we'll look first. As I mentioned in the discussion of the algorithm, a
significant amount of the CPU time needed to encode a symbol is spent in the loop shown in
Figure arenc.03. Accordingly, we should be able to increase the speed of our
implementation by using a more efficient representation of the characters to be encoded,
so that the accumulation loop will be executed fewer times in total. While we cannot spare
the space for self-organizing tables of character translations, a fixed translation of
characters according to some reasonable estimate of their likelihood in the text is
certainly possible. Let's see how that would affect our performance.
After analyzing a number of (hopefully representative) text files, I used the results
to produce a list of the characters in descending order of number of occurrences. This
information is stored in the
symbol_translation table (Figure adapt.02),
which is used during decoding to translate from the internal code into ASCII.
symbol_translation table (from compress\adapt.cpp) (Figure adapt.02)
The inverse transformation, used by the encoder, is supplied by a second translation table, which is initialized near the end of the start_model function.
To use these new values for each character when encoding and decoding, we have to make
some small changes in the encoding and decoding main programs. Specifically, in
we have to use the translated character rather than the ASCII code as read from the input
file when we call the
encode_symbol function. The new version of the main
program for encoding is shown in Figure encode1.00.
Of course, corresponding changes have to be made to the
main function in
the decoding program, which then looks like Figure decode1.00.
Figure char.trans shows the speed improvement resulting from this modification.
Total time: 145.061 milliseconds
Time % Function
58.980 41.8 encode_symbol(unsigned int,unsigned int) (arenc.obj)
25.811 18.3 _main (encode1.obj)
18.964 13.4 output_0(void) (arenc.obj)
18.586 13.2 output_1(void) (arenc.obj)
16.565 11.7 update_model(int,unsigned char) (adapt.obj)
2.213 1.6 start_model(void) (adapt.obj)
This speed improvement of about 15% took a total of about an hour, mostly to write and debug the character counting program and run it on a number of sample files. However, this change isn't helpful in all cases. It will speed up the handling of files that resemble those used to calculate the translation tables, but will have much less impact on those having a significantly different mix of data characters. In the worst case, it might even slow the program down noticeably; for example, a file containing many nulls, or 0 characters, might be compressed faster without translation, as encoding the most common character would then require no executions of the accumulation loop. This is an example of the general rule that knowing the makeup of the data is important in deciding how to optimize the program.
In this chapter, we have seen an example of how intelligent encoding of information can pack a lot of data into a relatively small amount of memory. In the next chapter, we will see how quantum files allow us to gain efficient random access to a large volume of variable-length textual data.
(You can find suggested approaches to problems in Chapter artopt.htm).
unsigneds, the maximum cumulative frequency is 16,383 if we wish to avoid the use of long arithmetic. This limits the dynamic range to 16,128 for the one common character and 1 for each of the other characters.
The last entry in the upgrade_threshold array is set to 255, which prevents the index from being increased beyond 15, the maximum value that can be stored in a four-bit value; char_total, being an unsigned char variable, cannot exceed 255.
high, we must wait until we examine encode_symbol (Figure arenc.02). For now, let's just say that they indicate the current state of our knowledge about the message being encoded; the closer together they are, the more bits we are ready to output.
high+1; even the authors of the reference in footnote , which is the source of the original version of this algorithm, admit that this is confusing!
high. Rest assured that it will not be left twisting in the wind; there is a function called done_encoding that makes sure that all of the remaining information is encoded in the output stream.