9.2 Data compression 159
to 34.7%. A closer look at the table data shows that the factor that appears to increase the
compression ratio is the frequency spread in the top group of most frequent characters.
If the most frequent characters have dissimilar frequencies, then shorter codewords can
be assigned to a larger number of symbols. We observe that the first three datafiles,
corresponding to nontypical English texts, lend themselves to greater compression than
the fourth datafile, corresponding to ordinary English. There is no need to go through
tedious statistics to conclude beforehand that increasing the length of such English-text
datafiles would give compression ratios increasingly closer to the limit of
˜
r = 15.76%.
Clearly, this is because the probability distribution of long English-text sequences will
duplicate with increasing fidelity the standard distribution for which we have found this
compression limit. On the other hand, shorter sequences of only a few characters might
have significantly higher compression ratios. To take an extreme example, the datafile
“AAA” (for American Automobile Association) takes 1 bit/symbol and, thus, has a
compression ratio of r = 1 −1/5 = 80%. If we take the full ASCII code for reference
(7 bit/character), the compression becomes r = 1 −1/7 = 85.7%.
7
The above examples have shown that for any given datafile, there exists an optimal
(Huffman) code that achieves maximum data compression. As we have seen, the code-
word assignment is different for each datafile to be compressed. Therefore, one needs to
keep track of which code is used for compression in order to be able to recover the orig-
inal, uncompressed data. This information, which we refer to as overhead, must then be
transmitted along with the compressed data, which we refer to as payload. Since the over-
head bits reduce the effective compression rate, it is clear that the overhead size should
be the smallest possible relatively to the payload. In the previous examples, the overhead
is simply the one-to-one correspondence table between codewords and symbols. Using
five-bit (ASCII) codewords to designate each of the character symbols, and a five-bit
field to designate the corresponding compressed codewords makes a ten-bit overhead
per datafile symbol. Taking, for instance, datafile 3 (Table 9.2), there are 13 symbols,
which produces 130 bits of overhead. It is easily calculated that the payload represents
111 bits, which leads to a total of 130 + 111 = 241 bits for the complete compressed
file (overhead + payload). In contrast, a five-bit ASCII code for the same uncompressed
datafile would represent only 170 bits, as can also be easily verified. The compressed
file thus turns out to be 40% bigger than the uncompressed one! The conclusion is that
7
This consideration illustrates the interest of acronyms. Their primary use is to save text space, easing up
reading and avoiding burdensome redundancies. This is particularly true with technical papers, where the
publication space is usually limited. An equally important use of acronyms is to capture abstract concepts into
small groups of characters, for instance ADSL (asymmetric digital subscriber line) or HTML (hypertext
markup language). The most popular acronyms are the ones that are easy to remember, such as FAQ
(frequently asked questions), IMHO (in my humble opinion), WYSIWYG (what you see is what you get),
NIMBY (not in my backyard), and the champion VERONICA (very easy rodent-oriented netwide index to
computerized archives), for instance. The repeated use of acronyms makes them progressively accepted as
true English words or generic brand names, to the point that their original character-to-word correspondence
is eventually forgotten by their users, for instance: PC for personal computer, GSM for global system for
mobile [communications], LASER for light amplification by stimulated emission of radiation, NASDAQ for
National Association of Securities Dealers Automated Quotations, etc. Language may thus act a natural self-
compression machine, which uses the human mind as a convenient dictionary. In practice, this dictionary is
only rarely referred to, since the acronym gains its own meaning by repeated use.