where p_min is the smallest source-symbol probability. The Huffman coding efficiency, η = H/L̂, is, thus, guaranteed to be strictly greater than 1 − ρ/L̂. In the example of Table 9.1, for instance, we find L̂ − H = 4.212 − 4.184 = 0.028, which is lower than ρ_bound = p_min + 0.08607 ≡ 0.001 + 0.08607 = 0.0871, and η = 99.33%, which is greater than 1 − 0.0871/4.212 = 97.9%. Even tighter redundancy bounds can also be found for specific probability distributions.⁶
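As a quick numerical re-check of these figures, the short Python sketch below re-evaluates the quoted quantities (H = 4.184, L̂ = 4.212, p_min = 0.001, and the bound constant 0.08607 taken as given above); it introduces no data beyond the numbers already stated in the text.

```python
# Numerical check of the Huffman efficiency bound, using the figures
# quoted above for the Table 9.1 example.

H = 4.184        # source entropy (bit/symbol)
L_hat = 4.212    # mean Huffman codeword length (bit/symbol)
p_min = 0.001    # smallest source-symbol probability

rho = L_hat - H                  # code redundancy
rho_bound = p_min + 0.08607      # redundancy bound quoted in the text
eta = H / L_hat                  # coding efficiency

print(f"redundancy  rho = {rho:.3f}   (bound = {rho_bound:.4f})")
print(f"efficiency  eta = {eta:.4f}")
print(f"efficiency bound 1 - rho_bound/L_hat = {1 - rho_bound / L_hat:.4f}")
```

Running the sketch reproduces ρ = 0.028 < 0.0871 and η ≈ 0.9934 > 0.9793, in agreement with the bounds stated above.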
9.2 Data compression
Any information source, such as texts, messages, data files, voice or music
recordings, still or motion pictures, can be coded into sequences of binary codewords. As
we have learnt, the encoding conversion can use either fixed-length or variable-length
codes. In the first case, the length of the resulting bit sequence is a multiple of
the codeword length. In the second case, the sequence can be made shorter through an
adequate choice of code, or through a second coding conversion, from a first fixed-length
code to a better, variable-length code. Converting a binary codeword sequence into a
shorter one is called data compression.
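As a toy illustration of this conversion (not taken from the text), the sketch below encodes the same short symbol string with a 2-bit fixed-length code and with a hypothetical variable-length prefix code; the four symbols, the codes, and the message are invented for the example.

```python
# Illustrative sketch: the same message encoded with a fixed-length code
# and with a variable-length prefix code favouring the most frequent symbol.

fixed_code = {"a": "00", "b": "01", "c": "10", "d": "11"}
var_code   = {"a": "0",  "b": "10", "c": "110", "d": "111"}  # prefix-free

message = "aaabacaabada"  # 'a' dominates, so the variable-length code wins

fixed_bits = "".join(fixed_code[s] for s in message)
var_bits   = "".join(var_code[s] for s in message)

print(len(fixed_bits), "bits with the fixed-length code")   # 24 bits
print(len(var_bits), "bits with the variable-length code")  # 18 bits
```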
Since the advent of the Internet and electronic mail, we know from direct personal
experience what data compression means: if compression is not used, it may seem to take
forever to upload or download files like text, slides, or pictures. Put simply, compression
is the art of packing a maximum of data into the smallest possible number of bits. It can be
achieved by the proper and optimal choice of variable-length codes. As we have learnt,
Shannon–Fano and Huffman codes are both optimal, since, in the general case, their
mean codeword length is shorter than the fixed-length value given by L = log₂ N (where N is the
number of source symbols to encode).
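To make this comparison concrete, the following sketch builds a Huffman code for a hypothetical five-symbol source (the probabilities are invented, not those of Table 8.5 or 9.1) and compares the resulting mean codeword length L̂ with both the source entropy H and the fixed-length value log₂ N.

```python
# Minimal Huffman construction for an illustrative 5-symbol source.
import heapq
from math import log2

probs = {"a": 0.4, "b": 0.25, "c": 0.15, "d": 0.12, "e": 0.08}

# Each heap entry is (probability, tie-breaker, {symbol: partial codeword}).
heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
heapq.heapify(heap)
counter = len(heap)
while len(heap) > 1:
    # Merge the two least probable subtrees, prefixing '0' and '1'.
    p1, _, c1 = heapq.heappop(heap)
    p2, _, c2 = heapq.heappop(heap)
    merged = {s: "0" + w for s, w in c1.items()}
    merged.update({s: "1" + w for s, w in c2.items()})
    heapq.heappush(heap, (p1 + p2, counter, merged))
    counter += 1
code = heap[0][2]

H = -sum(p * log2(p) for p in probs.values())             # source entropy
L_hat = sum(probs[s] * len(w) for s, w in code.items())   # mean codeword length
print(code)
print(f"H = {H:.3f}, L_hat = {L_hat:.3f}, log2(N) = {log2(len(probs)):.3f}")
```

For this toy source the mean length comes out at L̂ = 2.15 bit/symbol, below log₂ N ≈ 2.32 bit/symbol and above the entropy H ≈ 2.10 bit/symbol, as asserted above.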
The compression rate is defined by taking a fixed-length code as a reference, for
instance the ASCII code with its 7 bit/symbol codewords. If, with a given symbol source,
a variable-length code yields a shorter mean codeword length L < 7, the compression
rate is defined as r = 1 − L/7. The compression rate should not be confused with coding
efficiency. The latter measures how close the mean codeword length is to the source
entropy, which we know to be the ultimate lower bound. This observation intuitively
suggests that highly efficient codes leave little room for further compression. While correct,
this conclusion does not mean that further compression towards the limit set by
the source entropy is impossible, as we shall see in the last section of this chapter, which
is concerned with block coding.
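The distinction between the two measures can be made concrete with the sketch below, which reuses the Table 9.1 figures quoted earlier (L̂ = 4.212, H = 4.184) together with the 7-bit ASCII reference of the definition above; the variable names are arbitrary.

```python
# Compression rate is measured against a fixed-length reference (7-bit
# ASCII here), while coding efficiency is measured against the entropy H.
# The numbers reuse the Table 9.1 figures quoted in this chapter.

L_hat = 4.212   # mean codeword length of the Huffman code (bit/symbol)
H = 4.184       # source entropy (bit/symbol)
L_ref = 7       # fixed-length ASCII reference (bit/symbol)

r = 1 - L_hat / L_ref   # compression rate relative to the 7-bit reference
eta = H / L_hat         # coding efficiency relative to the entropy bound

print(f"compression rate  r   = {r:.1%}")    # about 39.8%
print(f"coding efficiency eta = {eta:.1%}")  # about 99.3%
```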
The previous examples with the A–Z English-character source shown in Table 8.5
(Shannon–Fano coding) and Table 9.1 (Huffman coding) represent good illustrations
of the data compression concept. Here, we shall take as a comparative reference a
codeword length of five bits, which is sufficient to designate the restricted set of 26 lower-case
or upper-case letters (2⁵ = 32 ≥ 26), rather than the full 7-bit ASCII code. With the results
from the two aforementioned examples, namely
⁶ See: R. G. Gallager, Variations on a theme by Huffman. IEEE Trans. Inform. Theory, 24 (1978), 668–74.