4.3 Language entropy
Here, we shall see how Shannon's entropy can be applied to analyze languages, as
primary sources of word symbols, and to make interesting comparisons between different
types of language.
As opposed to dialects, human languages have, throughout history, possessed some
written counterpart. Most of these language "scripts" are based on a unique, finite-size
alphabet of symbols, which one is trained from an early age to recognize and manipulate
(and sometimes must learn later, the hard way!).⁷
Here, I will conventionally call the symbols
"characters," and their event set the "alphabet." This is not to be confused with the
"language source," which represents the set of all words that can be constructed from
said alphabet. As we experience from early school onwards, not all characters and words are
born equal. Some characters and words are more likely to be used; others are more rarely
seen. In European tongues, characters such as X or Z are used relatively seldom,
while A or E are comparatively frequent, a fact that we will analyze further on.
However, the statistics of characters and words are also largely context-
dependent. Indeed, it is clear that a political speech, a financial report, a mortgage
contract, an inventory of botanical species, a thesis on biochemistry, or a submarine's
operating manual (etc.) may exhibit statistics quite different from those of ordinary texts! This
observation does not diminish the fact that, within a language source (the set of all
possible words, or character arrangements therein), words and characters are not all
treated equally. To reach the fundamentals through a high-level analysis of language, let
us consider just the basic character statistics.
A first observation is that in any language, the probability distribution of characters
(PDF), as derived from any literature survey, is not strictly unique. Indeed, the PDF depends
not only on the type of literature surveyed (e.g., newspapers, novels, dictionaries, tech-
nical reports) but also on the epoch considered. Additionally, in any given language
practiced worldwide, one may expect significant qualitative differences. The Continen-
tal and North-American variations of English, or the French used in Belgium, Quebec,
or Africa, are not strictly the same, owing to the rich variety of local words, expressions,
idioms, and literature.
A possible PDF for English alphabetical characters, compiled in 1942,⁸ is
shown in Fig. 4.2. We first observe from the figure that the discrete PDF nearly obeys an
exponential law. As expected, the space character (sp) is the most frequent (18.7%). It is
followed by the letters E, T, A, O, and N, whose occurrence probabilities decrease from
10.7% to 5.8%. The entropy calculation for this source yields H = 4.065 bit/symbol. If we
remove the space character, the most likely symbol (and one whose frequency is not
linguistically meaningful), and renormalize the remaining probabilities, the entropy
increases to H = 4.140 bit/symbol.
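To make the calculation concrete, here is a minimal Python sketch of the entropy
computation H = -Σ pᵢ log₂ pᵢ for a character source, with and without the space
character. Since Pratt's exact table is not reproduced here, the script assumes a
commonly quoted set of approximate English letter frequencies, so its output only
comes close to the 4.065 and 4.140 bit/symbol figures quoted above.

import math

# Approximate English letter frequencies (percent), as commonly quoted
# in the literature -- an illustrative stand-in, not Pratt's 1942 table.
letter_pct = {
    "E": 12.70, "T": 9.06, "A": 8.17, "O": 7.51, "I": 6.97, "N": 6.75,
    "S": 6.33, "H": 6.09, "R": 5.99, "D": 4.25, "L": 4.03, "C": 2.78,
    "U": 2.76, "M": 2.41, "W": 2.36, "F": 2.23, "G": 2.02, "Y": 1.97,
    "P": 1.93, "B": 1.49, "V": 0.98, "K": 0.77, "J": 0.15, "X": 0.15,
    "Q": 0.10, "Z": 0.07,
}

def entropy(weights):
    # Shannon entropy H = -sum(p * log2 p), in bit/symbol.
    # Weights are normalized internally, so they need not sum to 1.
    total = sum(weights)
    return -sum(w / total * math.log2(w / total) for w in weights if w > 0)

# Source including the space character at 18.7%; the letters then share
# the remaining 81.3% of the probability in the proportions above.
with_space = [18.7] + [0.813 * p for p in letter_pct.values()]
print(f"H with space   : {entropy(with_space):.3f} bit/symbol")

# Letters only: removing the dominant space symbol and renormalizing
# flattens the distribution, so the entropy increases, as in the text.
print(f"H letters only : {entropy(letter_pct.values()):.3f} bit/symbol")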
⁷ The "alphabet" of symbols, meaning here the list of distinct ways of forming characters, or voice sounds,
or word prefixes, roots, and suffixes, or even full words, may yet be quite large, as the phenomenally rich
Chinese and Japanese languages illustrate.
⁸ F. Pratt, Secret and Urgent (Indianapolis: The Bobbs-Merrill Book Company, 1942). Cited in J. C. Hancock,
An Introduction to the Principles of Communication Theory (New York: McGraw-Hill, 1961).