6.1 Entropy of continuous sources 87
The various relations and properties that were obtained in the discrete case between
entropy, conditional entropy, joint entropy, relative entropy, and mutual information
also apply to the continuous case. To recall, for convenience, these relations and
properties are:
H(X, Y) = H(Y|X) + H(X)
        = H(X|Y) + H(Y),                                  (6.9)

H(X; Y) = H(X) − H(X|Y)
        = H(Y) − H(Y|X)                                   (6.10)
        = H(X) + H(Y) − H(X, Y),

D(X, Y) = H(X, Y) − H(X; Y)
        = H(X|Y) + H(Y|X),                                (6.11)

D(p‖q) ≥ 0,                                               (6.12)

H(X; Y) = D[p(x, y)‖p(x)p(y)] ≥ 0,                        (6.13)

D[p(x, y)‖q(x, y)] = D[p(x)‖q(x)] + D[p(y|x)‖q(y|x)].     (6.14)
In particular, it follows from Eqs. (6.13) and (6.10) that H(X |Y ) ≤ H (X ) and
H(Y |X ) ≤ H (Y ), with equality if the sources are independent. Thus for continuous
sources, conditioning reduces differential entropy, just as in the discrete case.
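As a quick sanity check, the identities in Eqs. (6.10), (6.12), and (6.13) can be verified numerically. The sketch below uses a small discrete joint distribution (chosen arbitrarily for illustration), since the same relations hold in both the discrete and continuous cases:

```python
import math

# Illustrative 2x2 joint distribution p(x, y) (arbitrary choice)
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def H(dist):
    """Shannon entropy, in bits, of a probability mapping."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Marginals p(x) and p(y)
p_x, p_y = {}, {}
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

H_xy, H_x, H_y = H(p_xy), H(p_x), H(p_y)

# Mutual information via Eq. (6.10): H(X;Y) = H(X) + H(Y) - H(X,Y)
I_xy = H_x + H_y - H_xy

# Equivalent KL form, Eq. (6.13): D[p(x,y) || p(x)p(y)]
D = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())

assert abs(I_xy - D) < 1e-12  # Eqs. (6.10) and (6.13) agree
assert D >= 0                 # Eq. (6.12): relative entropy is nonnegative
```

Since D = H(X) − H(X|Y) here is strictly positive, the check also illustrates that conditioning on Y reduces the entropy of X whenever the two sources are dependent.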
I shall next describe a few examples of PDFs that lend themselves to closed-form, or analytical, definitions of differential entropy, with some illustrations regarding relative entropy, or KL distance.
Consider first the continuous uniform distribution, defined over the real interval of width u = b − a, Eq. (6.8). As we have seen, the corresponding entropy is H_uniform = log2(b − a) bit/symbol. We note that the entropy H_uniform is nonpositive if u ≤ 1.
The result shows that the entropy of a continuous uniform distribution of width u increases as the logarithm of u. In the particular case where u = 2^N, with N being an integer, H_uniform = N bit/symbol. In the limit u, N → ∞, or p(x) = 1/u = 2^−N → 0, corresponding to a uniform distribution of infinite width, the entropy is infinite, corresponding to an infinite number of degrees of freedom for source events having themselves an infinite information, I(x) = −log2[p(x)] = N bits. We thus observe that, short of any constraints on the definition interval, or PDF mean, the entropy is unbounded, or H_uniform → +∞ as N → ∞.
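The scaling H_uniform = log2(u), which equals N bit/symbol when u = 2^N, is straightforward to verify numerically; a minimal sketch:

```python
import math

def h_uniform(a, b):
    """Differential entropy, in bit/symbol, of a uniform PDF on [a, b]."""
    return math.log2(b - a)

assert h_uniform(0, 8) == 3.0    # width u = 2^3 gives H = 3 bit/symbol
assert h_uniform(0, 1) == 0.0    # unit width gives zero entropy
assert h_uniform(0, 0.5) < 0     # width u <= 1 gives nonpositive entropy
```

The last assertion illustrates a key difference from the discrete case: differential entropy can be negative.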
We may compute the relative entropy, or KL distance, between any continuous PDF p(x) defined over the domain X = [a, b], with u = (b − a) = 2^N, and the corresponding uniform PDF, which we now call q(x) = 1/u = 2^−N, according to Eq. (6.5):
D[p(x)‖q(x)] = ∫_X p(x) log [p(x)/q(x)] dx
             = ∫_X p(x) log [2^N p(x)] dx
             = ∫_X p(x) {log(2^N) + log p(x)} dx          (6.15)
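Since log(2^N) = N and the remaining term is −H(X), Eq. (6.15) reduces to D = N − H(X). This can be checked numerically by a Riemann sum; the sketch below uses a triangular PDF p(x) = 2x/u² on [0, u], an arbitrary choice made here only for illustration:

```python
import math

N = 3
u = 2 ** N          # interval width u = 2^N
M = 200_000         # number of integration steps
dx = u / M

D = 0.0             # KL distance D[p || q] = ∫ p log2(p/q) dx, with q = 1/u
h = 0.0             # differential entropy H(X) = -∫ p log2(p) dx
for i in range(M):
    x = (i + 0.5) * dx        # midpoint rule
    p = 2 * x / u**2          # illustrative triangular PDF on [0, u]
    D += p * math.log2(p * u) * dx
    h += -p * math.log2(p) * dx

# Eq. (6.15): D = log2(2^N) - H(X) = N - H(X)
assert abs(D - (N - h)) < 1e-6
assert D >= 0   # consistent with Eq. (6.12)
```

As expected, D is strictly positive here, since the triangular PDF differs from the uniform reference q(x).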