Under these two conditions, the stationary solution is unique. Moreover, the distribution at time $t_n$, in the limit $n \to \infty$, asymptotically converges towards the stationary solution, regardless of the initial distribution at time $t_1$. This property will be demonstrated in the second part of this appendix.
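To make this convergence concrete, here is a minimal numerical sketch for a two-state Markov chain. The transition matrix `P` and the two initial distributions below are hypothetical illustrations, not the specific chain of the appendix's example; iterating $p_{n+1} = p_n P$ from very different starting points drives both sequences to the same stationary solution.

```python
import numpy as np

# Hypothetical two-state transition matrix (rows sum to 1); the entries are
# illustrative only, not taken from the appendix's beta/M example.
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# Two very different initial distributions at time t_1.
p_a = np.array([1.0, 0.0])
p_b = np.array([0.2, 0.8])

# Iterate p_{n+1} = p_n P; both sequences approach the same stationary solution.
for _ in range(50):
    p_a = p_a @ P
    p_b = p_b @ P

print(p_a)  # ~ [0.75, 0.25]
print(p_b)  # ~ [0.75, 0.25], independent of the initial distribution
```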
Assuming that the conditions of uniqueness are satisfied in the previous example, the entropy $H(X)_{t=t_n}$ converges towards the limit:
$$
H(X)_{t=t_\infty} \equiv H_\infty
= -\mu_1 \log \mu_1 - \mu_2 \log \mu_2
= -\frac{\beta}{M}\log\frac{\beta}{M} - \left(1 - \frac{\beta}{M}\right)\log\left(1 - \frac{\beta}{M}\right). \tag{D14}
$$
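As a quick numerical check of Eq. (D14), the limit entropy can be evaluated for illustrative values of $\beta$ and $M$; the values passed below are arbitrary, and the function simply implements the right-hand side of (D14) with base-2 logarithms, so the result is in bits per symbol.

```python
import numpy as np

def h_infinity(beta, M):
    """Entropy limit of Eq. (D14) for the two-state stationary solution
    mu_1 = beta/M, mu_2 = 1 - beta/M, in bits (log base 2)."""
    mu1 = beta / M
    mu2 = 1.0 - mu1
    return -(mu1 * np.log2(mu1) + mu2 * np.log2(mu2))

print(h_infinity(beta=5, M=10))  # uniform case (beta = M/2): 1.0 bit/symbol
print(h_infinity(beta=2, M=10))  # nonuniform case: ~0.722 bit/symbol < 1
```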
It is easily verified that when the stationary solution is uniform ($\beta = M/2$), then $H_\infty = H_{\max} = \log 2 \equiv 1$ bit/symbol, which represents the maximum possible entropy for a two-state distribution (Chapter 4). In the general case where the stationary solution is nonuniform ($\beta \neq M/2$), we have, therefore, $H_\infty < H_{\max}$. This means that the system evolves towards an entropy limit that is lower than the maximum. Here comes the interesting conclusion for this first part of the appendix: assuming that the initial distribution is uniform and the stationary solution nonuniform, the entropy will converge to a value $H_\infty < H_{\max} = H(X)_{t=t_1}$. This result means that the entropy of the system decreases over time, in apparent contradiction with the second law of thermodynamics. Such a contradiction is lifted by the argument that a real physical system has no reason to be initiated with a uniform distribution, which would give maximum entropy for the initial conditions. In this case, and if the stationary distribution is uniform, the entropy will grow over time, which represents a simplified version of the second law, as we shall see in the second part. Note that the stationary distribution does not need to be uniform for the entropy to increase. The condition $H_\infty > H(X)_{t=t_1}$ is sufficient, and it is in the domain of physics, not mathematics, to prove that such a condition is representative of real physical systems.
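The decrease of entropy towards $H_\infty$ from a uniform initial distribution can be illustrated with a short simulation sketch; the transition matrix is again a hypothetical two-state example whose stationary solution is nonuniform (approximately $[0.75, 0.25]$, so $H_\infty \approx 0.81$ bit/symbol), not the $\beta/M$ chain discussed above.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy of a probability vector, in bits."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Hypothetical transition matrix with a nonuniform stationary solution
# (~[0.75, 0.25]); illustrative only.
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])

p = np.array([0.5, 0.5])  # uniform initial distribution: H = 1 bit/symbol
for n in range(20):
    print(n, round(entropy_bits(p), 4))  # entropy decreases towards ~0.8113
    p = p @ P
```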
Proving the second law of thermodynamics
The second part of this appendix provides an elegant information-theory proof of the second law of thermodynamics.² The tool used to establish this proof is the concept of relative entropy, also called the Kullback–Leibler distance, which was introduced in Chapter 5.
Considering two joint probability distributions p(x, y), q(x, y), the relative entropy
is defined as the quantity:
$$
D[\,p(x,y) \,\|\, q(x,y)\,]
= \left\langle \log\frac{p(x,y)}{q(x,y)} \right\rangle_{X,Y}
= \sum_{x \in X}\sum_{y \in Y} p(x,y)\,\log\frac{p(x,y)}{q(x,y)}. \tag{D15}
$$
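For illustration, the double sum in Eq. (D15) can be evaluated directly on two small joint distributions over a $2 \times 2$ alphabet; the matrices below are hypothetical, chosen only so that each sums to one, with $q$ taken uniform.

```python
import numpy as np

# Hypothetical joint distributions p(x, y) and q(x, y) over a 2x2 alphabet,
# given as matrices that each sum to 1 (illustrative values only).
p = np.array([[0.4, 0.1],
              [0.2, 0.3]])
q = np.array([[0.25, 0.25],
              [0.25, 0.25]])

# Relative entropy (Kullback-Leibler distance) of Eq. (D15), in bits.
D_pq = np.sum(p * np.log2(p / q))
print(D_pq)  # ~0.154 bit; nonnegative, and zero only when p = q
```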
² T. M. Cover and J. A. Thomas, Elements of Information Theory (New York: John Wiley & Sons, 1991), Ch. 2.