For a useful summary of some of the properties of upper semicontinuous functions, see Royden (1968, pp. 195–196).
Theorem 4.1 was proved by Maitra (1968) under the assumptions [A.1], [A.2], [A.4], and [A.3′]: the reward function u is a bounded, upper semicontinuous function on S × A.
6.6.2 The Controlled Semi-Markov Model
The dynamic programming problem we consider is specified by (1) a state space S (a Borel subset of a complete separable metric space), (2) an action space A (a compact metric space), (3) a reward rate r(x, a), which accrues in state x when action a is taken, (4) a distribution γ(du | x, a) of the holding time in state x when action a is taken, and (5) a transition probability q(dz | x, a) of the new state z which occurs at the end of this holding period in state x.
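To fix ideas, here is a minimal Python sketch of ingredients (2)–(5) as sampling primitives. The class name `SemiMarkovModel`, its field names, and the exponential holding times and Gaussian transition kernel are all illustrative assumptions standing in for arbitrary γ and q, not part of the model above.

```python
import random
from dataclasses import dataclass
from typing import Callable

# Hypothetical container for ingredients (2)-(5); states and actions
# are floats purely for concreteness.
@dataclass
class SemiMarkovModel:
    actions: list[float]                                  # compact action space A
    reward_rate: Callable[[float, float], float]          # r(x, a)
    sample_holding_time: Callable[[float, float], float]  # draw T ~ gamma(du | x, a)
    sample_next_state: Callable[[float, float], float]    # draw z ~ q(dz | x, a)

# One concrete instance, assuming exponential holding times and a
# Gaussian transition kernel as stand-ins for arbitrary gamma and q.
model = SemiMarkovModel(
    actions=[0.0, 0.5, 1.0],
    reward_rate=lambda x, a: -(x - a) ** 2,
    sample_holding_time=lambda x, a: random.expovariate(1.0 + a),
    sample_next_state=lambda x, a: random.gauss(a * x, 1.0),
)
```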
Informally, a policy is a rule that determines which action to take based
on past actions and past and present states. At the kth stage, the policy
chooses an action
$$\hat{a}_k = f_k(X_0, \hat{a}_0, \ldots, X_{k-1}, \hat{a}_{k-1}, X_k)$$
based on past actions $\hat{a}_0, \ldots, \hat{a}_{k-1}$ and past and present states $X_0, \ldots, X_k$. The evolution of the state with time, for a given policy
$f = \{f_0, f_1, \ldots, f_k, \ldots\}$, may be described as follows. An initial state $X_0 = x_0$ is given, and an action $\hat{a}_0 = f_0(X_0)$ is chosen; the state remains at $X_0$ for a random time $T_0$ whose distribution is $\gamma(du \mid X_0, \hat{a}_0)$, given $X_0$ and $\hat{a}_0$. After time $T_0$ a new state $X_1$ occurs with distribution $q(dz \mid X_0, \hat{a}_0)$, and, observing $X_1$, the action $\hat{a}_1 = f_1(X_0, \hat{a}_0, X_1)$ is chosen; the state remains at $X_1$ for a period $T_1$ with distribution $\gamma(du \mid X_1, \hat{a}_1)$, at the end of which (i.e., at time $T_0 + T_1$) a new state $X_2$ occurs having distribution $q(dz \mid X_1, \hat{a}_1)$, conditionally given $X_1, \hat{a}_1$.
This continues indefinitely.
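The evolution just described is precisely a simulation loop. The sketch below (reusing the hypothetical `model` from the previous sketch) draws each holding time $T_k$ from $\gamma(du \mid X_k, \hat{a}_k)$ and the next state from $q(dz \mid X_k, \hat{a}_k)$; the `policy` callable stands in for $f_k$ evaluated on the full history $(X_0, \hat{a}_0, \ldots, X_k)$, and the closing example is a stationary policy in the sense defined next.

```python
def evolve(model, policy, x0, n_stages):
    # history alternates X_0, a_0, X_1, a_1, ..., ending at the current X_k
    history = [x0]
    x, clock, path = x0, 0.0, []
    for _ in range(n_stages):
        a = policy(history)                  # a_k = f_k(X_0, a_0, ..., X_k)
        t = model.sample_holding_time(x, a)  # T_k ~ gamma(du | X_k, a_k)
        path.append((clock, x, a, t))        # entry time, state, action, holding time
        clock += t                           # next jump occurs at T_0 + ... + T_k
        x = model.sample_next_state(x, a)    # X_{k+1} ~ q(dz | X_k, a_k)
        history += [a, x]
    return path

# A stationary policy looks only at the current state (last history entry),
# e.g. choosing the available action closest to it:
stationary = lambda h: min(model.actions, key=lambda a: abs(a - h[-1]))
print(evolve(model, stationary, x0=0.0, n_stages=5))
```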
We now give a more formal description.
A policy is a sequence of functions $\zeta = \{f_0, f_1, f_2, \ldots\}$: $f_0$ is a Borel measurable map on $S$ into $A$, $f_1$ is a Borel measurable map on $S \times A \times S$ into $A$, ..., $f_k$ is a Borel measurable map on $S \times A \times S \times A \times \cdots \times S \times A \times S = (S \times A)^k \times S$ into $A$. A policy $\zeta$ is Markovian if, for each $k$, $f_k$ depends only on the last coordinate among its arguments, i.e., each $f_k$ is a Borel measurable map on $S$ into $A$. A policy $\zeta$ is stationary if