CSE 250a. Assignment 9
Reading: Sutton & Barto, Chapters 3.1-4.4
9.1 Effective horizon time
Consider a Markov decision process (MDP) whose rewards r_t ∈ [0, 1] are bounded between zero and one. Let h = (1 − γ)^{-1} define an effective horizon time in terms of the discount factor 0 ≤ γ < 1. Consider the approximation to the (infinite horizon) discounted return,
r_0 + γ r_1 + γ^2 r_2 + γ^3 r_3 + γ^4 r_4 + . . . ,
obtained by neglecting rewards from some time t and beyond. Recalling that log γ ≤ γ − 1, show that the error from such an approximation decays exponentially as:
Σ_{n≥t} γ^n r_n ≤ h e^{−t/h}.
Thus, we can view MDPs with discounted returns as similar to MDPs with finite horizons, where the finite horizon h = (1 − γ)^{-1} grows as γ → 1. This is a useful intuition for proving the convergence of many algorithms in reinforcement learning.
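For a quick numerical illustration of the bound (not a substitute for the proof), the short Python sketch below compares the worst-case tail Σ_{n≥t} γ^n (all rewards equal to one) against h e^{−t/h}; the particular values of gamma and t are arbitrary choices for the check.

    import math

    # Sanity check of the tail bound (not a proof): with worst-case rewards r_n = 1,
    # the neglected tail is sum_{n >= t} gamma^n = gamma^t / (1 - gamma).
    # The bound claims this is at most h * exp(-t/h), where h = 1 / (1 - gamma).
    for gamma in [0.5, 0.9, 0.99]:
        h = 1.0 / (1.0 - gamma)
        for t in [0, 1, 5, 20, 100]:
            tail = gamma**t / (1.0 - gamma)     # exact worst-case tail
            bound = h * math.exp(-t / h)        # claimed exponential bound
            assert tail <= bound + 1e-12
            print(f"gamma={gamma:.2f}  t={t:3d}  tail={tail:12.6f}  bound={bound:12.6f}")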
9.2 Three-state, two-action MDP
Consider the Markov decision process (MDP) with three states s ∈ {1, 2, 3}, two actions a ∈ {↑, ↓}, discount factor γ = 2/3, and rewards and transition matrices as shown below:
s R(s)
1 -15
2 30
3 -25
s   s'   P(s'|s, a=↑)
1   1    3/4
1   2    1/4
1   3    0
2   1    1/2
2   2    1/2
2   3    0
3   1    0
3   2    3/4
3   3    1/4
s   s'   P(s'|s, a=↓)
1   1    1/4
1   2    3/4
1   3    0
2   1    0
2   2    1/2
2   3    1/2
3   1    0
3   2    1/4
3   3    3/4
(a) Policy evaluation
Consider the policy π that chooses the action shown in each state. For this policy, solve the linear system of Bellman equations (by hand) to compute the state-value function V^π(s) for s ∈ {1, 2, 3}. Your answers should complete the following table. (Hint: the missing entries are integers.) Show your work for full credit.
s   π(s)   V^π(s)
1   ↑      -18
2   ↑
3   ↓
(b) Policy improvement
Compute the greedy policy π′(s) with respect to the state-value function V^π(s) from part (a). Your answers should complete the following table. Show your work for full credit.
s   π(s)   π′(s)
1   ↑
2   ↑
3   ↓
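For reference, the short Python sketch below can be used to check the hand calculations in parts (a) and (b) numerically; it simply transcribes the reward and transition tables above and is no substitute for showing your work.

    import numpy as np

    gamma = 2.0 / 3.0
    R = np.array([-15.0, 30.0, -25.0])            # rewards R(s) for s = 1, 2, 3

    # Transition matrices P[s, s'] copied from the tables above (rows sum to one).
    P_up   = np.array([[3/4, 1/4, 0.0],
                       [1/2, 1/2, 0.0],
                       [0.0, 3/4, 1/4]])
    P_down = np.array([[1/4, 3/4, 0.0],
                       [0.0, 1/2, 1/2],
                       [0.0, 1/4, 3/4]])

    # Policy from part (a): up in states 1 and 2, down in state 3.
    P_pi = np.array([P_up[0], P_up[1], P_down[2]])

    # Policy evaluation: solve (I - gamma * P_pi) V = R for V^pi.
    V = np.linalg.solve(np.eye(3) - gamma * P_pi, R)
    print("V^pi =", V)

    # Policy improvement: greedy action in each state with respect to V^pi.
    Q = R[:, None] + gamma * np.stack([P_up @ V, P_down @ V], axis=1)
    print("greedy policy:", [["up", "down"][a] for a in np.argmax(Q, axis=1)])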
9.3 Value function for a random walk
Consider a Markov decision process (MDP) with discrete states s ∈ {0, 1, 2, . . . , ∞} and rewards R(s) = s that grow linearly as a function of the state. Also, consider a policy π whose action in each state either leaves the state unchanged or yields a transition to the next highest state:
P(s'|s, π(s)) =
    2/3   if s' = s
    1/3   if s' = s + 1
    0     otherwise
Intuitively, this policy can be viewed as a right-drifting random walk. As usual, the value function for this policy, V^π(s) = E[ Σ_{t=0}^∞ γ^t R(s_t) | s_0 = s ], is defined as the expected sum of discounted rewards starting from state s, with discount factor 0 ≤ γ < 1.
(a) Assume that the value function V^π(s) satisfies a Bellman equation analogous to the one in MDPs with finite state spaces. Write down the Bellman equation satisfied by V^π(s).
(b) Show that one possible solution to the Bellman equation in part (a) is given by the linear form V^π(s) = as + b, where a and b are coefficients that you should express in terms of the discount factor γ. (Hint: substitute this solution into both sides of the Bellman equation, and solve for a and b by requiring that both sides are equal for all values of s.)
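If you want to sanity-check your expressions for a and b, one optional approach is a truncated Monte Carlo estimate of V^π(s). In the sketch below, the choices of gamma, the start state, the episode horizon, and the placeholder values of a and b are all assumptions for you to replace with your own.

    import random

    gamma = 0.9    # any discount factor in [0, 1) will do for the check

    def mc_value(s, n_episodes=20000, horizon=200):
        """Truncated Monte Carlo estimate of V^pi(s) for the right-drifting walk."""
        total = 0.0
        for _ in range(n_episodes):
            state, discount = s, 1.0
            for _ in range(horizon):
                total += discount * state           # reward R(s) = s
                if random.random() < 1.0 / 3.0:     # move to s + 1 with probability 1/3
                    state += 1
                discount *= gamma
        return total / n_episodes

    s = 5
    a, b = 0.0, 0.0    # placeholders: substitute your own expressions in terms of gamma
    print("Monte Carlo estimate:", mc_value(s), "  linear form a*s + b:", a * s + b)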
9.4 Policy and value iteration
In this problem, you will use policy and value iteration to find the optimal policy of the MDP demonstrated
in class. This MDP has |S| = 81 states, |A| = 4 actions, and discount factor γ = 0.9925. Download the
ASCII files on the course web site that store the transition matrices and reward function for this MDP. The
transition matrices are stored in a sparse format, listing only the row and column indices with non-zero
values; if loaded correctly, the rows of these matrices should sum to one.
(a) Compute the optimal policy π*(s) and optimal value function V*(s) of the MDP using the method of policy iteration. (i) Examine the non-zero values of V*(s), and compare your answer to the numbered maze shown below. The correct solution will have positive values at all the numbered squares and negative values at all the squares with dragons. Fill in the correspondingly numbered squares of the maze with your answers for the optimal value function. Turn in a copy of your solution for V*(s) as visualized in this way. (ii) Interpret the four actions in this MDP as (attempted) moves to the WEST, NORTH, EAST, and SOUTH. Fill in the correspondingly numbered squares of the maze (on a separate print-out) with arrows that point in the directions prescribed by the optimal policy. Turn in a copy of your solution for π*(s) as visualized in this way.
(b) Compute the optimal state value function V*(s) using the method of value iteration. For the numbered squares in the maze, does it agree with your result from part (a)? (It should.) Use this check to make sure that your answers from value iteration are correct to at least two decimal places.
(c) Turn in your source code for the above questions. As usual, you may program in the language of
your choice.
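As a starting point, here is a minimal value-iteration sketch in Python. It assumes you have already parsed the ASCII files into a length-|S| reward array R and a dense array P of shape |A| × |S| × |S| whose rows sum to one; the parsing step, and the analogous policy-iteration loop, are left to you.

    import numpy as np

    def value_iteration(P, R, gamma=0.9925, tol=1e-10):
        """Value iteration: V(s) <- R(s) + gamma * max_a sum_s' P(s'|s,a) V(s')."""
        n_actions, n_states, _ = P.shape
        V = np.zeros(n_states)
        while True:
            Q = R + gamma * (P @ V)                  # Q has shape (n_actions, n_states)
            V_new = Q.max(axis=0)
            if np.max(np.abs(V_new - V)) < tol:
                return V_new, Q.argmax(axis=0)       # optimal values and a greedy policy
            V = V_new

Policy iteration can reuse the same arrays: alternately solve (I − γ P_π)V = R for the current policy, as in problem 9.2, and update the policy greedily until it stops changing.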
9.5 Convergence of iterative policy evaluation
Consider an MDP with transition matrices P(s'|s, a) and reward function R(s). In class, we showed that the state value function V^π(s) for a fixed policy π satisfies the Bellman equation:

V^π(s) = R(s) + γ Σ_{s'} P(s'|s, π(s)) V^π(s').
For γ < 1, one method to solve this set of linear equations is by iteration. Initialize the state value function by V_0(s) = 0 for all states s. The update rule at the k-th iteration is given by:

V_{k+1}(s) ← R(s) + γ Σ_{s'} P(s'|s, π(s)) V_k(s').
Use a contraction mapping to derive an upper bound on the error Δ_k = max_s |V_k(s) − V^π(s)| after k iterations of the update rule. Your result should show that the error Δ_k decays exponentially fast in the number of iterations, k, and hence that lim_{k→∞} V_k(s) = V^π(s) for all states s.
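The derivation is what is asked for, but you can also watch the error shrink empirically. The sketch below runs the update on a small example; it assumes P_pi, R, and gamma are defined as in the snippet after problem 9.2, or come from any other fixed-policy MDP of your choosing.

    import numpy as np

    def iterative_policy_evaluation(P_pi, R, gamma, n_iters=50):
        """Run V_{k+1} <- R + gamma * P_pi V_k from V_0 = 0 and report Delta_k."""
        V_exact = np.linalg.solve(np.eye(len(R)) - gamma * P_pi, R)
        V = np.zeros(len(R))
        for k in range(1, n_iters + 1):
            V = R + gamma * (P_pi @ V)
            delta = np.max(np.abs(V - V_exact))
            print(f"k={k:2d}  Delta_k={delta:.6e}")   # shrinks by roughly a factor of gamma per step
        return V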
9.6 Stochastic approximation
In this problem, you will analyze an incremental update rule for estimating the mean µ = E[X] of a
random variable X from samples {x_1, x_2, x_3, . . .}. Consider the incremental update rule:

µ_k ← µ_{k−1} + α_k (x_k − µ_{k−1}),

with the initial condition µ_0 = 0 and step sizes α_k = 1/k.
(a) Show that the step sizes α_k = 1/k obey the conditions (i) Σ_{k=1}^∞ α_k = ∞ and (ii) Σ_{k=1}^∞ α_k^2 < ∞, thus guaranteeing that the update rule converges to the true mean. (You are not required to prove convergence.)
(b) Show that for this particular choice of step sizes, the incremental update rule yields the same result as the sample average: µ_k = (1/k)(x_1 + x_2 + . . . + x_k). Hint: use induction.
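A quick empirical check of the claim in part (b) is sketched below (the induction proof is still required); the sample data is just randomly generated for the illustration.

    import random

    samples = [random.gauss(0.0, 1.0) for _ in range(1000)]

    mu = 0.0                               # mu_0 = 0
    for k, x in enumerate(samples, start=1):
        mu += (1.0 / k) * (x - mu)         # mu_k <- mu_{k-1} + alpha_k (x_k - mu_{k-1})

    # The incremental estimate should agree with the sample average up to floating-point error.
    print(mu, sum(samples) / len(samples))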
9.7 Course evaluation
Please complete the online course evaluation forms. The forms are completely anonymous, and the
summaries of evaluations are not made available to instructors until after grades are submitted. The
results will play an important role in the future of the course—whether it continues to be offered in the
fall, whether it continues to be allocated resources for two sections, whether it continues to be open
to students outside CSE, whether it continues to be taught by the same instructor, etc.