CS7643: Deep Learning
Homework 4
Instructions
1. We will be using Gradescope to collect your assignments. Please read the following instructions
for submitting to Gradescope carefully!
• Each problem/sub-problem should be on one or more pages. This assignment has 5 total
problems/sub-problems, so you should have at least 5 pages in your submission.
• When submitting to Gradescope, make sure to mark which page corresponds to each
problem/sub-problem.
• For the coding problem (problem 5), please follow the instructions in the HW4 webpage
for submitting your jupyter notebook PDFs as well as your code zip files.
• Note: This is a large class and Gradescope’s assignment segmentation features are essential. Failure to follow these instructions may result in parts of your assignment not
being graded. We will not entertain regrading requests for failure to follow instructions.
Please read https://stats200.stanford.edu/gradescope_tips.pdf for additional information on submitting to Gradescope.
2. LaTeX’d solutions are strongly encouraged (solution template available at
cc.gatech.edu/classes/AY2019/cs7643_fall/assets/sol0.tex), but scanned handwritten copies
are acceptable. Hard copies are not accepted.
1 Optimal Policy and Value Function [15 points]
Consider a simple 2-state MDP shown in Figure 1, S = {S1, S2}. From each state, there are two
available actions A = {stay, go}. The reward received for taking an action at a state is shown in
the figure for each (s, a) pair. Also, taking action go from state S2 ends the episode. All transitions
are deterministic.
Figure 1: 2-state MDP
(a) Consider a policy that always chooses the stay action at every state. Compute the sum of discounted rewards obtained by this policy starting at state S1, assuming a discount of γ, i.e. compute $\sum_{t=0}^{\infty} \gamma^t\, r_t(s_t, a_t)$. [4 points]
(b) What is the optimal policy? Write it as a tuple (a1, a2) of the optimal actions at (s1, s2)
respectively. Compute the sum of discounted rewards obtained by this policy, given that the
start state is S1, with discount γ. [4 points]
(c) Recall that one update of the value iteration algorithm performs the following operation, where r(s, a) denotes the reward for taking action a at state s.
For each s ∈ S:

$$V^{i+1}(s) \leftarrow \max_{a} \sum_{s'} p(s' \mid s, a)\,\big[\, r(s, a) + \gamma V^{i}(s') \,\big] \qquad (1)$$
The value function V is stored as an array of size |S| (|S| = 2 in our case). For example, the array $V^0 = [0, 1]$ denotes that $V^0(s_1) = 0$, $V^0(s_2) = 1$. Starting with an initial $V^0 = [0, 0]$, perform the value iteration update to compute $V^1$, $V^2$, $V^3$. What is the optimal $V^*$? [7 points]
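For reference, the update in Eq. 1 is easy to run on a small tabular MDP. Below is a minimal NumPy sketch of a few value-iteration sweeps; the transition tensor P and reward matrix R are illustrative placeholders (a generic deterministic 2-state, 2-action MDP), not the MDP of Figure 1, so the printed values are not the answer to this part.

import numpy as np

# Minimal sketch of the value-iteration update in Eq. 1 (not the MDP of Figure 1).
# P[s, a, s'] and R[s, a] below are hypothetical placeholders for a generic
# deterministic 2-state, 2-action MDP.
num_states, num_actions = 2, 2
P = np.zeros((num_states, num_actions, num_states))
P[:, 0, :] = np.eye(num_states)            # action 0: stay in the current state
P[:, 1, :] = np.eye(num_states)[::-1]      # action 1: go to the other state
R = np.array([[1.0, 0.0],                  # R[s, a]: illustrative rewards only
              [0.0, 2.0]])
gamma = 0.9

V = np.zeros(num_states)                   # V^0 = [0, 0]
for i in range(3):                         # compute V^1, V^2, V^3
    Q = R + gamma * (P @ V)                # Q[s, a] = sum_s' P[s,a,s'] (R[s,a] + gamma V[s'])
    V = Q.max(axis=1)                      # V^{i+1}(s) = max_a Q[s, a]
    print(f"V^{i+1} =", V)

Note that Figure 1's episode-ending go transition from S2 is not modeled by these placeholder arrays; a terminal transition would contribute no $\gamma V^i(s')$ term.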
2 Value Iteration Convergence [25 points + 5 bonus points]
In part (c) of the previous question, we computed $V^i$ for a few updates of value iteration and also $V^*$, the optimal value function. What happened to the "error" in the value estimate at every update, i.e. did $|V^i(s) - V^*(s)|$ ever increase for a particular state s? The answer is yes: the value estimate for any s may fluctuate before eventually converging to $V^*(s)$. If this is the case, do we have any hope that the value iteration algorithm will converge to $V^*$? In this question, we will prove that value iteration indeed converges due to the monotonic decrease of a different error: the maximum error over all states, $\max_{s \in S} |V^i(s) - V^*(s)|$, or equivalently the $L_\infty$ norm $\|V^i - V^*\|_\infty$.
(a) For $V^1$, $V^2$, $V^3$ obtained in 1(c), compute $\|V^i - V^*\|_\infty$ and verify that this error decreased monotonically. [5 points]
(b) Let $T : \mathbb{R}^{|S|} \to \mathbb{R}^{|S|}$ be the function that takes the vector $V^i$ as input and applies the value iteration update to produce $V^{i+1}$. We will show that for any two vectors $V, V' \in \mathbb{R}^{|S|}$, this operator contracts the $L_\infty$ distance between them by a factor of γ.
Prove that $\|T(V) - T(V')\|_\infty \leq \gamma\, \|V - V'\|_\infty$ for any $V, V' \in \mathbb{R}^{|S|}$. [10 points]
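Before attempting the proof, the contraction claim can be sanity-checked numerically: apply the value-iteration operator T to two arbitrary value vectors and compare the $L_\infty$ distances. The sketch below uses a hypothetical random MDP (the arrays P and R and the discount gamma are invented for illustration); it is an empirical check of the inequality, not a substitute for the proof.

import numpy as np

rng = np.random.default_rng(0)

def bellman_update(V, P, R, gamma):
    # One value-iteration update T(V), as in Eq. 1.
    return (R + gamma * (P @ V)).max(axis=1)

# Hypothetical random MDP with |S| = 4 states and |A| = 3 actions.
S, A = 4, 3
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)      # normalize rows into valid distributions
R = rng.random((S, A))
gamma = 0.9

V, V_prime = rng.normal(size=S), rng.normal(size=S)
lhs = np.max(np.abs(bellman_update(V, P, R, gamma) - bellman_update(V_prime, P, R, gamma)))
rhs = gamma * np.max(np.abs(V - V_prime))
print(lhs <= rhs)                      # the contraction inequality should hold: True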
(c) Now consider the sequence of value functions $\{V^n\}$ obtained by iteratively performing the value iteration update as per Eq. 1. This sequence has the interesting property of being a Cauchy sequence, i.e. the elements of the sequence become arbitrarily close to each other as the sequence progresses. Formally, we say a sequence is Cauchy w.r.t. the $L_\infty$ norm if for every positive number ε there is a positive integer N s.t. for all positive integers m, n greater than N, $\|x_m - x_n\|_\infty < \varepsilon$. Equipped with this fact, show that the error of $V^{n+1}$ w.r.t. the optimal value function can be bounded as: $\forall \varepsilon > 0$, $\exists N > 0$ s.t. $\forall n > N$,

$$\|V^{n+1} - V^*\|_\infty \leq \frac{\gamma}{1 - \gamma}\, \varepsilon \qquad (2)$$

[10 points]
Hint: First try to prove that $\|V^{n+1} - V^*\|_\infty \leq \frac{\gamma}{1 - \gamma} \|V^n - V^{n+1}\|_\infty$, then apply the Cauchy sequence condition on $V^n$.
Discussion: Given Equation 2, we now have a guarantee that value iteration converges to $V^*$. However, we made the assumption that $\{V^n\}$ is a Cauchy sequence. In the bonus question below, we will prove that the contraction condition from (b) is sufficient to conclude that $\{V^n\}$ is indeed a Cauchy sequence.
(d) [BONUS] Given a function $T : \mathbb{R}^n \to \mathbb{R}^n$ such that $\|T(x_1) - T(x_2)\|_\infty \leq \alpha \|x_1 - x_2\|_\infty$ with $\alpha \in [0, 1)$, prove first that T has a fixed point, i.e. $\exists x^* : T(x^*) = x^*$, and second that the fixed point of T is unique. [5 bonus points]
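For intuition about this bonus part, it is the standard Banach fixed-point setting: repeatedly applying any α-contraction from any starting point converges to a unique fixed point. Below is a toy numerical illustration with a hand-picked contraction on $\mathbb{R}^2$ (unrelated to the MDP above); the map T and the constant c are assumptions made purely for the demo.

import numpy as np

# Toy alpha-contraction on R^2 under the sup norm: T(x) = 0.5 * x + c,
# so ||T(x1) - T(x2)||_inf = 0.5 * ||x1 - x2||_inf (alpha = 0.5).
c = np.array([1.0, -2.0])
T = lambda x: 0.5 * x + c

x = np.array([10.0, 10.0])     # arbitrary starting point
for _ in range(60):            # repeated application of the contraction
    x = T(x)

print(x)                       # approaches the unique fixed point x* = 2c = [2, -4]
print(np.allclose(T(x), x))    # T(x*) = x*  ->  True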
3 Learning the Model [20 bonus points]
This is a challenging question with all sub-parts as bonus credit.
While value iteration (VI) allows us to find the optimal policy, performing VI updates (Eq. 1) requires the transition function $T(s, a) := p(s' \mid s, a)$¹ and the reward function R(s, a). In many practical situations, these quantities are unknown. However, we can step through the environment and observe transitions to collect data of the form (s, a, s', r). Using this data, if we can obtain an estimate of the transition and reward functions, we can hope to still find the optimal policy via value iteration. Let M denote the true (unknown to us) model of the world (a model consists of both a transition and a reward function), and let $\hat{M}$ represent our approximate model of the world that we estimate from data.
In this question, given estimates of both the transition function $T(s, a) = p(s' \mid s, a) \in \Delta^{|S|-1}$ and the reward function R(s, a), we want to study the error of acting optimally in this estimated MDP as opposed to acting optimally in the "true" MDP. Naturally, we should expect this error to shrink as our estimates improve, which in turn depends on how many observations we make.
¹ We will denote T(s, a, s') as the probability value of s' given (s, a), whereas T(s, a) denotes the entire probability distribution $p(s' \mid s, a)$.
(a) [BONUS] For the first part of this problem, assume that we have used our favorite learning method to obtain estimates of the transition function $\hat{T}$ and the reward function $\hat{R}$. Given that $\max_{s,a} |\hat{R}(s, a) - R(s, a)| \leq \varepsilon_R$ and $\max_{s,a} \|\hat{T}(s, a) - T(s, a)\|_1 \leq \varepsilon_P$, show that for any policy $\pi : S \to A$ and $\forall s \in S$: [10 bonus points]

$$\|V^{\pi}_{\hat{M}} - V^{\pi}_{M}\|_\infty \leq \frac{\varepsilon_R}{1 - \gamma} + \frac{\gamma\, \varepsilon_P}{(1 - \gamma)^2} \qquad (3)$$
Assume that the reward function is in [0, 1].
Hint: 1) Expand $V_M(s)$ using the Bellman equation. 2) Hölder's inequality: $\|u \cdot v\|_1 \leq \|u\|_1 \cdot \|v\|_\infty$.
(b) [BONUS] We just bounded the error of acting according to any policy $\pi : S \to A$ under the estimated model $\hat{M}$ vs. the true model M. Now we use this to bound the value error of acting optimally vs. acting optimally according to the estimated model ($\pi^*_{\hat{M}}$).
Using (3), show that [3 bonus points]

$$V^{*}_{M}(s) - V^{\pi^*_{\hat{M}}}_{M}(s) \leq 2 \sup_{\pi : S \to A} \|V^{\pi}_{\hat{M}} - V^{\pi}_{M}\|_\infty \qquad (4)$$
(c) [BONUS] Given

$$\varepsilon_R = \sqrt{\frac{1}{2n} \log \frac{4|S \times A|}{\delta}} \qquad (5)$$

$$\varepsilon_P = |S| \cdot \sqrt{\frac{1}{2n} \log \frac{4|S \times A \times S|}{\delta}} \qquad (6)$$

use (3) to simplify (4), i.e. $V^{*}_{M}(s) - V^{\pi^*_{\hat{M}}}_{M}(s)$. What is the dependence of this error on n, the number of observations required for each state-action pair? [2 bonus points]
(d) [BONUS] Given a dataset $D_{s,a} = \{(s, a, s'_1, r_1), \ldots, (s, a, s'_n, r_n)\}$, we estimate the unknown transition and reward functions using empirical frequencies as:

$$\hat{T}(s, a, s') = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}(s'_i = s') \qquad (7)$$

$$\hat{R}(s, a) = \frac{1}{n} \sum_{i=1}^{n} r_i \qquad (8)$$

For notational convenience, we will use $\hat{T}(s, a)$ to denote a vector of size |S| whose elements are the probabilities computed in (7) for each next state $s' \in S$. Prove that $\varepsilon_R$ and $\varepsilon_P$ take the values given in (5) and (6). [5 bonus points]
Hint: Hoeffding's inequality²
² https://en.wikipedia.org/wiki/Hoeffding%27s_inequality#General_case_of_bounded_random_variables
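As a small sketch of how the empirical estimators (7) and (8) might be computed from logged transitions, the snippet below builds them for a single (s, a) pair. The dataset is synthetic (drawn from a made-up ground-truth distribution true_T invented for this demo), so all names and numbers are purely illustrative.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for one fixed (s, a) pair: n observed next states and rewards,
# drawn from a made-up ground-truth model purely for illustration.
num_states, n = 3, 500
true_T = np.array([0.2, 0.5, 0.3])               # "true" p(s' | s, a), assumed for the demo
next_states = rng.choice(num_states, size=n, p=true_T)
rewards = rng.uniform(0.0, 1.0, size=n)

# Eq. 7: empirical transition frequencies T_hat(s, a, s') for each s'.
T_hat = np.bincount(next_states, minlength=num_states) / n
# Eq. 8: empirical mean reward R_hat(s, a).
R_hat = rewards.mean()

print("T_hat(s, a) =", T_hat)                    # vector of size |S|
print("L1 error    =", np.abs(T_hat - true_T).sum())
print("R_hat(s, a) =", R_hat)

As n grows, the L1 error printed above shrinks at roughly the $1/\sqrt{n}$ rate that (5) and (6) formalize via Hoeffding's inequality.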
4 Policy Gradients Variance Reduction [10 points]
In class (this will be covered on Oct 29th, but you don’t really need to know the derivation for this question), we derived the policy gradient as

$$\nabla_\theta J(\theta) = \nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] \qquad (9)$$
$$= \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau) \nabla_\theta \log \pi_\theta(\tau)] \qquad (10)$$
$$\approx \frac{1}{N} \sum_{i=1}^{N} R(\tau_i) \nabla_\theta \log \pi_\theta(\tau_i) \qquad (11)$$

where τ is a trajectory³, R(τ) is the associated reward, N is the number of trials used to estimate the expected reward, J(·) is the expected reward, and θ denotes the parameters of the policy. These gradients are often very noisy and lead to unstable training. In this question, we show a simple method to reduce the variance of the gradients.
(a) Show that adjusting the reward as R(τ) := R(τ) − b, where b is a constant baseline, does not change the gradient estimate in Eq. 9. [2 points]
(b) Compute the variance of $\nabla_\theta J(\theta)$ and notice how subtracting b helps reduce it. What value of b will lead to the least variance? [8 points]
Hint: The variance can be written as $\mathrm{Var}(x) = \mathbb{E}[x^2] - (\mathbb{E}[x])^2$, where x is substituted as $R(\tau) \nabla_\theta \log \pi_\theta(\tau)$.
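To build intuition for (a) and (b), the effect of a baseline can be seen on a toy one-step problem: a softmax policy over two actions where R(τ) is simply the sampled action's reward. Everything below (the policy parameters, rewards, and baseline values) is a hypothetical setup for illustration only, not part of the assignment's derivation.

import numpy as np

rng = np.random.default_rng(0)

# Toy one-step policy: softmax over two actions; R(tau) is the action's reward.
theta = np.array([0.2, -0.1])
rewards = np.array([1.0, 5.0])
probs = np.exp(theta) / np.exp(theta).sum()

def grad_log_pi(a):
    # grad_theta log pi_theta(a) for a softmax policy: one-hot(a) - probs.
    return np.eye(2)[a] - probs

def per_sample_grads(b, num_samples=20000):
    # Per-sample REINFORCE terms (R(tau) - b) * grad_theta log pi_theta(tau), as in Eq. 11.
    actions = rng.choice(2, size=num_samples, p=probs)
    return np.stack([(rewards[a] - b) * grad_log_pi(a) for a in actions])

for b in (0.0, rewards.mean()):
    g = per_sample_grads(b)
    print(f"b = {b}: mean = {g.mean(axis=0)}, variance = {g.var(axis=0)}")
# The sample mean (the gradient estimate) is essentially unchanged by b,
# while the per-sample variance drops for this choice of b.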
5 Coding: Dynamic Programming and Deep Q-Learning [50 points + 15 bonus points]
Follow the instructions at this link to the HW4 coding webpage.
³ A trajectory is defined as a sequence of states and actions, i.e. $(s_0, a_0, s_1, a_1, \ldots, s_{n-1}, a_{n-1}, s_n)$