Assignment 1
1 Effect of Modeling Errors in Policy Evaluation [20 pts]
Consider deploying a Reinforcement Learning (RL) agent on an episodic MDP M with a horizon
of H timesteps. The true MDP M is never revealed to the agent except for the state and action
space (S and A, respectively) and the time horizon H. In other words, the agent knows neither the
expected reward $r(s, a)$ nor the transition dynamics $P(s' \mid s, a)$ for a generic state-action pair $(s, a)$.
As the agent explores the environment in different episodes it builds an empirical model of the
world (often via maximum likelihood), which we denote by $\hat{M}$. After collecting sufficient data we
would expect the empirical MDP to be similar to the true MDP.
Since our MDP is episodic, the value of a state depends on the time to termination of the episode.
Thus, for a given stochastic policy π, we require a different value function for each timestep, which
we denote $V_1, \ldots, V_H$ in $M$ and $\hat{V}_1, \ldots, \hat{V}_H$ in $\hat{M}$. In particular, for $i = 1, \ldots, H$, let
\[
V_i(s) = \mathbb{E}\!\left[\,\sum_{t=i}^{H} r(s_t, a_t) \;\middle|\; s_i = s\right]
\quad \text{and} \quad
\hat{V}_i(s) = \mathbb{E}\!\left[\,\sum_{t=i}^{H} \hat{r}(s_t, a_t) \;\middle|\; s_i = s\right],
\]
where the expectation is taken while following policy $\pi$. Suppose we use the empirical MDP $\hat{M}$
instead of $M$ to evaluate policy $\pi$. Assuming $\hat{V}_{H+1} = V_{H+1} = \vec{0}$, show that for all $i = 1, \ldots, H$:
\[
\hat{V}_i(s) - V_i(s) = \sum_{t=i}^{H} \mathbb{E}\!\left[\,\hat{r}(s_t, a_t) - r(s_t, a_t) + \sum_{s'} \left(\hat{P}(s' \mid s_t, a_t) - P(s' \mid s_t, a_t)\right) \hat{V}_{t+1}(s') \;\middle|\; s_i = s\right].
\]
In the above equality the expectation is taken with respect to the states encountered in the true MDP
$M$ upon starting from $s_i = s$ and following the stochastic policy $\pi$.
This result relates the value of policy $\pi$ on $\hat{M}$ and $M$ using expectations over trajectories in $M$, which
we can compute easily. If $\hat{r}$ and $\hat{P}$ are close to $r$ and $P$, then this result can be used to
conclude that the empirical value function $\hat{V}$ is also close to the true one, $V$.
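As background for the proof (a standard identity, stated here only to fix notation), the definitions above satisfy the finite-horizon Bellman recursions
\[
V_i(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\!\left[ r(s, a) + \sum_{s'} P(s' \mid s, a)\, V_{i+1}(s') \right],
\qquad
\hat{V}_i(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\!\left[ \hat{r}(s, a) + \sum_{s'} \hat{P}(s' \mid s, a)\, \hat{V}_{i+1}(s') \right],
\]
for $i = 1, \ldots, H$, with $V_{H+1} = \hat{V}_{H+1} = \vec{0}$.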
If you use an existing result in your derivation, you need to cite the source.
2 Value Iteration [35 pts]
(a) After each iteration of value iteration you can extract the current best policy (given the current
value function) by taking an argmax instead of a max. If the best policy at step $k$ is the same
as the best policy at step $k+1$ when doing value iteration, can the policy ever change again
(with further iterations of value iteration)? Prove that it cannot, or disprove it with a counterexample.
(b) Consider an MDP with finite state space and finite action set. Empirically we often halt value
iteration after a fixed number of steps for computational reasons, yielding an approximately
optimal value function $\tilde{V}$. We then extract a greedy policy $\pi_{\tilde{V}}$ from $\tilde{V}$ by taking the argmax
of one more backup. Let's assume we know that the gap to the optimal value function at every state
is at most $\varepsilon$ (for extra understanding, think about how the difference $|BV - V|$ allows us to ensure this):
\[
|V^{*}(s) - \tilde{V}(s)| \le \varepsilon, \quad \text{for all } s.
\]
Then policy $\pi_{\tilde{V}}$ has values close to the optimal ones for each state. In particular, define the
loss function $L_{\tilde{V}}$ as follows:
\[
L_{\tilde{V}}(s) = V^{*}(s) - V_{\pi_{\tilde{V}}}(s), \quad \text{for all } s,
\]
where $V^{*}$ is the optimal value function and $V_{\pi_{\tilde{V}}}$ is the value function under policy $\pi_{\tilde{V}}$. Prove that
\[
L_{\tilde{V}}(s) \le \frac{2\gamma\varepsilon}{1 - \gamma}
\]
for all states $s$. Here $\gamma < 1$ is the discount factor.
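For concreteness, one common reading of the notation above (an assumption on our part, since the handout does not define these objects explicitly) takes $B$ to be the Bellman optimality backup and $\pi_{\tilde{V}}$ the greedy policy with respect to $\tilde{V}$:
\[
(BV)(s) = \max_{a}\Big[\, r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \Big],
\qquad
\pi_{\tilde{V}}(s) = \operatorname*{arg\,max}_{a}\Big[\, r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, \tilde{V}(s') \Big].
\]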
If you use outside sources you must write up your own solution and cite the source you used.
3 Grid Policies [20 pts]
Consider the following grid environment. Starting from any unshaded square, you can move up,
down, left, or right. Actions are deterministic and always succeed (e.g., going left from state 1 moves
the agent to state 0) unless they would cause the agent to run into a wall. The thicker edges indicate walls, and
attempting to move in the direction of a wall results in staying in the same square. Taking any
action from the green target square (no. 5) earns a reward of $+5$ and ends the episode. Taking any
action from the red square of death (no. 11) earns a reward of $-5$ and ends the episode. Otherwise,
each move is associated with some reward $r \in \{-1, 0, +1\}$. Assume the discount factor $\gamma = 1$ unless
otherwise specified.
The grid is numbered as follows (state 0 is the bottom-left square and state 24 the top-right; walls are shown as thicker edges in the original figure):

20 21 22 23 24
15 16 17 18 19
10 11 12 13 14
 5  6  7  8  9
 0  1  2  3  4
(a) Define the reward $r$ for all states (except state 5 and state 11, whose rewards are specified
above) that would cause the optimal policy to return the shortest path to the green target
square (no. 5).
(b) Using $r$ from part (a), find the optimal value function for each square.
(c) Does setting $\gamma = 0.8$ change the optimal policy? Why or why not?
(d) All transitions are even better now: each transition earns an extra reward of 1 in addition
to the reward you defined in (a). Assume $\gamma = 0.8$ as in part (c). How would the value function
change? How would the policy change? Explain why. (A reminder of the discounted-return formula
follows this list.)
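As a reminder for parts (c) and (d) (standard background, not something you need to cite), the discounted return of a trajectory that terminates after $k$ steps, and the effect of adding a constant $c$ to every per-step reward, are
\[
G = \sum_{t=0}^{k-1} \gamma^{t}\, r_{t+1},
\qquad
\sum_{t=0}^{k-1} \gamma^{t}\, c = c\,\frac{1 - \gamma^{k}}{1 - \gamma}.
\]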
4 Frozen Lake MDP [25 pts]
Now you will implement value iteration and policy iteration for the Frozen Lake environment from
OpenAI Gym (https://gym.openai.com/envs/FrozenLake-v0). We have provided custom versions of
this environment in the starter code.
(a) (coding) Implement policy iteration in vi_and_pi.py. The stopping tolerance (defined as
$\max_s |V_{\mathrm{old}}(s) - V_{\mathrm{new}}(s)|$) is tol $= 10^{-3}$. Use $\gamma = 0.9$. Return the optimal value function and
the optimal policy. [10 pts]
(b) (coding) Implement value iteration in vi_and_pi.py. The stopping tolerance is tol $= 10^{-3}$.
Use $\gamma = 0.9$. Return the optimal value function and the optimal policy. (An illustrative value
iteration sketch follows part (c).) [10 pts]
(c) (written) Run both methods on the Deterministic-4x4-FrozenLake-v0 and
Stochastic-4x4-FrozenLake-v0 environments. In the second environment, the dynamics of the
world are stochastic. How does stochasticity affect the number of iterations required, and the
resulting policy? [5 pts]
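For orientation only, below is a minimal sketch of value iteration with the stopping rule described above. It assumes the Gym FrozenLake transition-model format, in which P[s][a] is a list of (probability, next_state, reward, terminal) tuples; the function name, its signature, and the nS/nA arguments are illustrative and may not match the starter code in vi_and_pi.py.

\begin{verbatim}
import numpy as np

def value_iteration_sketch(P, nS, nA, gamma=0.9, tol=1e-3):
    """Illustrative only. P[s][a] = [(prob, next_s, reward, done), ...]."""
    def q_values(s, V):
        # Expected one-step return for each action from state s,
        # bootstrapping from V at non-terminal successors.
        return [sum(prob * (reward + gamma * V[next_s] * (not done))
                    for prob, next_s, reward, done in P[s][a])
                for a in range(nA)]

    V = np.zeros(nS)
    while True:
        # One Bellman optimality backup per state.
        V_new = np.array([max(q_values(s, V)) for s in range(nS)])
        delta = np.max(np.abs(V_new - V))   # max_s |V_old(s) - V_new(s)|
        V = V_new
        if delta < tol:
            break

    # Greedy policy extracted with one more backup.
    policy = np.array([int(np.argmax(q_values(s, V))) for s in range(nS)])
    return V, policy
\end{verbatim}

The (not done) factor stops bootstrapping through terminal transitions; in the provided FrozenLake environments terminal states typically self-loop with zero reward, so either convention reaches the same values.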