Assignment #3 Policy Gradient Methods

1 Policy Gradient Methods (45pts coding + 20pts writeup)
The goal of this problem is to experiment with policy gradient and its variants, including variance reduction
methods. Your goals will be to set up policy gradient for both continuous and discrete environments, and
implement a neural network baseline for variance reduction. The framework for the vanilla policy gradient
algorithm is set up in pg.py, and everything that you need to implement is in this file. The file has detailed
instructions for each implementation task, but an overview of the key steps in the algorithm is provided here.
1.1 REINFORCE
Recall the vanilla policy-gradient theorem,
$$\nabla_\theta V(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s, a)\, Q^{\pi_\theta}(s, a)\right]$$
REINFORCE is a Monte Carlo policy gradient algorithm, so we will be using the sampled returns Gt as
unbiased estimates of Q^πθ(s, a). Then the gradient update can be expressed as maximizing the following
objective:
$$V(\theta) = \frac{1}{|D|} \sum_{\tau \in D} \sum_{t}^{T} \log\!\big(\pi_\theta(s_t, a_t)\big)\, G_t$$
where D is the set of all trajectories collected by policy πθ, and τ = (s0, a0, r0, s1...) is a trajectory.
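As a purely illustrative sketch of these two quantities (not part of the starter code), the snippet below computes the Monte Carlo returns Gt for one trajectory and the corresponding surrogate objective from the per-step log-probabilities; the helper names are hypothetical and the actual interfaces in pg.py may differ.

import numpy as np

def discounted_returns(rewards, gamma):
    # G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
    # Computed backwards so each G_t reuses the already-computed G_{t+1}.
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def reinforce_objective(log_probs, returns):
    # log pi_theta(s_t, a_t) * G_t, averaged over the sampled steps;
    # gradient ascent on this surrogate gives a REINFORCE-style update.
    return np.mean(np.asarray(log_probs) * np.asarray(returns))

# Example: a 4-step trajectory with gamma = 1.0 gives G = [4, 3, 2, 1].
print(discounted_returns([1.0, 1.0, 1.0, 1.0], gamma=1.0))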
1.2 Baseline
One difficulty of training with the REINFORCE algorithm is that the Monte Carlo estimated return Gt
can have high variance. To reduce variance, we subtract a baseline b(s) from the estimated returns when
computing the policy gradient. A good baseline is the state value function, b(s) = V^πθ(s), which requires a
training update to minimize the mean-squared error loss:
$$V(b) = \frac{1}{|D|} \sum_{\tau \in D} \sum_{t}^{T} \big(b(s_t) - G_t\big)^2$$
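For illustration only, a minimal PyTorch sketch of one such baseline update, assuming a 4-dimensional observation (CartPole-sized) and a small MLP; the starter code may use a different framework, architecture, and interface.

import torch
import torch.nn as nn

# Hypothetical baseline network mapping states to scalar value estimates.
baseline = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(baseline.parameters(), lr=1e-3)

def update_baseline(states, returns):
    # One gradient step minimizing the mean-squared error (b(s_t) - G_t)^2.
    # states: float tensor of shape (N, 4); returns: float tensor of shape (N,).
    values = baseline(states).squeeze(-1)
    loss = ((values - returns) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()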
1.3 Advantage Normalization
A second variance reduction technique is to normalize the computed advantages so that they have mean 0
and standard deviation 1. From a theoretical perspective, centering the advantages can be viewed as
adjusting them by a constant baseline, which does not change the policy gradient. Likewise,
rescaling the advantages effectively changes the learning rate by a factor of 1/σ, where σ is the standard
deviation of the empirical advantages.
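A minimal sketch of this normalization (with a small constant added to the denominator to avoid dividing by zero when all advantages are equal; the starter code's exact handling may differ):

import numpy as np

def normalize_advantages(advantages, eps=1e-8):
    # Shift to mean 0 and rescale to standard deviation 1.
    advantages = np.asarray(advantages, dtype=np.float64)
    return (advantages - advantages.mean()) / (advantages.std() + eps)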
1.4 Writeup Questions
(a) (8 pts) (CartPole-v0) Test your implementation on the CartPole-v0 environment with the given configuration
parameters (in config.py) by running
python pg.py
With the given configuration file, this should achieve a perfect score of 200 within the 100 iterations.
NOTE: training may repeatedly converge to 200 and diverge. Your plot does not have to reach 200 and
stay there. We only require that you achieve a perfect score of 200 sometime during training. Include
in your writeup the tensorboard plot for the average reward. Start tensorboard with:
tensorboard --logdir=results/CartPole-v0
and then navigate to the link it gives you. Click on the "SCALARS" tab to view the average reward
graph.
Now, modify the configuration file to try running with larger and smaller batch sizes. What happens as
you increase/decrease the batch size, and why? Note that as you perform multiple runs, tensorboard
will generate multiple result files, so you should either remove the previous runs' results or change the
output path for each run in the configuration file. Try running without the baseline, and include the
plot. Do you notice any difference? Explain.
(b) (5 pts) (InvertedPendulum-v1) Test your implementation on the InvertedPendulum-v1 environment by modifying config.py. To use this environment (and HalfCheetah-v1), you will have to install MuJoCo
and mujoco-py. Instructions for installation are in the README.txt file in the starter code. We
have run into errors installing MuJoCo on Windows, so Windows users may instead need to use the
FarmShare servers.
Find configuration parameters so that training reaches the maximum episode reward of 1000 within
100 iterations. Again, training will probably reach 1000, and then repeatedly diverge and converge. We
only require that you reach 1000 sometime during training. Include the tensorboard plot in your writeup.
(c) (7 pts) (HalfCheetah-v1) Test your implementation on the much more difficult HalfCheetah-v1 environment,
with γ = 0.9 so that training reaches an average episode reward of more than 200 within 100 iterations.
Include the tensorboard plot, and explain the parameter choices you make in the writeup. Hint: a
large batch size of 50000 works well for us, and it may help to change the size of the policy network.
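For reference, the kind of values one might try for this part, written as a hypothetical Python config; the field names are illustrative and may not match the starter code's config.py, and the learning rate and network sizes in particular are assumptions that will need tuning.

# Hypothetical HalfCheetah-v1 configuration sketch (field names illustrative only).
env_name      = "HalfCheetah-v1"
gamma         = 0.9      # discount specified in the problem statement
batch_size    = 50000    # large batches help, per the hint above
num_batches   = 100      # the 100-iteration budget
learning_rate = 3e-2     # assumption; needs tuning
n_layers      = 3        # a larger policy network, as the hint suggests
layer_size    = 32       # assumption; needs tuning
use_baseline  = True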
2 Expected Regret Bounds (35pts)
Assume a reinforcement learning algorithm A for discounted infinite-horizon MDPs has expected regret
$$\mathbb{E}_{*}\!\left[\sum_{t=1}^{T} r_t\right] - \mathbb{E}_{A}\!\left[\sum_{t=1}^{T} r_t\right] = f(T)$$
for all T > 0, where E∗ is the expectation over the probability distribution induced by the optimal policy π∗ and EA is
the expectation with respect to the algorithm’s behavior. We assume that γ ∈ [0, 1) is the discount factor
and that rewards are normalized, i.e., rt ∈ [0, 1].
(a) (15 pts) Let π be an arbitrary policy or algorithm. Show that for any ε′ > 0 and T′ ≥ log_{1/γ}(H/ε′), where H = 1/(1 − γ), we have
$$\left| V_\pi(s) - \sum_{t=1}^{T'} \gamma^{t-1}\, \mathbb{E}_\pi\!\left[r_t \mid s_1 = s\right] \right| \le \epsilon', \quad \text{for all states } s.$$
Note Vπ is the value function associated with π and Eπ is the expectation with respect to the randomization of π.
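One standard fact that may be useful here (stated only as a reminder, not necessarily the intended argument): for γ ∈ [0, 1),
$$\sum_{t=T'+1}^{\infty} \gamma^{t-1} = \frac{\gamma^{T'}}{1-\gamma} = \gamma^{T'} H,$$
so with rewards bounded in [0, 1] the part of Vπ(s) beyond step T′ contributes at most γ^{T′}H.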
(b) (20 pts) From the regret guarantee of algorithm A and part (a), it follows that for any ε′ > 0 and T′ ≥ log_{1/γ}(H/ε′), we have
$$\mathbb{E}_{*}\!\left[V_{*}(s_{T+1})\right] - \mathbb{E}_{A}\!\left[V_{A}(s_{T+1})\right] \le f(T' + T) - f(T) + 2\epsilon', \quad \text{for } T > 0,$$
where VA is the value function of the (possibly nonstationary) policy that algorithm A follows. Assume
now f(T) = √T. Show that for any ε > 0 and
$$t \ge 1 + \frac{1}{\epsilon^2}\left(\log_{1/\gamma}\frac{4H}{\epsilon}\right)^2,$$
we have
$$\mathbb{E}_{*}\!\left[V_{*}(s_t)\right] - \mathbb{E}_{A}\!\left[V_{A}(s_t)\right] \le \epsilon.$$
Hint: It may be helpful to set ε′ to be some function of ε and choose an appropriate value of T′.
