Deep Reinforcement Learning HW4:
Model-Based RL
1 Introduction
The goal of this assignment is to get experience with model-based reinforcement
learning. Model-based reinforcement learning consists of two main parts: learning a dynamics model, and using a controller to plan and execute actions that
minimize a cost function. You will be implementing both the learned dynamics
model and the controller in this assignment. This assignment is based on this
paper.
2 Model-Based Reinforcement Learning
We will now provide a brief overview of model-based reinforcement learning
(MBRL), and the specific type of MBRL you will be implementing in this
homework. Please see Lecture 11: Model-Based Reinforcement Learning (with
specific emphasis on the slides near page 12) for additional details.
MBRL consists primarily of two aspects: (1) learning a dynamics model and (2)
using the learned dynamics models to plan and execute actions that minimize
a desired cost function.
2.1 Dynamics Model
In this assignment, you will learn a neural network dynamics model of the
form:
\hat{\Delta}_{t+1} = f_\theta(s_t, a_t)    (1)
such that
\hat{s}_{t+1} = s_t + \hat{\Delta}_{t+1}    (2)
in which the neural network fθ encodes the difference between the next state
and the current state (given the current action that was executed).
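For concreteness, a minimal sketch of such a delta-prediction model is shown below, written as a PyTorch-style MLP with arbitrary layer sizes. The class and method names are illustrative rather than the starter code's interface, and real implementations typically also normalize states, actions, and state differences (see the implementation details in the code).

# Illustrative only: a small MLP f_theta that predicts the state difference.
# Class/method names and layer sizes are assumptions, not the starter code's API.
import torch
import torch.nn as nn

class DeltaDynamicsModel(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, state_dim),
        )

    def forward(self, states, actions):
        # Eqn. 1: Delta_hat_{t+1} = f_theta(s_t, a_t)
        return self.net(torch.cat([states, actions], dim=-1))

    def predict_next_state(self, states, actions):
        # Eqn. 2: s_hat_{t+1} = s_t + Delta_hat_{t+1}
        return states + self(states, actions)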
You will train fθ by performing gradient descent on the following objective:
\mathcal{L}(\theta) = \sum_{(s_t, a_t, s_{t+1}) \in \mathcal{D}} \left\| (s_{t+1} - s_t) - f_\theta(s_t, a_t) \right\|_2^2    (3)
                    = \sum_{(s_t, a_t, s_{t+1}) \in \mathcal{D}} \left\| \Delta_{t+1} - \hat{\Delta}_{t+1} \right\|_2^2    (4)
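Continuing the illustrative PyTorch sketch above, a single gradient step on this objective might look like the following (averaging the squared errors over a minibatch rather than summing over all of D):

# A minimal training step for Eqns. 3-4 on one minibatch of (s_t, a_t, s_{t+1});
# `model` and `optimizer` are assumed to be the sketch above plus any torch optimizer.
def dynamics_training_step(model, optimizer, states, actions, next_states):
    target_deltas = next_states - states           # Delta_{t+1} = s_{t+1} - s_t
    predicted_deltas = model(states, actions)      # Delta_hat_{t+1} = f_theta(s_t, a_t)
    loss = ((target_deltas - predicted_deltas) ** 2).sum(dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()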
2.2 Action Selection
Given the learned dynamics model, we now want to select and execute actions
that minimize a specified cost function. Ideally, you would calculate these actions by solving the following optimization:
a^* = \arg\min_{a_{t:\infty}} \sum_{t'=t}^{\infty} c(\hat{s}_{t'}, a_{t'}) \quad \text{s.t.} \quad \hat{s}_{t'+1} = \hat{s}_{t'} + f_\theta(\hat{s}_{t'}, a_{t'}).    (5)
However, solving Eqn. 5 is impractical for two reasons: (1) planning over an
infinite sequence of actions is generally difficult and (2) the learned dynamics
model is inaccurate, so planning far into the future will also be inaccurate.
Instead, we will solve the following optimization:
A^* = \arg\min_{\{A^{(0)}, \ldots, A^{(K-1)}\}} \sum_{t'=t}^{t+H-1} c(\hat{s}_{t'}, a_{t'}) \quad \text{s.t.} \quad \hat{s}_{t'+1} = \hat{s}_{t'} + f_\theta(\hat{s}_{t'}, a_{t'}),    (6)
in which each A^{(k)} is a random action sequence of length H. What Eqn. 6
says is to consider K random action sequences of length H, predict the future
states for each action sequence using the learned dynamics model fθ, evaluate
the total cost of each candidate action sequence, and select the action sequence
with the lowest cost.
However, Eqn. 6 only plans H steps into the future. This is problematic because
we would like our agent to execute long-horizon tasks. We therefore adopt a
model predictive control (MPC) approach, in which we solve Eqn. 6, execute
the first action, proceed to the next state, resolve Eqn. 6 at the next state, and
repeat this process.
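A minimal NumPy sketch of this random-shooting MPC controller is given below. Here `predict_deltas(states, actions)` stands in for a batched version of f_θ that returns predicted state differences, and `cost_fn(states, actions)` for a batched cost function; these names and the default values of K and H are assumptions, not the starter code's actual interface.

# Random-shooting MPC sketch. `predict_deltas` and `cost_fn` are hypothetical
# placeholders for the learned dynamics model and the provided cost function.
import numpy as np

def get_mpc_action(state, predict_deltas, cost_fn, action_space,
                   num_sequences=4096, horizon=10):
    """Solve Eqn. 6 by random shooting and return only the first action (MPC)."""
    action_dim = action_space.shape[0]
    # Sample K random action sequences of length H, uniform over the action space.
    actions = np.random.uniform(action_space.low, action_space.high,
                                size=(num_sequences, horizon, action_dim))
    states = np.tile(state, (num_sequences, 1))      # K copies of the current state
    total_costs = np.zeros(num_sequences)
    for t in range(horizon):
        total_costs += cost_fn(states, actions[:, t])
        # s_hat_{t'+1} = s_hat_{t'} + f_theta(s_hat_{t'}, a_{t'})
        states = states + predict_deltas(states, actions[:, t])
    best = np.argmin(total_costs)
    return actions[best, 0]   # execute only the first action, then replan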
2.3 On-Policy Data Collection
Although MBRL is in theory off-policy—meaning it can learn from any data—
in practice it will perform badly if you don’t have on-policy data. We can
therefore use on-policy data collection to improve the policy’s performance.
This is summarized in the algorithm below:
Algorithm 1 Model-Based Reinforcement Learning with On-Policy Data
  Run base policy π_0(a_t | s_t) (e.g., a random policy) to collect D = {(s_t, a_t, s_{t+1})}
  while not done do
      Train f_θ using D (Eqn. 4)
      s_t ← current agent state
      for t = 0 to T do
          A^* ← solution of Eqn. 6 at state s_t
          a_t ← first action in A^*
          Execute a_t and proceed to next state s_{t+1}
          Add (s_t, a_t, s_{t+1}) to D
      end for
  end while
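As a rough Python sketch of Algorithm 1, the outer loop might be organized as follows. The helpers `collect_random_rollouts`, `train_dynamics_model`, `model.predict_deltas`, and `get_mpc_action` (from the sketch above) are hypothetical placeholders for the pieces described earlier, not the starter code's API, and the environment is assumed to follow the old gym 4-tuple step interface.

# Sketch of Algorithm 1. All helper names are hypothetical placeholders.
def run_mbrl_with_onpolicy_data(env, model, cost_fn, num_iterations=10,
                                rollout_length=500):
    dataset = collect_random_rollouts(env)            # base policy pi_0 -> D = {(s_t, a_t, s_{t+1})}
    for _ in range(num_iterations):
        train_dynamics_model(model, dataset)          # fit f_theta on D (Eqn. 4)
        state = env.reset()
        for _ in range(rollout_length):
            action = get_mpc_action(state, model.predict_deltas, cost_fn,
                                    env.action_space)    # Eqn. 6, take only the first action
            next_state, _, done, _ = env.step(action)
            dataset.append((state, action, next_state))  # aggregate on-policy data into D
            state = next_state
            if done:
                break
    return model, dataset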
3 Code
You will implement the MBRL algorithm described in the previous section.
3.1 Installation
Obtain the code from https://github.com/berkeleydeeprlcourse/homework/
tree/master/hw4. In addition to the installation requirements from previous
homeworks, install additional required packages by running:
pip install -r requirements.txt
3.2 Overview
You will modify two files:
• model_based_policy.py
• model_based_rl.py
You should also familiarize yourself with the following files:
• main.py
• utils.py
• run_all.sh
All other files are optional to look at.
4 Implementation
Problem 1
What you will implement: The neural network dynamics model, which you will train using a fixed dataset of rollouts collected by a random policy.
Where in the code to implement: All parts of the code where you find
### PROBLEM 1
### YOUR CODE HERE
Implementation details are in the code.
How to run:
python main.py q1
What will be outputted: Plots of the learned dynamics model’s predictions vs.
ground truth will be saved in the experiment folder.
What will a correct implementation output: Your model’s predictions should be
similar to the ground truth for the majority of the state dimensions.
Problem 2
What will you implement: Action selection using your learned dynamics model
and a given cost function.
Where in the code to implement: All parts of the code where you find
### PROBLEM 2
### YOUR CODE HERE
Implementation details are in the code.
How to run:
python main.py q2
What will be outputted: The log.txt file saved in the experiment folder will tell
you the ReturnAvg and ReturnStd for the random policy versus your model-based policy.
What will a correct implementation output: The random policy should achieve
a ReturnAvg of around -160, while your model-based policy should achieve a
ReturnAvg of around 0.
Problem 3a
What will you implement: MBRL with on-policy data collection.
Where in the code to implement: All parts of the code where you find
### PROBLEM 3
### YOUR CODE HERE
Implementation details are in the code.
How to run:
python main.py q3
What will be outputted: The log.txt / log.csv file saved in the experiment folder
will tell you the ReturnAvg for each iteration of on-policy data collection.
What will a correct implementation output: Your model-based policy should achieve
a ReturnAvg of around 300 by the 10th iteration.
Problem 3b
What will you implement: You will compare the performance of your MBRL
algorithm when varying three hyperparameters: the number of random action
sequences considered during action selection, the MPC planning horizon, and
the number of neural network layers for the dynamics model.
Where in the code to implement: There is nothing additional to implement.
How to run:
python main.py q3 --exp_name action128 --num_random_action_selection 128
python main.py q3 --exp_name action4096 --num_random_action_selection 4096
python main.py q3 --exp_name action16384 --num_random_action_selection 16384
python plot.py --exps HalfCheetah_q3_action128 HalfCheetah_q3_action4096 \
HalfCheetah_q3_action16384 --save HalfCheetah_q3_actions
python main.py q3 --exp_name horizon10 --mpc_horizon 10
python main.py q3 --exp_name horizon15 --mpc_horizon 15
python main.py q3 --exp_name horizon20 --mpc_horizon 20
python plot.py --exps HalfCheetah_q3_horizon10 HalfCheetah_q3_horizon15 \
HalfCheetah_q3_horizon20 --save HalfCheetah_q3_mpc_horizon
python main.py q3 --exp_name layers1 --nn_layers 1
python main.py q3 --exp_name layers2 --nn_layers 2
python main.py q3 --exp_name layers3 --nn_layers 3
python plot.py --exps HalfCheetah_q3_layers1 HalfCheetah_q3_layers2 \
HalfCheetah_q3_layers3 --save HalfCheetah_q3_nn_layers
What will be outputted: Three plots will be saved to the plots/ folder.
Extra Credit
Implement one of the following improvements:
(i) Instead of performing action selection using random action sequences, use the Cross-Entropy Method (CEM); see the pseudo-code on Wikipedia and the illustrative sketch after this list. How much does CEM improve performance?
(ii) The current loss function is only a one-step loss (i.e., we train on (s_t, a_t, s_{t+1}) tuples). Extend the loss function to be multi-step (i.e., train on (s_t, a_t, ..., a_{t+N-1}, s_{t+N}) tuples) for N > 1. How much do multi-step losses improve performance?
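For option (i), one possible shape of a CEM-based action selector is sketched below, reusing the same hypothetical `predict_deltas` and `cost_fn` placeholders as the random-shooting sketch; the iteration count and elite fraction are arbitrary, and this is only one way to instantiate CEM, not a prescribed solution.

# CEM action-selection sketch for extra-credit option (i). Helper names and
# hyperparameter defaults are illustrative assumptions.
import numpy as np

def get_cem_action(state, predict_deltas, cost_fn, action_space,
                   num_sequences=512, horizon=10, num_iters=4, num_elites=64):
    """Iteratively refit a Gaussian over action sequences to the lowest-cost
    ("elite") samples and return the first action of the final mean sequence."""
    action_dim = action_space.shape[0]
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(num_iters):
        actions = np.random.normal(mean, std,
                                   size=(num_sequences, horizon, action_dim))
        actions = np.clip(actions, action_space.low, action_space.high)
        states = np.tile(state, (num_sequences, 1))
        total_costs = np.zeros(num_sequences)
        for t in range(horizon):
            total_costs += cost_fn(states, actions[:, t])
            states = states + predict_deltas(states, actions[:, t])
        elites = actions[np.argsort(total_costs)[:num_elites]]
        mean, std = elites.mean(axis=0), elites.std(axis=0)
    return mean[0]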
5 PDF Deliverable
You can generate all results needed for the deliverables by running:
bash run_all.sh
Please provide the following plots and responses on the specified pages.
Problem 1 (page 1)
(a) Provide a plot of the dynamics model predictions when the predictions
are mostly accurate.
(b) For (a), for which state dimension are the predictions the most inaccurate?
Give a possible reason why the predictions are inaccurate.
Problem 2 (page 2)
(a) Provide the ReturnAvg and ReturnStd for the random policy and for your
model-based controller trained on the randomly gathered data.
Problem 3a (page 3)
(a) Plot of the returns versus iteration when running model-based reinforcement learning.
Problem 3b (page 4)
(a) Plot comparing performance when varying the MPC horizon.
(b) Plot comparing performance when varying the number of randomly sampled action sequences used for planning.
(c) Plot comparing performance when varying the number of neural network
layers for the learned dynamics model.
[optional] Extra credit (page 5)
(a) Plot comparing performance of either CEM to random for action selection,
or multi-step loss training to single-step loss training.
6 Submission
Turn in both parts of the assignment on Gradescope as one submission. Upload
the zip file with your code to HW4 Code, and upload the PDF of your report
to HW4.