Reinforcement Learning Assignment 4
1 Introduction
The goal of this assignment is to experiment with the deep Q-network (DQN),
which combines Q-learning with neural network function approximation. In classical
Q-learning methods, maintaining the action-value function Q becomes intractable
as the state space and action space grow. DQN brings the success of deep learning
to Q-learning and has achieved super-human play in Atari games. Your goal is to
implement the DQN algorithm and one of its improved variants, and to apply them
in a classical RL control scenario.
2 Deep Q-learning
Sometimes a nonlinear function approximator is used instead, such as a neural
network. We refer to a neural network function approximator with weights $\theta$ as a
Q-network. A Q-network can be trained by adjusting the parameters $\theta_i$ at iteration
$i$ to reduce the mean-squared error in the Bellman equation, where the optimal
target values $r + \gamma \max_{a'} Q^*(s', a')$ are substituted with approximate target values
$y = r + \gamma \max_{a'} Q(s', a'; \theta_i^-)$, using parameters $\theta_i^-$ from some previous iteration.
This leads to a sequence of loss functions $L_i(\theta_i)$ that changes at each iteration $i$,
$$L_i(\theta_i) = \mathbb{E}_{s,a,r}\!\left[\big(\mathbb{E}_{s'}[y \mid s,a] - Q(s,a;\theta_i)\big)^2\right] = \mathbb{E}_{s,a,r,s'}\!\left[\big(y - Q(s,a;\theta_i)\big)^2\right] + \mathbb{E}_{s,a,r}\!\left[\operatorname{Var}_{s'}[y]\right].$$
Note that the targets depend on the network weights; this is in contrast with the
targets used for supervised learning, which are fixed before learning begins. At
each stage of optimization, we hold the parameters from the previous iteration $\theta_i^-$
fixed when optimizing the $i$th loss function $L_i(\theta_i)$, resulting in a sequence of well-defined
optimization problems. The final term is the variance of the targets, which
does not depend on the parameters $\theta_i$ that we are currently optimizing, and may
therefore be ignored. Differentiating the loss function with respect to the weights
we arrive at the following gradient:
$$\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s,a,r,s'}\!\left[\Big(r + \gamma \max_{a'} Q(s', a'; \theta_i^-) - Q(s, a; \theta_i)\Big)\, \nabla_{\theta_i} Q(s, a; \theta_i)\right].$$
Rather than computing the full expectations in the above gradient, it is often
computationally expedient to optimize the loss function by stochastic gradient
descent. The familiar Q-learning algorithm [19] can be recovered in this framework
by updating the weights after every time step, replacing the expectations using
single samples, and setting $\theta_i^- = \theta_{i-1}$.
Note that this algorithm is model-free: it solves the reinforcement learning task
directly using samples from the emulator, without explicitly estimating the reward
and transition dynamics $P(r, s' \mid s, a)$. It is also off-policy: it learns about the greedy
policy $a = \operatorname{argmax}_{a'} Q(s, a'; \theta)$, while following a behaviour distribution that ensures
adequate exploration of the state space. In practice, the behaviour distribution is
often selected by an $\epsilon$-greedy policy that follows the greedy policy with probability
$1 - \epsilon$ and selects a random action with probability $\epsilon$.
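As a concrete illustration (not part of the original paper), the following minimal PyTorch sketch computes the single-sample version of this loss for a minibatch of transitions; the network sizes, tensor names, and the helper dqn_loss are placeholders introduced here. The target $y = r + \gamma \max_{a'} Q(s', a'; \theta^-)$ is built from a frozen copy of the parameters, so gradients flow only into $Q(s, a; \theta)$.

import torch
import torch.nn as nn

# Hypothetical online and target Q-networks with identical (illustrative) architectures.
q_net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 3))
target_net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 3))
target_net.load_state_dict(q_net.state_dict())  # theta^- <- theta

gamma = 0.99  # discount factor

def dqn_loss(states, actions, rewards, next_states, dones):
    # actions: LongTensor of shape (batch,); dones: 1.0 for terminal transitions, else 0.0.
    # Q(s, a; theta) for the actions that were actually taken.
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # y = r + gamma * max_a' Q(s', a'; theta^-); no gradient flows through the target.
    with torch.no_grad():
        max_next_q = target_net(next_states).max(dim=1).values
        y = rewards + gamma * (1.0 - dones) * max_next_q
    return nn.functional.mse_loss(q_sa, y)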
Training algorithm for deep Q-networks. The full algorithm for training deep
Q-networks is presented in Algorithm 1. The agent selects and executes actions
according to an $\epsilon$-greedy policy based on Q. Because using histories of arbitrary
length as inputs to a neural network can be difficult, our Q-function instead works
on a fixed-length representation of histories produced by the function $\phi$ described
above. The algorithm modifies standard online Q-learning in two ways to make it
suitable for training large neural networks without diverging.
First, we use a technique known as experience replay [23] in which we store the
agent's experiences at each time-step, $e_t = (s_t, a_t, r_t, s_{t+1})$, in a data set $D_t = \{e_1, \ldots, e_t\}$,
pooled over many episodes (where the end of an episode occurs when a terminal state is reached) into a replay memory. During the inner loop of the algorithm,
we apply Q-learning updates, or minibatch updates, to samples of experience,
$(s, a, r, s') \sim U(D)$, drawn at random from the pool of stored samples. This approach
has several advantages over standard online Q-learning. First, each step of experience
is potentially used in many weight updates, which allows for greater data efficiency.
Second, learning directly from consecutive samples is inefficient, owing to the strong
correlations between the samples; randomizing the samples breaks these correlations and therefore reduces the variance of the updates. Third, when learning on-policy the current parameters determine the next data sample that the parameters
are trained on. For example, if the maximizing action is to move left then the training samples will be dominated by samples from the left-hand side; if the maximizing action then switches to the right then the training distribution will also switch.
It is easy to see how unwanted feedback loops may arise and the parameters could get
stuck in a poor local minimum, or even diverge catastrophically [20]. By using experience
replay the behaviour distribution is averaged over many of its previous states,
smoothing out learning and avoiding oscillations or divergence in the parameters.
Note that when learning by experience replay, it is necessary to learn off-policy
(because our current parameters are different to those used to generate the sample), which motivates the choice of Q-learning.
In practice, our algorithm only stores the last N experience tuples in the replay
memory, and samples uniformly at random from D when performing updates. This
approach is in some respects limited because the memory buffer does not differentiate important transitions and always overwrites with recent transitions owing
to the finite memory size N. Similarly, the uniform sampling gives equal importance to all transitions in the replay memory. A more sophisticated sampling strategy might emphasize transitions from which we can learn the most, similar to
prioritized sweeping [30].
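A replay memory with these properties can be sketched in a few lines of Python; the class and method names below are our own and this is only a minimal reference implementation, not the one used in the paper.

import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity):
        # deque(maxlen=...) silently drops the oldest transition once capacity N is reached.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling U(D) breaks the correlation between consecutive steps.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)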
The second modification to online Q-learning aimed at further improving the
stability of our method with neural networks is to use a separate network for generating the targets $y_j$ in the Q-learning update. More precisely, every C updates we
clone the network Q to obtain a target network $\hat{Q}$ and use $\hat{Q}$ for generating the
Q-learning targets $y_j$ for the following C updates to Q. This modification makes the
algorithm more stable compared to standard online Q-learning, where an update
that increases $Q(s_t, a_t)$ often also increases $Q(s_{t+1}, a)$ for all $a$ and hence also increases
the target $y_j$, possibly leading to oscillations or divergence of the policy. Generating
the targets using an older set of parameters adds a delay between the time an update
to Q is made and the time the update affects the targets $y_j$, making divergence or
oscillations much more unlikely.
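In code, the periodic cloning is just a parameter copy into a second network that is used only for computing targets; a minimal sketch, assuming a PyTorch model and an illustrative value of C:

import copy
import torch.nn as nn

# Hypothetical online network; the target network starts out as an exact clone.
q_net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 3))
target_net = copy.deepcopy(q_net)

C = 500  # target-network update period (illustrative, not the paper's value)

for update_step in range(1, 5001):
    # ... one minibatch Q-learning update on q_net would go here ...
    if update_step % C == 0:
        # theta^- <- theta: these frozen parameters define the targets for the next C updates.
        target_net.load_state_dict(q_net.state_dict())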
We also found it helpful to clip the error term from the update, $r + \gamma \max_{a'} Q(s', a'; \theta_i^-) - Q(s, a; \theta_i)$, to be between $-1$ and $1$. Because the absolute value loss
function $|x|$ has a derivative of $-1$ for all negative values of $x$ and a derivative of $1$
for all positive values of $x$, clipping the squared error to be between $-1$ and $1$ corresponds to using an absolute value loss function for errors outside of the $(-1, 1)$
interval. This form of error clipping further improved the stability of the algorithm.
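Clipping the TD error to $[-1, 1]$ in this way has the same gradient as the Huber loss with unit threshold (quadratic inside the interval, linear outside), so in PyTorch it is commonly realised with nn.SmoothL1Loss rather than by clipping explicitly; a minimal sketch, with names of our own choosing:

import torch
import torch.nn as nn

huber = nn.SmoothL1Loss()  # quadratic for |error| <= 1, linear (slope +/-1) outside

def clipped_td_loss(q_sa, targets):
    # q_sa: Q(s_j, a_j; theta) for the sampled actions; targets: y_j from the target network.
    # The gradient of this loss equals that of the squared error with the TD error
    # clipped to [-1, 1], which is the stabilisation described above.
    return huber(q_sa, targets.detach())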
Algorithm 1: deep Q-learning with experience replay.
Initialize replay memory $D$ to capacity $N$
Initialize action-value function $Q$ with random weights $\theta$
Initialize target action-value function $\hat{Q}$ with weights $\theta^- = \theta$
For episode $= 1, M$ do
    Initialize sequence $s_1 = \{x_1\}$ and preprocessed sequence $\phi_1 = \phi(s_1)$
    For $t = 1, T$ do
        With probability $\epsilon$ select a random action $a_t$
        otherwise select $a_t = \operatorname{argmax}_a Q(\phi(s_t), a; \theta)$
        Execute action $a_t$ in emulator and observe reward $r_t$ and image $x_{t+1}$
        Set $s_{t+1} = s_t, a_t, x_{t+1}$ and preprocess $\phi_{t+1} = \phi(s_{t+1})$
        Store transition $(\phi_t, a_t, r_t, \phi_{t+1})$ in $D$
        Sample random minibatch of transitions $(\phi_j, a_j, r_j, \phi_{j+1})$ from $D$
        Set $y_j = r_j$ if episode terminates at step $j+1$,
        otherwise $y_j = r_j + \gamma \max_{a'} \hat{Q}(\phi_{j+1}, a'; \theta^-)$
        Perform a gradient descent step on $(y_j - Q(\phi_j, a_j; \theta))^2$ with respect to the network parameters $\theta$
        Every $C$ steps reset $\hat{Q} = Q$
    End For
End For
Figure 1: Deep Q-learning with experience replay
You can refer to the original paper for the details of DQN: "Human-level
control through deep reinforcement learning." Nature 518.7540 (2015): 529.
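To make Algorithm 1 concrete, the sketch below assembles the pieces for a low-dimensional control task such as MountainCar-v0 (the environment used in the next section). It is a minimal skeleton under the classic gym reset/step API (newer Gym/Gymnasium versions return extra values), not the paper's implementation, and every hyperparameter value shown is illustrative.

import random
from collections import deque

import gym
import numpy as np
import torch
import torch.nn as nn

env = gym.make("MountainCar-v0")
n_obs = env.observation_space.shape[0]  # 2: position and velocity
n_act = env.action_space.n              # 3 discrete actions

q_net = nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, n_act))
target_net = nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, n_act))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

memory = deque(maxlen=50_000)  # replay memory D with capacity N
gamma, epsilon, batch_size, C = 0.99, 0.1, 64, 500
step_count = 0

for episode in range(200):
    state = env.reset()
    done = False
    while not done:
        # epsilon-greedy behaviour policy based on Q.
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            with torch.no_grad():
                action = int(q_net(torch.as_tensor(state, dtype=torch.float32)).argmax())

        next_state, reward, done, _ = env.step(action)
        memory.append((state, action, reward, next_state, float(done)))
        state = next_state
        step_count += 1

        if len(memory) >= batch_size:
            batch = random.sample(memory, batch_size)
            s, a, r, s2, d = [torch.as_tensor(np.array(x), dtype=torch.float32)
                              for x in zip(*batch)]
            q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
            with torch.no_grad():
                y = r + gamma * (1.0 - d) * target_net(s2).max(dim=1).values
            loss = nn.functional.smooth_l1_loss(q_sa, y)  # Huber loss ~ error clipping
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        if step_count % C == 0:
            target_net.load_state_dict(q_net.state_dict())

env.close()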
3 Experiment Description
• Programming language: python3
• You should compare the performance of DQN and one of its improved
variants (for example, Double DQN or Dueling DQN) and test them in a classical
RL control environment, MountainCar. OpenAI Gym provides this environment,
implemented in Python (https://gym.openai.com/envs/MountainCar-v0/). Gym also
provides other, more complex environments such as Atari games and MuJoCo.
Since the state is a low-dimensional vector (the car's position and velocity),
convolutional layers are not necessary in this experiment. You can get started
with OpenAI Gym by following this guide (https://gym.openai.com/docs/); a minimal
interaction loop is sketched after this list. Note that it is suggested to
implement your neural network with TensorFlow or PyTorch.
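As referenced in the list above, here is a minimal interaction loop with MountainCar-v0, assuming the classic gym API (newer Gymnasium releases return additional values from reset and step); the random policy is only there to show the control flow.

import gym

env = gym.make("MountainCar-v0")
state = env.reset()              # observation: [position, velocity]
done, episode_return = False, 0.0

while not done:
    action = env.action_space.sample()           # random action in {0, 1, 2}
    next_state, reward, done, info = env.step(action)
    episode_return += reward
    state = next_state

print("episode return:", episode_return)
env.close()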
4 Report and Submission
• Your report and source code should be compressed into a single archive named "studentID+name+assignment4".