CS410: Artificial Intelligence
Homework 4: Reinforcement Learning & Regression
1. TD Learning.
(a) Consider an MDP with states A, B, C, D, E, and F, where state F is the terminal state. The agent receives a reward of $+1$ if it transitions to the terminal state F and a reward of $0$ otherwise, and the discount factor is $\gamma = 1$. Assume the current estimates of $V$ at time $t$ are $V_t(A) = 0.2$, $V_t(B) = 0.4$, $V_t(C) = 0.6$, $V_t(D) = 0.8$, $V_t(E) = 1.0$, and $V_t(F) = 0$. Further, suppose that the agent is at state C at time $t$ and experiences the transitions shown in Figure 1 from time $t+1$ to time $t+7$. Use temporal difference (TD) learning with learning rate $\alpha = 1/2$ to compute the new estimate of the $V$ value of the corresponding state each time the agent experiences a new transition. Finish this exercise by explicitly showing your computation process.
$V_{t+1}(C) = \underline{\hspace{1cm}}$, $V_{t+2}(B) = \underline{\hspace{1cm}}$, $V_{t+3}(A) = \underline{\hspace{1cm}}$, $V_{t+4}(B) = \underline{\hspace{1cm}}$,
$V_{t+5}(C) = \underline{\hspace{1cm}}$, $V_{t+6}(D) = \underline{\hspace{1cm}}$, $V_{t+7}(E) = \underline{\hspace{1cm}}$.
Figure 1: Problem 1 (a).
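For reference, here is a minimal Python sketch of the tabular TD(0) update used in this part. The `td_update` function and the example transition (C to B with reward 0) are illustrative assumptions, not the transitions of Figure 1.

```python
# Minimal sketch of a tabular TD(0) update (illustrative, not the solution).
# V is a dict of state-value estimates; alpha is the learning rate, gamma the discount.

def td_update(V, s, r, s_next, alpha=0.5, gamma=1.0):
    """Apply one TD(0) update to V[s] after observing transition (s, r, s_next)."""
    td_target = r + gamma * V[s_next]          # sample estimate of the value of s
    V[s] = V[s] + alpha * (td_target - V[s])   # move V(s) toward the target
    return V[s]

# Example with the initial estimates from the problem statement and a
# hypothetical transition C -> B with reward 0 (not necessarily Figure 1's).
V = {"A": 0.2, "B": 0.4, "C": 0.6, "D": 0.8, "E": 1.0, "F": 0.0}
print(td_update(V, "C", 0, "B"))  # 0.5 * 0.6 + 0.5 * (0 + 0.4) = 0.5
```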
(b) Prove the statement in Lecture 7, Slide 38 that a decreasing learning rate $\alpha$ can give converging averages in TD learning.
Update: Prove the statement in Lecture 7, Slide 38 that a decreasing learning rate $\alpha$ can give converging averages in TD learning by verifying that the sequence $\{V_n\}$ with $V_n = (1 - \alpha_n)V_{n-1} + \alpha_n x_n$ is a Cauchy sequence, under the assumptions that $\alpha_n = \frac{1}{n^2}$ for all $n > 0$, $|x_n| \le C_1$, and $|V_n| \le C_2$ for some constants $C_1 > 0$ and $C_2 > 0$.
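As a numerical sanity check (not a substitute for the proof), the sketch below simulates the recursion $V_n = (1 - \alpha_n)V_{n-1} + \alpha_n x_n$ with $\alpha_n = 1/n^2$ and bounded inputs $x_n$; the bound $C_1 = 1$, the uniform random inputs, and the seed are assumptions for illustration.

```python
# Numerical illustration (not a proof): with alpha_n = 1/n^2 and bounded x_n,
# the recursion V_n = (1 - alpha_n) V_{n-1} + alpha_n x_n settles down as n grows.
import random

random.seed(0)
C1 = 1.0          # assumed bound on |x_n|
V = 0.0           # V_0
for n in range(1, 10001):
    alpha_n = 1.0 / n**2
    x_n = random.uniform(-C1, C1)            # any bounded sequence works
    V = (1 - alpha_n) * V + alpha_n * x_n
    if n in (1, 10, 100, 1000, 10000):
        print(n, V)                          # successive values change less and less
```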
2. Q-Learning. Recall the statement in Lecture 7, Slide 45 that Q-learning converges to the optimal policy, even if you are acting suboptimally. The task in this exercise is to prove this statement step by step. Let us first cover some preliminaries. Denote by $(X, d)$ a metric space, where $X$ is a space and the metric $d : X \times X \to \mathbb{R}$ is defined on pairs of elements of $X$. For example, let $X = \mathbb{R}^n$ and $d(x_1, x_2) = \|x_1 - x_2\|_2$ for any $x_1, x_2 \in X$, where $\|\cdot\|_2$ denotes the 2-norm. Then $d(x_1, x_2)$ is the Euclidean distance between $x_1$ and $x_2$. Let $H : X \to X$ be a mapping from the space $X$ to itself. If there exists some $0 < L < 1$ such that $d(H(x), H(y)) \le L \cdot d(x, y)$ for all $x, y \in X$, then $H$ is called a contraction mapping. If there exists a point $x \in X$ such that $H(x) = x$, then $x$ is called a fixed point of $H$. If $H$ is a contraction mapping, then it can be shown that $H$ admits a unique fixed point $x^*$ in $X$.
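As a concrete illustration of these definitions, the sketch below iterates a contraction on $\mathbb{R}$ and watches it approach its unique fixed point; the toy map $H(x) = \tfrac{1}{2}x + 1$ is an assumed example, not part of the exercise.

```python
# Toy illustration of a contraction mapping and its fixed point (assumed example).
# H(x) = 0.5 * x + 1 is a contraction on R with L = 0.5; its fixed point solves
# x = 0.5 * x + 1, i.e. x* = 2, and repeated application converges to it.

def H(x):
    return 0.5 * x + 1

x = 10.0                      # arbitrary starting point
for _ in range(20):
    x = H(x)
print(x)                      # approximately 2.0, the unique fixed point
```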
(a) Recall that the optimal $Q^*$ satisfies the optimal Bellman equation
$$Q^*(s, a) = \sum_{s'} T(s, a, s')\Big[ R(s, a, s') + \gamma \max_{a'} Q^*(s', a') \Big].$$
If we treat $X$ as the space of Q functions (i.e., each element $q$ in $X$ is a Q function, which further specifies a Q value $q(s, a)$ for a given state-action pair $(s, a)$), and treat $H$ as
$$H : q(s, a) \mapsto \sum_{s'} T(s, a, s')\Big[ R(s, a, s') + \gamma \max_{a'} q(s', a') \Big],$$
then it is clear that $Q^*$ is the fixed point of $H$, since $Q^* = H(Q^*)$. Assume the discount factor satisfies $0 < \gamma < 1$. Prove that $H$ is a contraction mapping with respect to the metric
$$d(q_1, q_2) = \|q_1 - q_2\|_\infty = \max_{s,a} |q_1(s, a) - q_2(s, a)|$$
for any two Q functions $q_1$ and $q_2$, where $\|\cdot\|_\infty$ is the maximum norm (i.e., prove that $\|H(q_1) - H(q_2)\|_\infty \le \gamma \|q_1 - q_2\|_\infty$ for any two Q functions $q_1$ and $q_2$).
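The sketch below illustrates the claim numerically on an assumed two-state, two-action MDP (the arrays T and R are made up for illustration): applying the backup $H$ to two Q functions shrinks their max-norm distance by at least a factor of $\gamma$. It is an illustration only, not the requested proof.

```python
# Numerical illustration of the contraction claim on an assumed toy MDP
# (2 states, 2 actions; T and R below are made up for illustration only).
import numpy as np

gamma = 0.9
# T[s, a, s'] = transition probability, R[s, a, s'] = reward
T = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [1.0, 0.0]]])

def H(q):
    """Bellman backup: (Hq)(s,a) = sum_s' T[s,a,s'] * (R[s,a,s'] + gamma * max_a' q[s',a'])."""
    return np.einsum("ijk,ijk->ij", T, R + gamma * q.max(axis=1)[None, None, :])

q1 = np.random.rand(2, 2)
q2 = np.random.rand(2, 2)
lhs = np.abs(H(q1) - H(q2)).max()          # ||H(q1) - H(q2)||_inf
rhs = gamma * np.abs(q1 - q2).max()        # gamma * ||q1 - q2||_inf
print(lhs <= rhs + 1e-12)                  # True for these (and any) q1, q2
```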
(b) Consider a finite MDP $(S, A, T, R, \gamma)$ with finite state space $S$ and finite action space $A$. Assume the reward function $R$ is bounded and deterministic and $0 < \gamma < 1$. Recall that Q-learning updates the Q function as
$$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) - \alpha_t\big(Q_t(s_t, a_t) - \mathrm{sample}_t\big) = (1 - \alpha_t)\,Q_t(s_t, a_t) + \alpha_t \cdot \mathrm{sample}_t,$$
where $\mathrm{sample}_t = R(s_t, a_t, s') + \gamma \max_{a'} Q_t(s', a')$, $0 \le \alpha_t \le 1$ is the learning rate at time $t$, and $\{s_t\}$ is the sequence of states obtained by following policy $\pi$, which satisfies $P_\pi[A_t = a \mid S_t = s] > 0$ for all state-action pairs $(s, a)$. Prove that Q-learning converges to $Q^*$ if $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$. First construct a sequence $\Delta_{t+1}(s, a) = Q_t(s, a) - Q^*(s, a)$ using the update rule of Q-learning and $Q^*$. Then verify that the three assumptions in Lemma 1 hold. Finally, apply Lemma 1 to finish this exercise.
Update: Further assume that $Q^*(s, a)$ is bounded for all state-action pairs $(s, a)$, and that $Q_t(s, a)$ is bounded for all state-action pairs $(s, a)$ and all $t > 0$.
Lemma 1. The random process $\{\Delta_t\}$ taking values in $\mathbb{R}$ and defined as
$$\Delta_{t+1}(x) = (1 - \alpha_t)\,\Delta_t(x) + \alpha_t F_t(x)$$
converges to 0 under the following assumptions:
• $0 \le \alpha_t \le 1$, $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$;
• $\|\mathbb{E}[F_t \mid \mathcal{F}_t]\|_\infty \le \gamma \|\Delta_t\|_\infty$ with $\gamma < 1$, where $\mathcal{F}_t = \{\Delta_t, \Delta_{t-1}, \ldots, \Delta_1, F_{t-1}, \ldots, F_1\}$ stands for the past information at time $t$, and $\mathbb{E}[F_t \mid \mathcal{F}_t]$ denotes the conditional expectation of $F_t$ given $\mathcal{F}_t$;
• $\mathbb{V}[F_t(x) \mid \mathcal{F}_t] \le C\,(1 + \|\Delta_t\|_\infty)^2$ for some constant $C > 0$, where $\mathbb{V}[F_t(x) \mid \mathcal{F}_t]$ denotes the conditional variance of $F_t(x)$ given $\mathcal{F}_t$.
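For orientation only, a minimal tabular Q-learning sketch is given below. The toy two-state MDP, the $\epsilon$-greedy behavior policy, and the per-pair schedule $\alpha_t = 1/N(s,a)$ (which satisfies $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$) are assumptions for illustration, not part of the exercise or its proof.

```python
# Tabular Q-learning on an assumed toy MDP (illustration of the update rule only).
import random
random.seed(0)

gamma = 0.9
n_states, n_actions = 2, 2
# Deterministic toy dynamics: next_state[s][a] and reward[s][a] are made up.
next_state = [[0, 1], [1, 0]]
reward = [[0.0, 1.0], [0.0, 2.0]]

Q = [[0.0] * n_actions for _ in range(n_states)]
visits = [[0] * n_actions for _ in range(n_states)]   # counts for alpha = 1/N(s,a)
s = 0
for _ in range(50000):
    # epsilon-greedy behavior policy: every action has positive probability.
    if random.random() < 0.1:
        a = random.randrange(n_actions)
    else:
        a = max(range(n_actions), key=lambda a_: Q[s][a_])
    s_next, r = next_state[s][a], reward[s][a]
    visits[s][a] += 1
    alpha = 1.0 / visits[s][a]                         # Robbins-Monro schedule
    sample = r + gamma * max(Q[s_next])
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * sample   # Q-learning update
    s = s_next

print(Q)   # approaches Q* for this toy MDP despite the suboptimal behavior policy
```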
3. Regression. Recall the exercise in Lecture 8, Slide 21. Some economists say that GDP in the 'current year' has an effect on vehicle sales in the 'next year': in years when GDP was lower, vehicle sales in the following year were lower, and when GDP increased, vehicle sales in the next year increased as well. Model this with the equation $y = \theta_0 + \theta_1 x$, where $y$ is the number of vehicles sold in a year and $x$ is the GDP of the prior year. We need to find $\theta_0$ and $\theta_1$. The data between 2011 and 2016 are given in Figure 2.
(a) What is the normal equation?
(b) Suppose the GDP increase in 2017 is 7%. How many vehicles will be sold in 2018?
Figure 2: Problem 3.
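For orientation, the sketch below sets up an ordinary least-squares fit of $y = \theta_0 + \theta_1 x$ with NumPy. The arrays are toy placeholders and do not reproduce the table in Figure 2; the actual numbers must be taken from that figure.

```python
# Least-squares fit of y = theta0 + theta1 * x (toy placeholder data, not Figure 2's).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # placeholder: GDP of the prior year
y = np.array([3.1, 3.9, 5.2, 5.8, 7.1])        # placeholder: vehicles sold next year

X = np.column_stack([np.ones_like(x), x])      # design matrix with an intercept column
theta, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares solution for (theta0, theta1)
theta0, theta1 = theta
print(theta0, theta1)
print(theta0 + theta1 * 6.0)                   # prediction for a new (placeholder) x
```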