CSC413/2516
Homework 5
Submission: You must submit your solutions as a PDF file through MarkUs (https://markus.teach.cs.toronto.edu/csc413-2020-01). You can produce the file however you like (e.g. LaTeX, Microsoft Word, scanner), as long as it is readable. See the syllabus on the course website (https://csc413-2020.github.io/assets/misc/syllabus.pdf) for detailed policies. You may ask questions about the assignment on Piazza (https://piazza.com/class/k58ktbdnt0h1wx?cid=1). Note that 10% of the homework mark may be removed for a lack of neatness.

Homework written by Harris Chan and Silviu Pitis. Contact: csc413-2020-01-tas@cs.toronto.edu
1 Reversible Architectures
Reversible architectures are based on a reversible block. In this section, we will investigate a variant that implements the reversible block with affine coupling layers.

Consider the following reversible affine coupling block:

    y_1 = exp(G(x_2)) ◦ x_1 + F(x_2)
    y_2 = exp(s) ◦ x_2,

where ◦ denotes element-wise multiplication, the inputs x_1, x_2 ∈ R^{d/2}, and the functions F and G map from R^{d/2} to R^{d/2}. This modified block is identical to the ordinary reversible block, except that x_1 and x_2 are additionally multiplied element-wise by the vectors exp(G(x_2)) and exp(s), respectively.
You don’t need to justify your answers for this question, but doing so may help you receive partial credit.
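For concreteness, here is a minimal NumPy sketch of the forward pass of this coupling block. The functions F and G are assumed to be arbitrary maps from R^{d/2} to R^{d/2}; the fixed random linear maps below (and the particular value of d) are placeholders chosen only for illustration.

    import numpy as np

    d = 8
    rng = np.random.default_rng(0)

    # Placeholder coupling functions F, G : R^{d/2} -> R^{d/2}; any functions work.
    W_F = rng.standard_normal((d // 2, d // 2))
    W_G = rng.standard_normal((d // 2, d // 2))

    def F(x):
        return x @ W_F.T

    def G(x):
        return x @ W_G.T

    s = rng.standard_normal(d // 2)  # the scaling vector s from the block definition

    def coupling_forward(x):
        """Forward pass of the affine coupling block defined above."""
        x1, x2 = np.split(x, 2)
        y1 = np.exp(G(x2)) * x1 + F(x2)   # y_1 = exp(G(x_2)) ◦ x_1 + F(x_2)
        y2 = np.exp(s) * x2               # y_2 = exp(s) ◦ x_2
        return np.concatenate([y1, y2])

    x = rng.standard_normal(d)
    y = coupling_forward(x)
    print(y.shape)  # (8,)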
1.1 Inverting reversible block [1pt]
As a warm-up, give equations for inverting this block, i.e., computing x_1 and x_2 from y_1 and y_2. You may use / to denote element-wise division.
1.2 Jacobian of the reversible block [1pt]
Give a formula for the Jacobian ∂y/∂x, where y denotes the concatenation of y_1 and y_2. You may denote the solution as a block matrix, as long as you clearly define what each block of the matrix corresponds to.
Note: You may leave the derivatives of G and F with respect to x_2 as ∂G/∂x_2 and ∂F/∂x_2.
1.3 Determinant of Jacobian [1pt]
Give a formula for the determinant of the Jacobian from the previous part, i.e., compute det(∂y/∂x).
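For Parts 1.2 and 1.3, the following sketch can serve as a numerical check: it computes the Jacobian ∂y/∂x of the coupling block by finite differences and then its determinant, without using any analytic formula. The choices of d, F, G, and s are arbitrary placeholders.

    import numpy as np

    d = 6
    rng = np.random.default_rng(1)
    W_F = rng.standard_normal((d // 2, d // 2))
    W_G = rng.standard_normal((d // 2, d // 2))
    s = rng.standard_normal(d // 2)

    def block(x):
        # y_1 = exp(G(x_2)) ◦ x_1 + F(x_2),  y_2 = exp(s) ◦ x_2
        x1, x2 = np.split(x, 2)
        y1 = np.exp(x2 @ W_G.T) * x1 + x2 @ W_F.T
        y2 = np.exp(s) * x2
        return np.concatenate([y1, y2])

    def jacobian_fd(f, x, eps=1e-6):
        # Finite-difference Jacobian ∂y/∂x, one column per input coordinate.
        cols = []
        for i in range(x.size):
            e = np.zeros_like(x); e[i] = eps
            cols.append((f(x + e) - f(x - e)) / (2 * eps))
        return np.stack(cols, axis=1)

    x = rng.standard_normal(d)
    J = jacobian_fd(block, x)
    print(J)                      # compare its block structure with your answer to 1.2
    print(np.linalg.det(J))       # compare with your formula from 1.3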
2 Policy gradients and black box optimization
Very often we have a function f that does not give us useful gradient information: input or output may
be discrete; f may be piecewise constant, nowhere differentiable, or have pathological gradients (e.g., a
discontinuous saw wave on an incline, whose gradient always points away from the global optimum); or f
may be a black box that we cannot backpropagate through. For example, we may have a phone app that
labels photos as cats or dogs. This situation is the default in Reinforcement Learning (RL), where we can
execute the environment dynamics, but we cannot see or control their internals.
We still, however, want to optimize some score function J[f] : X → R. For example, in RL, we want to
learn a policy that maximizes the non-differentiable environment reward.
When using the REINFORCE strategy, we replace the exact gradient with respect to θ with a Monte Carlo approximation. One of the key factors for a successful application of REINFORCE is the variance of this estimator: the higher the variance, the noisier the gradient estimates will be, which can slow down the optimization process. In this section we will derive the variance of the REINFORCE estimator for a simple toy task.
Consider a loss function f(ã), the zero-one loss applied to a sample ã from the logistic regression output distribution p(a|θ). The input vector has D independent scalar features x_d. We evaluate the performance of the classifier by sampling from the output of the sigmoid, µ. The loss function J(θ) can be written as:

    µ = σ( Σ_{d=1}^{D} θ_d x_d ),                                        (1)
    p(a|θ) = Bernoulli(µ), i.e. p(a = 1|θ) = µ and p(a = 0|θ) = 1 − µ,   (2)
    ã ∼ p(a|θ),                                                          (3)
    f(ã) = 1 if ã = 1, and f(ã) = 0 if ã = 0,                            (4)
    J(θ) = E_{ã∼p(a|θ)}[f(ã)].                                           (5)
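To make the toy task concrete, the following sketch (with arbitrary placeholder values for D, θ, and x) samples ã from p(a|θ) and estimates J(θ) by Monte Carlo; since f(ã) = ã for the zero-one loss above, the estimate should match µ.

    import numpy as np

    rng = np.random.default_rng(0)

    D = 5
    theta = rng.standard_normal(D)   # logistic regression parameters (placeholder values)
    x = rng.standard_normal(D)       # one input with D scalar features (placeholder values)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    mu = sigmoid(theta @ x)          # Eq. (1)

    # Monte Carlo estimate of J(theta) = E[f(a~)] (Eq. 5); here f(a~) = a~.
    a_samples = rng.binomial(1, mu, size=100_000)   # a~ ~ Bernoulli(mu), Eqs. (2)-(3)
    J_hat = a_samples.mean()
    print(J_hat, mu)                                # the two should be close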
2.1 Closed form expression for REINFORCE estimator [1pt]
Recall from above that the expression for the REINFORCE estimator is:

    ∇_θ J[θ] = E_{ã∼p(a|θ)}[ f(ã) (∂/∂θ) log p(a = ã|θ) ].               (6)

We denote the expression inside the expectation by g[θ, ã]:

    g[θ, ã] = f(ã) (∂/∂θ) log p(a = ã|θ),   ã ∼ p(a|θ).                  (7)

For this question, derive a closed form for g[θ, ã] as a deterministic function of ã, µ, θ, and x_d.
Hint: Substitute in the log likelihood of the Bernoulli distribution.
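The sketch below can serve as a numerical sanity check for your closed form: it estimates Eq. (6) by Monte Carlo, but computes the score ∂/∂θ log p(a = ã|θ) by finite differences rather than analytically, so the derivation itself is not given away. The values of D, θ, and x are arbitrary placeholders.

    import numpy as np

    rng = np.random.default_rng(0)
    D = 5
    theta = rng.standard_normal(D)
    x = rng.standard_normal(D)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def log_prob(a, theta):
        mu = sigmoid(theta @ x)
        return np.log(mu) if a == 1 else np.log(1.0 - mu)

    def score(a, theta, eps=1e-6):
        # Finite-difference approximation of (d/d theta) log p(a | theta).
        g = np.zeros_like(theta)
        for d in range(D):
            e = np.zeros_like(theta); e[d] = eps
            g[d] = (log_prob(a, theta + e) - log_prob(a, theta - e)) / (2 * eps)
        return g

    mu = sigmoid(theta @ x)
    samples = rng.binomial(1, mu, size=10_000)
    # g[theta, a~] = f(a~) * score, with f(a~) = a~ for the zero-one loss above.
    grads = np.stack([a * score(a, theta) for a in samples])
    print(grads.mean(axis=0))   # Monte Carlo estimate of Eq. (6); compare with your closed form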
2.2 Variance of REINFORCE estimator [1pt]
We will derive the variance of the REINFORCE estimator above. Since the gradient is D-dimensional, the covariance of the gradient estimates is a D × D matrix. In this question, we will only consider the variance with respect to the first parameter, i.e. V[g[θ, ã]_1], the scalar corresponding to the first element on the diagonal of the covariance matrix. Derive this variance, V[g[θ, ã]_1], as a function of µ, θ, and x_d.
Hint: The variance of a Bernoulli random variable is µ(1 − µ).
2.3 Variance of REINFORCE estimator with baselines [1pt]
Comment on the variance in Part 2.2. When do we expect learning to converge slowly in terms of the output
of the logistic regression model, µ?
3 Approximate dynamic programming
Q-learning and Markov Decision Processes are powerful tools for solving dynamic programming problems using neural networks. For this question we will study a simpler setup in order to understand the properties of these approximate dynamic programming algorithms. Consider a finite-state Markov Reward Process (MRP) induced by a behavior policy π, described by a tuple (S, P^π, R, γ), where s ∈ S is a state, P^π(s, s′) is the probability of transitioning from state s into state s′ (so Σ_{s′} P^π(s, s′) = 1), R(s) is the reward in state s, and γ ∈ [0, 1) is a discount factor. We define the value V^π(s) of state s under policy π as the expected sum of discounted rewards when starting in that state and following the transition model P^π ad infinitum:

    V^π(s) = E_{s_t ∼ P^π, s_0 = s}[ Σ_{t=0}^{∞} γ^t R(s_t) ].
By evaluating V^π(s) we are doing what is called policy evaluation, which is a critical component of many RL algorithms. For instance, in policy iteration, we interleave policy evaluation steps with policy improvement steps, and in Q-learning, we are evaluating the greedy policy with respect to the current value function.
To simplify things we will assume that we know the transition model P^π (or at least, can compute dynamic programming estimates as if it were known), and consider solving this problem using synchronous 1-step dynamic programming, or Bellman, updates. Let v^π be the true value vector, whose i-th entry is V^π(s_i); let P^π be the transition matrix, whose (i, j)-th entry is P^π(s_i, s_j); and let r be the reward vector, whose i-th entry is R(s_i). We have the following dynamic programming relation (the Bellman equation):

    v^π = T^π v^π ≜ r + γ P^π v^π,      (why?)
where T^π is known as the "Bellman operator". Note that v^π is the only vector v ∈ R^{|S|} that satisfies the Bellman relation, since for any perturbation ε:

    v^π + ε = T^π(v^π + ε)                          (8)
    ⟺ v^π = r + γ P^π (v^π + ε) − ε                 (9)
    ⟺ γ P^π ε − ε = 0                               (10)
    ⟺ ε = 0.      (why?)                            (11)
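As a concrete illustration of this fixed point, here is a minimal NumPy sketch on a made-up 3-state MRP (the particular P, r, and γ are arbitrary): it solves the Bellman equation directly as v^π = (I − γ P^π)^{-1} r and checks that repeated synchronous Bellman updates converge to the same vector.

    import numpy as np

    # A made-up 3-state MRP, purely for illustration (not part of the assignment).
    P = np.array([[0.5, 0.5, 0.0],
                  [0.1, 0.6, 0.3],
                  [0.2, 0.2, 0.6]])   # rows sum to 1
    r = np.array([1.0, 0.0, 2.0])
    gamma = 0.9

    # The Bellman equation v = r + gamma * P v has the unique solution below.
    v_exact = np.linalg.solve(np.eye(3) - gamma * P, r)

    # Repeated synchronous Bellman updates v <- T v converge to that fixed point.
    v = np.zeros(3)
    for _ in range(1000):
        v = r + gamma * P @ v
    print(np.max(np.abs(v - v_exact)))   # ~0: the iterates reach v^pi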
To get a better understanding of the Bellman updates, let us consider a linear regression model to approximate the value function:

    v̂^π = Xw ≈ v^π,                                 (12)

where we represent the states using a feature matrix X ∈ R^{|S|×d}, whose i-th row is the d-dimensional feature vector for the i-th state, and model the value function as v̂^π = Xw ≈ v^π.

In this case (and also when using neural networks to approximate value functions), T^π Xw may not be representable as Xw′ for any w′. Thus, approximate dynamic programming must project T^π Xw back into the column space of X. This is done using an additional projection operator Π, so that the Bellman operator becomes ΠT^π. Unfortunately, the projected Bellman operator does not guarantee convergence, as we will see in Part 3.2 below.
3.1 Approximate policy evaluation [1pt]
The usual synchronous (update all states) policy evaluation (dynamic programming) update is:

    v_{t+1} ← v_t + α(r + γ P^π v_t − v_t)           (13)

for some learning rate α ∈ (0, 1). As noted above Eq. (11), policy evaluation converges.

When using a linear model to approximate the current value function, v̂ = Xw, approximate policy evaluation becomes minimizing the following objective function:

    J(w_t) = (1/2) ‖T^π X w̄_t − X w_t‖_2^2,          (14)

where w̄_t in (14) is a copy of w_t through which gradients do not flow (i.e., w̄_t = w_t but ∇_{w_t} w̄_t = 0).

Derive the gradient descent update rule for the weights, w_{t+1} ← w_t − α ∇_{w_t} J, in terms of γ, P^π, X, w, and the learning rate α.
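You can check your derived update numerically. The sketch below builds a made-up MRP and feature matrix (arbitrary placeholders), treats the bootstrapped target T^π X w̄_t as a constant, and approximates ∇_{w_t} J in (14) by finite differences, so the closed-form answer is not revealed.

    import numpy as np

    rng = np.random.default_rng(0)

    # Made-up MRP and features, purely for checking the update numerically.
    n_states, d = 4, 2
    P = rng.dirichlet(np.ones(n_states), size=n_states)   # rows sum to 1
    r = rng.standard_normal(n_states)
    X = rng.standard_normal((n_states, d))
    gamma, alpha = 0.9, 0.1

    def J(w, w_bar):
        # Eq. (14): the target T^pi X w_bar is treated as a constant.
        target = r + gamma * P @ (X @ w_bar)
        return 0.5 * np.sum((target - X @ w) ** 2)

    def grad_fd(w, eps=1e-6):
        # Finite-difference gradient of J w.r.t. w, with w_bar frozen at w.
        g = np.zeros_like(w)
        for i in range(d):
            e = np.zeros(d); e[i] = eps
            g[i] = (J(w + e, w) - J(w - e, w)) / (2 * eps)
        return g

    w = rng.standard_normal(d)
    w_next = w - alpha * grad_fd(w)   # compare against your derived update rule
    print(w_next)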
3.2 Divergence of approximate policy evaluation [0pt]
In the previous part, the exact tabular update (13) corresponds to setting X = I. In approximate RL and DP, we instead use neural networks and other parameterized models to enable generalization. Unfortunately, this can cause problems. The following is a minimal example showing that dynamic programming with linear function approximation can diverge. It is inspired by "Tsitsiklis and Van Roy's counterexample" [2]. For a more classic (but also more complicated) example of divergence, see "Baird's counterexample" [1].
Consider an approximate MRP with:

    X = [ 1 ]      P = [ ε   1 − ε ]      r = [ 0 ]
        [ 2 ],         [ ε   1 − ε ],         [ 0 ].
Here there are two states and d = 1, so we have a parameter vector w ∈ R^1, and since r = 0, it is clear that the optimum is w* = 0 for all possible γ. We use the same loss as above: J(w_t) = (1/2) ‖r + γ P X w̄_t − X w_t‖_2^2.

Derive an expression for w_{t+1} in terms of w_t, the learning rate α ∈ (0, 1), γ ∈ (0, 1), and ε ∈ (0, 1). Your answer should be in the form w_{t+1} = A w_t, where A is a function of α, γ, and ε.

Find one setting of α, γ, and ε for which repeated updates will diverge (there are many), given any initialization w_0 ≠ 0.

Is ΠT^π a contraction in this case? (Note: since we are running one step of gradient descent rather than doing a full projection onto the column space of X, Π isn't a minimum-norm projection. It turns out, however, that solving for the minimum-norm solution here also results in divergence.)
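If you want to experiment numerically, the sandbox below simulates the semi-gradient updates for this example, assuming X = [1, 2]^T, P = [[ε, 1 − ε], [ε, 1 − ε]], and r = 0 as above. The gradient is again taken by finite differences so the closed form of A is not spoiled, and the particular values of α, γ, and ε are arbitrary placeholders for you to vary while watching whether |w_t| blows up.

    import numpy as np

    X = np.array([[1.0], [2.0]])
    r = np.zeros(2)

    # Arbitrary placeholder settings; vary these and observe the behaviour of |w_t|.
    alpha, gamma, eps_p = 0.1, 0.5, 0.5
    P = np.array([[eps_p, 1 - eps_p],
                  [eps_p, 1 - eps_p]])

    def J(w, w_bar):
        target = r + gamma * P @ (X @ w_bar)   # frozen bootstrapped target
        return 0.5 * np.sum((target - X @ w) ** 2)

    def grad_fd(w, h=1e-6):
        # Finite-difference gradient of J w.r.t. w, with the target frozen at w.
        return (J(w + h, w) - J(w - h, w)) / (2 * h)

    w = np.array([1.0])   # any nonzero initialization
    for t in range(50):
        w = w - alpha * grad_fd(w)
    print(abs(w[0]))      # grows without bound for some settings of (alpha, gamma, eps)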
References
[1] Baird, L. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995. Elsevier, 1995, pp. 30–37.
[2] Tsitsiklis, J. N., and Van Roy, B. Analysis of temporal-difference learning with function approximation. In Advances in Neural Information Processing Systems (1997), pp. 1075–1081.