Homework 3
600.482/682 Deep Learning

Please submit a LaTeX-generated PDF
to Gradescope with entry code 9G83Y7.
1. We have talked about backpropagation in class, and here is supplementary material for
calculating the gradient for backpropagation (https://piazza.com/class_profile/get_
resource/jxcftju833c25t/k0labsf3cny4qw). Please study this material carefully before
you start this exercise. Suppose P = WX and L = f(P), where f is a loss function.
(a) Please show that ∂L/∂W = (∂L/∂P) Xᵀ. Show each step of your derivation.
(b) Suppose the loss function is the L2 loss, defined as L(y, ŷ) = ‖y − ŷ‖², where y
is the ground truth and ŷ is the prediction. Given the following initialization of W and X,
please calculate the updated W after one iteration (step size = 0.1). A numerical sanity
check is sketched after this problem.

$$
W = \begin{bmatrix} 0.3 & 0.5 \\ -0.2 & 0.4 \end{bmatrix}, \qquad
X = \begin{bmatrix} x_1 & x_2 \end{bmatrix} = \begin{bmatrix} 0 & 2 \\ 3 & 1 \end{bmatrix}, \qquad
Y = \begin{bmatrix} y_1 & y_2 \end{bmatrix} = \begin{bmatrix} 0.5 & 1 \\ 1 & -1.5 \end{bmatrix}
$$
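The NumPy sketch below is one way to sanity-check Problem 1: it forms the analytic gradient from the identity in part (a), compares it against a central-difference estimate, and then takes a single gradient-descent step with step size 0.1. It assumes the L2 loss is summed over both columns (samples) of X with no 1/2 factor; that convention and all variable names are illustrative assumptions, not part of the assignment.

```python
import numpy as np

# Values from Problem 1(b); the step size is given as 0.1.
W = np.array([[0.3, 0.5],
              [-0.2, 0.4]])
X = np.array([[0.0, 2.0],
              [3.0, 1.0]])
Y = np.array([[0.5, 1.0],
              [1.0, -1.5]])
step_size = 0.1

def loss(W):
    """L2 loss summed over both samples (assumed convention, no 1/2 factor)."""
    P = W @ X
    return np.sum((Y - P) ** 2)

# Analytic gradient via the identity from part (a): dL/dW = (dL/dP) X^T,
# with dL/dP = 2 (P - Y) for the summed squared error above.
P = W @ X
dL_dP = 2.0 * (P - Y)
dL_dW = dL_dP @ X.T

# Central-difference check that the analytic gradient matches the loss.
numeric = np.zeros_like(W)
eps = 1e-6
for i in range(2):
    for j in range(2):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        numeric[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)
print("max |analytic - numeric|:", np.abs(dL_dW - numeric).max())

# One gradient-descent update, as asked in part (b).
W_new = W - step_size * dL_dW
print("updated W:\n", W_new)
```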
2. In this exercise, we will explore how vanishing and exploding gradients affect the learning
process. Consider a simple, 1-dimensional, 3-layer network with data x ∈ R, prediction
ŷ ∈ [0, 1], true label y ∈ {0, 1}, and weights w1, w2, w3 ∈ R, where the weights are initialized
randomly from N(0, 1). We will use the sigmoid activation function σ between all layers,
and the cross-entropy loss function L(y, ŷ) = −(y log(ŷ) + (1 − y) log(1 − ŷ)). This network
can be represented as ŷ = σ(w3 · σ(w2 · σ(w1 · x))). Note that for this problem, we are not
including a bias term.
(a) Compute the derivative of the sigmoid function. What are the extreme values of this
derivative, and when are they reached?
(b) Consider a random initialization of w1 = 0.25, w2 = −0.11, w3 = 0.78, and a sample
from the data set (x = 0.63, y = 1). Using backpropagation, compute the gradients for
each weight. What do you notice about the magnitude of the gradients? (A sanity-check
sketch of this computation follows the problem.)
Now consider that we want to switch to a regression task and use a similar network
structure as we did above: we remove the final sigmoid activation, so our new network
is defined as ŷ = w3 · σ(w2 · σ(w1 · x)), where predictions ŷ ∈ R and targets y ∈ R; we
use the L2 loss function instead of cross-entropy: L(y, ŷ) = (y − ŷ)². Derive the gradient
of the loss function with respect to each of the weights w1, w2, w3.
(c) Consider again the random initialization of w1 = 0.25, w2 = −0.11, w3 = 0.78, and a
sample from the data set (x = 0.63, y = 128). Using backpropagation, compute the
gradients for each weight. What do you notice about the magnitude of the gradients?
(See the regression sketch after this problem.)
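As a hedged sanity check for Problem 2(b) (classification case), the sketch below implements the forward pass ŷ = σ(w3 · σ(w2 · σ(w1 · x))) and the chain-rule backward pass for the cross-entropy loss, using the standard simplification ∂L/∂z3 = ŷ − y for a sigmoid output. The helper names (sigmoid, sigmoid_prime, z1, a1, ...) are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # Derivative of the sigmoid activation (relevant to part (a)).
    s = sigmoid(z)
    return s * (1.0 - s)

# Initialization and sample from Problem 2(b).
w1, w2, w3 = 0.25, -0.11, 0.78
x, y = 0.63, 1.0

# Forward pass: yhat = sigmoid(w3 * sigmoid(w2 * sigmoid(w1 * x))).
z1 = w1 * x;  a1 = sigmoid(z1)
z2 = w2 * a1; a2 = sigmoid(z2)
z3 = w3 * a2; yhat = sigmoid(z3)

# Backward pass for cross-entropy with a sigmoid output: dL/dz3 = yhat - y.
dz3 = yhat - y
dL_dw3 = dz3 * a2
dz2 = dz3 * w3 * sigmoid_prime(z2)
dL_dw2 = dz2 * a1
dz1 = dz2 * w2 * sigmoid_prime(z1)
dL_dw1 = dz1 * x

print("dL/dw1 =", dL_dw1, " dL/dw2 =", dL_dw2, " dL/dw3 =", dL_dw3)
```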
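Similarly, for the regression variant in Problem 2(c) (no output sigmoid, L2 loss L(y, ŷ) = (y − ŷ)², target y = 128), a minimal sketch of the backward pass is below; the same caveat about illustrative helper names applies.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Initialization and sample from Problem 2(c); note the large target y = 128.
w1, w2, w3 = 0.25, -0.11, 0.78
x, y = 0.63, 128.0

# Forward pass without the output sigmoid: yhat = w3 * sigmoid(w2 * sigmoid(w1 * x)).
z1 = w1 * x;  a1 = sigmoid(z1)
z2 = w2 * a1; a2 = sigmoid(z2)
yhat = w3 * a2

# Backward pass for the L2 loss L = (y - yhat)^2, so dL/dyhat = -2 (y - yhat).
dyhat = -2.0 * (y - yhat)
dL_dw3 = dyhat * a2
dz2 = dyhat * w3 * sigmoid_prime(z2)
dL_dw2 = dz2 * a1
dz1 = dz2 * w2 * sigmoid_prime(z1)
dL_dw1 = dz1 * x

print("dL/dw1 =", dL_dw1, " dL/dw2 =", dL_dw2, " dL/dw3 =", dL_dw3)
```

Comparing the printed gradient magnitudes from this sketch with those from the classification sketch above is one way to observe the vanishing versus exploding behaviour the exercise asks about.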
