CSC321
Homework 3
Submission: You must submit your solutions as a PDF file through MarkUs.[1] You can produce
the file however you like (e.g. LaTeX, Microsoft Word, scanner), as long as it is readable.
Late Submission: MarkUs will remain open until 2 days after the deadline; until that time, you
should submit through MarkUs. If you want to submit the assignment more than 2 days late,
please e-mail it to csc321ta@cs.toronto.edu. The reason for this is that MarkUs won’t let us
collect the homeworks until the late period has ended, and we want to be able to return them to
you in a timely manner.
Weekly homeworks are individual work. See the Course Information handout[2] for detailed policies.
1. Hard-Coding a Network. [2pts] In this problem, you need to find a set of weights and
biases for a multilayer perceptron which determines if a list of length 4 is in sorted order.
More specifically, you receive four inputs x1, …, x4, where each xi ∈ R, and the network must
output 1 if x1 < x2 < x3 < x4, and 0 otherwise. You will use the following architecture (figure
not reproduced here: four input units feeding a single hidden layer of three units, which feeds
one output unit):
All of the hidden units and the output unit use a hard threshold activation function:
φ(z) = 1 if z ≥ 0, and 0 if z < 0
Please give a set of weights and biases for the network which correctly implements this function
(including cases where some of the inputs are equal). Your answer should include:
• A 3 × 4 weight matrix W^(1) for the hidden layer
• A 3-dimensional vector of biases b^(1) for the hidden layer
• A 3-dimensional weight vector w^(2) for the output layer
• A scalar bias b^(2) for the output layer
You do not need to show your work.
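(Not required for the submission.) If you want to sanity-check a candidate answer before writing it down, a short NumPy script along the following lines compares the network against the target function on random inputs and a few cases with ties. The parameter values shown are zero placeholders, not a solution; substitute your own W^(1), b^(1), w^(2), b^(2).

import numpy as np

# Placeholder parameters -- substitute your own answer here.
W1 = np.zeros((3, 4))   # 3 x 4 hidden-layer weights
b1 = np.zeros(3)        # hidden-layer biases
w2 = np.zeros(3)        # output-layer weights
b2 = 0.0                # output-layer bias

def hard_threshold(z):
    # phi(z) = 1 if z >= 0, else 0 (elementwise)
    return (z >= 0).astype(float)

def network(x):
    h = hard_threshold(W1 @ x + b1)
    return hard_threshold(w2 @ h + b2)

def is_sorted(x):
    return float(x[0] < x[1] < x[2] < x[3])

# Random inputs plus a few cases where some inputs are equal.
rng = np.random.default_rng(0)
tests = [rng.standard_normal(4) for _ in range(1000)]
tests += [np.array([1.0, 1.0, 2.0, 3.0]), np.array([1.0, 2.0, 2.0, 3.0])]
print(all(network(x) == is_sorted(x) for x in tests))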
[1] https://markus.teach.cs.toronto.edu/csc321-2018-01
[2] http://www.cs.toronto.edu/~rgrosse/courses/csc321_2018/syllabus.pdf
2. Backprop. Consider a neural network with N input units, N output units, and K hidden
units. The activations are computed as follows:
z = W^(1) x + b^(1)
h = σ(z)
y = x + W^(2) h + b^(2),
where σ denotes the logistic function, applied elementwise. The cost will involve both h and
y:
E = R + S
R = r^T h
S = (1/2) ‖y − s‖^2
for given vectors r and s.
• [1pt] Draw the computation graph relating x, z, h, y, R, S, and E.
• [3pts] Derive the backprop equations for computing x̄ = ∂E/∂x. You may use σ′ to
denote the derivative of the logistic function (so you don’t need to write it out explicitly).
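(Optional.) A common way to check a derived expression for ∂E/∂x is to compare it against finite differences of the cost. The sketch below implements only the forward pass given above; the function names (forward, numeric_grad_x) and the random test shapes are illustrative choices, and the comparison against your analytic gradient is left to you, since writing it out would answer the question.

import numpy as np

def sigma(z):
    # Logistic function, applied elementwise.
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2, r, s):
    # Forward pass exactly as defined in the problem.
    z = W1 @ x + b1
    h = sigma(z)
    y = x + W2 @ h + b2
    R = r @ h
    S = 0.5 * np.sum((y - s) ** 2)
    return R + S

def numeric_grad_x(x, *args, eps=1e-6):
    # Central finite differences of E with respect to x.
    g = np.zeros_like(x)
    for i in range(len(x)):
        xp, xm = x.copy(), x.copy()
        xp[i] += eps
        xm[i] -= eps
        g[i] = (forward(xp, *args) - forward(xm, *args)) / (2 * eps)
    return g

# Example with N = 4 input/output units and K = 3 hidden units.
rng = np.random.default_rng(0)
N, K = 4, 3
x = rng.standard_normal(N)
params = (rng.standard_normal((K, N)), rng.standard_normal(K),   # W1, b1
          rng.standard_normal((N, K)), rng.standard_normal(N),   # W2, b2
          rng.standard_normal(K), rng.standard_normal(N))        # r, s
print(numeric_grad_x(x, *params))  # compare with your derived dE/dx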
3. Sparsifying Activation Function. [4pts] One of the interesting features of the ReLU
activation function is that it sparsifies the activations and the derivatives, i.e. sets a large
fraction of the values to zero for any given input vector. Consider the following network:
Note that each w_i refers to the weight on a single connection, not the whole layer. Suppose
we are trying to minimize a loss function L which depends only on the activation of the
output unit y. (For instance, L could be the squared error loss (1/2)(y − t)^2.) Suppose the unit
h1 receives an input of −1 on a particular training case, so the ReLU evaluates to 0. Based
only on this information, which of the weight derivatives
∂L/∂w1, ∂L/∂w2, ∂L/∂w3
are guaranteed to be 0 for this training case? Write YES or NO for each. Justify your
answers.
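To see the sparsification claim concretely, the small NumPy snippet below (independent of the particular network in this question) shows that entries with negative pre-activations give both a zero activation and a zero local derivative; the value assigned to the derivative at exactly z = 0 is a convention.

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    # Local derivative of ReLU: 1 where z > 0, 0 where z < 0
    # (0 is used at z = 0 by convention here).
    return (z > 0).astype(float)

z = np.array([-1.0, 0.5, -2.3, 3.0, -0.1])
print(relu(z))        # negative entries are zeroed
print(relu_grad(z))   # the same entries give a zero derivative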