Exercise II
AMTH/CPSC 663
Compress your solutions into a single zip file titled <lastname and initials>_assignment2.zip, e.g. for a student named Tom Marvolo Riddle, riddletm_assignment2.zip. Include a single PDF titled <lastname and initials>_assignment2.pdf and any Python scripts specified. Any requested plots must be sufficiently labeled to receive full points.
Programming assignments should use built-in functions in Python and TensorFlow. In general, you may use the scipy stack [1]; however, the exercises are designed to emphasize the nuances of machine learning and deep learning algorithms, so if a function exists that trivially solves an entire problem, please consult with the TA before using it.
Problem 1
1. Provide a geometric interpretation of gradient descent in the one-dimensional case. (Adapted from the
Nielsen book, chapter 1)
2. An extreme version of gradient descent is to use a mini-batch size of just 1. This procedure is known as
online or incremental learning. In online learning, a neural network learns from just one training input
at a time (just as human beings do). Name one advantage and one disadvantage of online learning
compared to stochastic gradient descent with a mini-batch size of, say, 20. (Adapted from the Nielsen
book, chapter 1)
3. Create a network that classifies the MNIST data set using only 2 layers: the input layer (784 neurons) and the output layer (10 neurons). Train the network using stochastic gradient descent. What accuracy do you achieve? You can adapt the code from the Nielsen book, but make sure you understand each step used to build up the network. Please save your code as prob1.py. (Adapted from the Nielsen book, chapter 1)
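One possible starting point for this part is sketched below: a plain NumPy implementation in the spirit of the Nielsen book, with MNIST loaded through tensorflow.keras.datasets purely for convenience. The sigmoid output layer, quadratic cost, and the hyperparameters (epochs, mini-batch size, learning rate eta) are illustrative assumptions, not required choices.

# Minimal 784 -> 10 network (no hidden layer) trained with mini-batch SGD.
# MNIST is loaded via tensorflow.keras.datasets only for convenience;
# the hyperparameters below are assumptions, not prescribed values.
import numpy as np
import tensorflow as tf

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def one_hot(labels, num_classes=10):
    out = np.zeros((labels.size, num_classes))
    out[np.arange(labels.size), labels] = 1.0
    return out

# Load and flatten MNIST, scaling pixels to [0, 1].
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784) / 255.0
x_test = x_test.reshape(-1, 784) / 255.0
y_train_oh = one_hot(y_train)

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.01, size=(784, 10))   # weights
b = np.zeros(10)                            # biases

epochs, batch_size, eta = 10, 20, 3.0       # assumed hyperparameters
n = x_train.shape[0]

for epoch in range(epochs):
    perm = rng.permutation(n)
    for start in range(0, n, batch_size):
        idx = perm[start:start + batch_size]
        x, y = x_train[idx], y_train_oh[idx]
        a = sigmoid(x @ W + b)              # forward pass
        # Gradient of the quadratic cost through the sigmoid output layer.
        delta = (a - y) * a * (1 - a)
        W -= eta * (x.T @ delta) / len(idx)
        b -= eta * delta.mean(axis=0)
    preds = np.argmax(sigmoid(x_test @ W + b), axis=1)
    print(f"epoch {epoch}: test accuracy = {(preds == y_test).mean():.4f}")

You should still work through each line and report the accuracy you obtain; swapping in a different cost or learning rate is a natural experiment.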
Problem 2
1. Alternate presentation of the equations of backpropagation (Nielsen book, chapter 2)
Show that $\delta^L = \nabla_a C \odot \sigma'(z^L)$ can be written as $\delta^L = \Sigma'(z^L)\nabla_a C$, where $\Sigma'(z^L)$ is a square matrix whose diagonal entries are the values $\sigma'(z^L_j)$ and whose off-diagonal entries are zero.
2. Show that $\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$ can be rewritten as $\delta^l = \Sigma'(z^l)(w^{l+1})^T \delta^{l+1}$.
3. By combining the results from problems 2.1 and 2.2, show that $\delta^l = \Sigma'(z^l)(w^{l+1})^T \cdots \Sigma'(z^{L-1})(w^L)^T \Sigma'(z^L)\nabla_a C$.
4. Backpropagation with linear neurons (Nielsen book, chapter 2)
Suppose we replace the usual non-linear σ function (sigmoid) with σ(z) = z throughout the network.
Rewrite the backpropagation algorithm for this case.
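For reference, the four backpropagation equations from chapter 2 of the Nielsen book, which this question asks you to rewrite for the linear activation $\sigma(z) = z$, are:

$\delta^L = \nabla_a C \odot \sigma'(z^L)$
$\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$
$\partial C / \partial b^l_j = \delta^l_j$
$\partial C / \partial w^l_{jk} = a^{l-1}_k \delta^l_j$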
Figure 1: Simple neural network with initial weights and biases.
Problem 3
1. It can be difficult at first to remember the respective roles of the $y$s and the $a$s for cross-entropy. It’s easy to get confused about whether the right form is $-[y \ln a + (1-y)\ln(1-a)]$ or $-[a \ln y + (1-a)\ln(1-y)]$. What happens to the second of these expressions when $y = 0$ or $1$? Does this problem afflict the first expression? Why or why not? (Nielsen book, chapter 3)
2. Show that the cross-entropy is still minimized when $\sigma(z) = y$ for all training inputs (i.e. even when $y \in (0, 1)$). When this is the case, the cross-entropy has the value $C = -\frac{1}{n}\sum_x [y \ln y + (1-y)\ln(1-y)]$. (Nielsen book, chapter 3)
3. Given the network in Figure 1, calculate the derivatives of the cost with respect to the weights and the biases, and the backpropagation error equations (i.e. $\delta^l$ for each layer $l$), for the first iteration using the cross-entropy cost function. Initial weights are colored in red, initial biases are colored in orange, and the training inputs and desired outputs are in blue. This problem aims to optimize the weights and biases through backpropagation so that the network outputs the desired results. More specifically, given inputs 0.05 and 0.10, the neural network is supposed to output 0.01 and 0.99 after many iterations.
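The actual weight and bias values appear only in Figure 1, so the sketch below uses hypothetical placeholder values; it is meant only to show one way to organize the forward pass, the $\delta^l$ computations, and the first-iteration gradients for a small fully connected sigmoid network with the cross-entropy cost. Replace the placeholder arrays with the red (weight) and orange (bias) values from the figure, and adjust the shapes if the figure's architecture differs from the 2-2-2 layout assumed here.

# Numeric backprop check for a small sigmoid network with cross-entropy cost.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.05, 0.10])   # training inputs (given in the problem)
y = np.array([0.01, 0.99])   # desired outputs (given in the problem)

# HYPOTHETICAL placeholder parameters -- substitute the red (weight) and
# orange (bias) values from Figure 1. W[j, k] is the weight from neuron k
# in the previous layer to neuron j in the current layer.
W1 = np.array([[0.1, 0.2],
               [0.3, 0.4]])
b1 = np.array([0.5, 0.5])
W2 = np.array([[0.5, 0.6],
               [0.7, 0.8]])
b2 = np.array([0.5, 0.5])

# Forward pass.
z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
a2 = sigmoid(z2)

# Backward pass with C = -sum_j [y_j ln a_j + (1 - y_j) ln(1 - a_j)]:
# for sigmoid outputs the sigma'(z) factor cancels, so delta^L = a^L - y.
delta2 = a2 - y
delta1 = (W2.T @ delta2) * a1 * (1 - a1)

# Gradients for the first iteration.
dC_dW2 = np.outer(delta2, a1)   # dC/dw^L_{jk} = a^{L-1}_k delta^L_j
dC_db2 = delta2
dC_dW1 = np.outer(delta1, x)
dC_db1 = delta1

print("delta^2 =", delta2)
print("delta^1 =", delta1)
print("dC/dW2 =\n", dC_dW2, "\ndC/db2 =", dC_db2)
print("dC/dW1 =\n", dC_dW1, "\ndC/db1 =", dC_db1)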
Problem 4
1. Download the Python template prob4_1.py and read through the code, which implements a neural network on the MNIST data with TensorFlow. Implement the TODO part to define the loss and optimizer. Compare the squared loss, the cross-entropy loss, and softmax with the log-likelihood loss. Plot the training cost and the test accuracy vs. epoch for each loss function (in two separate plots). Which loss function converges fastest? (One possible way to set up these losses is sketched after this problem.)
2. Based on problem 4.1, add regularization to the previous network. Implement L1 regularization, L2 regularization, and dropout, each separately. Compare the accuracy and report the final regularization parameters you used (for dropout, report the probability parameter). Are the final results sensitive to each parameter? Please save your code as prob4_2.py. You may want to check out the following link for regularization:
https://www.tensorflow.org/versions/r0.12/api_docs/python/contrib.layers/regularizers
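Because the template prob4_1.py is not reproduced here, the sketch below only illustrates how the three losses in problem 4.1 and the regularizers and dropout in problem 4.2 might be written, assuming the network exposes a pre-activation logits tensor with one-hot labels, and assuming a TensorFlow 2.x / Keras API rather than the r0.12 contrib API at the link above. The layer sizes, learning rate, regularization strengths, and drop probability are placeholders to be tuned.

# Sketch of loss definitions (Problem 4.1) and regularization (Problem 4.2)
# using TensorFlow 2.x / Keras. `labels` are assumed to be one-hot and
# `logits` the raw outputs of the final layer; all hyperparameters are
# placeholders.
import tensorflow as tf

# --- Problem 4.1: three candidate losses ---------------------------------
def squared_loss(labels, logits):
    # Quadratic cost on sigmoid activations.
    return tf.reduce_mean(tf.square(tf.sigmoid(logits) - labels))

def cross_entropy_loss(labels, logits):
    # Sigmoid cross-entropy, averaged over the batch.
    return tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits))

def softmax_log_likelihood_loss(labels, logits):
    # Softmax output layer with the (negative) log-likelihood cost.
    return tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits))

optimizer = tf.keras.optimizers.SGD(learning_rate=0.5)  # placeholder rate

# --- Problem 4.2: L1 / L2 regularization and dropout ----------------------
l2_model = tf.keras.Sequential([
    tf.keras.layers.Dense(
        30, activation="sigmoid",
        kernel_regularizer=tf.keras.regularizers.l2(1e-4)),  # placeholder lambda
    tf.keras.layers.Dense(
        10, kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
])

l1_model = tf.keras.Sequential([
    tf.keras.layers.Dense(
        30, activation="sigmoid",
        kernel_regularizer=tf.keras.regularizers.l1(1e-5)),  # placeholder lambda
    tf.keras.layers.Dense(10),
])

dropout_model = tf.keras.Sequential([
    tf.keras.layers.Dense(30, activation="sigmoid"),
    tf.keras.layers.Dropout(0.5),   # placeholder rate = fraction of units dropped
    tf.keras.layers.Dense(10),
])

Whichever API version the template uses, the comparison is the same: changing the loss changes only the quantity being minimized, while the regularizers add a penalty on the weights and dropout randomly disables units during training.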
References
[1] “The scipy stack specification.” [Online]. Available: https://www.scipy.org/stackspec.html