CPSC 470 – Artificial Intelligence
Problem Set #6 – Deep Neural Network
35 points
In this exercise, you will implement part of a deep neural network and apply it to the task of
hand-written digit recognition. This assignment is adapted from Andrew Ng’s machine learning
class on Coursera.
We encourage you to type your answers directly on this handout. Please put answers only in
the designated areas, as we will ignore anything outside those designated areas.
Linear Algebra and Numpy
Neural networks rely upon linear algebra. For the purpose of this assignment, you don't need to
know a lot about matrices, as most of the code has been implemented for you. However, if you
find it challenging, here are some basics of linear algebra that may help:
https://minireference.com/static/tutorials/linear_algebra_in_4_pages.pdf
https://math.boisestate.edu/~wright/courses/m365/LAIntroSlides.pdf
To represent matrices in Python, we will use the numpy library. Most of the code is already
implemented, but you may find the documentation for numpy (http://www.numpy.org/) or these
other references useful:
http://cs231n.github.io/python-numpy-tutorial/#numpy
https://www.numpy.org/devdocs/user/quickstart.html
All of the libraries needed for this assignment have been installed on zoo. As before, we
unfortunately won't be able to help with library installation problems on personal machines.
Dataset
The dataset you will use is taken and modified from the MNIST digit dataset
(http://yann.lecun.com/exdb/mnist/). The dataset consists of 5000 handwritten digit images and
the corresponding labels. Each image is 20 by 20 pixels. Each pixel is represented by a
floating point number indicating the grayscale intensity at that location. The figure below shows
some examples from the dataset:
Neural Networks
We will use a typical model of an individual neuron:
A neuron computes a weighted sum of its inputs, z = w1 * x1 + w2 * x2 + w3 * x3, where w1,
w2, and w3 are the corresponding weights. The output is a = g(z), where g is a non-linear
activation function. In this assignment, we use the sigmoid function, which is defined by:
g(z) = 1 / (1 + e^(-z))
A sigmoid function looks like this:
In the basic sigmoid shown above, the transition point is at x=0. To shift the threshold to the left
or to the right, we can add an additional term into the sum. (For example, if we add -4 into the
sum, this is the same as having the sigmoid function transition at x=4.). This shift is called a
bias. To simplify the computation of the bias term, we will add one additional neuron to each
layer (except the output layer) and always assign an input value of 1 to that neuron. The weight
on that constant input thus determines the bias and can be adjusted automatically by any
algorithm that adjusts the network weights.
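To make this concrete, here is a minimal numpy sketch of a single neuron with the bias handled
as a fixed +1 input (the input and weight values below are made up purely for illustration):

    import numpy as np

    def sigmoid(z):
        # g(z) = 1 / (1 + e^(-z))
        return 1.0 / (1.0 + np.exp(-z))

    # Three "real" inputs plus a constant 1 whose weight plays the role of the bias.
    x = np.array([0.5, -1.2, 3.0, 1.0])   # last entry is the fixed +1 input
    w = np.array([0.8, 0.1, -0.4, -4.0])  # last weight acts as the bias
    z = np.dot(w, x)                      # weighted sum
    a = sigmoid(z)                        # neuron output
    print(z, a)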
The neural network is composed of a connected set of individual neurons. There are many units
in each layer, and there are multiple layers. In this assignment, we will first use a 3-layer neural
net: an input layer, a hidden layer and an output layer. The input layer will have one input
neuron for each pixel in the image, plus one additional neuron for the bias term, giving a total of
401 input layer neurons.
The training data will be loaded into the variables train_x and train_y by the function
load_data(training_percentage). We will soon vary the training_percentage, but for now let’s
leave it as 1 (using the entire data set for training). The variable train_x contains 5000 vectorized
input images, and the variable train_y stores the corresponding correct output labels (such as 6,
1, 2, etc.) To visualize a sample from the training examples, you can use the function
display_digit_image(…). One example output from this display function is shown below:
We will first use a hidden layer with 25 neurons, and an output layer of 10 neurons.
Please answer the following question:
Q1. Why are there 10 units in the output layer? Please choose one (1 point):
A. Because 10 can be divided by 400 * 25;
B. Because there are 10 labels;
C. The number 10 is generated randomly, and it can be any number here;
D. Because 9 is a perfect square and then we add one for the bias neuron.
Answer: B
The first version of our neural network will look like this:
The vectorized image will be provided to the input layer, with each pixel corresponding to a
single input neuron. The output of the first layer is simply these pixel values (along with the
fixed +1 bias neuron). These values will be weighted and then passed as inputs to the hidden
layer. The hidden layer will compute the weighted sum, pass the values through a sigmoid
activation function, and then pass along its output to the third layer. These are again weighted,
summed, and then passed through an activation function. The output layer neuron with the
largest output value will be considered the winner and the label representing that node will be the
image’s label. (This is the feed-forward process which computes the output of the system, given
an input image.)
Weights are initialized randomly, so our initial output will likely be very different from the true
label. A cost function measures how different the predicted label is from the true label. The
backpropagation algorithm (as presented in class) is then used to iteratively
update the weights in the network to (hopefully) bring the actual output of the network to be
closer to the desired output.
Most of the code has already been implemented. Your task is to complete the deep_NN
function by choosing the correct functions. You will implement about 4 lines of code.
Feedforward
The function initialize_parameters returns a dictionary parameters with the keys "W1", "b1",
"W2" and "b2". "W*" is a weight matrix and "b*" is a bias vector; "1" refers to the parameters
from the input layer to the hidden layer, and "2" to the parameters from the hidden layer to the
output layer. By examining the output of the initialize_parameters function, please fill in the
dimensions of the parameters in the table below. (You may need to write a few lines of code to
call initialize_parameters; however, you don't need to submit any code you wrote for this part.
You can use the np.shape() function.)
For example, for the array which has 2 rows and 3 columns:
    1 2 3
    4 5 6
it will be initialized as a = np.array([[1, 2, 3], [4, 5, 6]]), and a.shape is (2, 3). You would then
fill in its row of the table as:
    a:  2 × 3
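If it helps, here is one way to read off the shapes for Q2 below. This is only a sketch: it assumes
initialize_parameters takes the layer sizes used in this handout (400, 25, 10) and is imported
from the provided assignment file (the module name here is hypothetical); check the actual
signature in the code you were given.

    import numpy as np
    from ps6 import initialize_parameters   # hypothetical module name; use the provided file

    parameters = initialize_parameters([400, 25, 10])  # assumed signature
    for name in ("W1", "b1", "W2", "b2"):
        print(name, np.shape(parameters[name]))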
Q2. Please fill in the dimensions (2 points):
    W1:  25 × 400
    b1:  25 × 1
    W2:  10 × 25
    b2:  10 × 1
Now let's examine the dimensions of each layer. The feedforward function returns a dictionary
of caches with the keys "a1", "z2", "a2", "z3" and "a3". "z*" represents the weighted sum at the
corresponding layer, and "a*" represents the value of sigmoid("z*") at that layer. "1" refers to
the input layer, "2" to the hidden layer, and "3" to the output layer. Please note that the overall
output of the neural net is also referred to as AL/al, which should have the same value as a3.
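As a reference point, here is a sketch of what the feedforward pass computes, using the naming
in this handout (a1/z2/a2/z3/a3). It only illustrates the math; the provided feedforward function
may organize its code differently.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def feedforward_sketch(X, parameters):
        # X has shape (400, m): one column per vectorized image.
        W1, b1 = parameters["W1"], parameters["b1"]   # (25, 400), (25, 1)
        W2, b2 = parameters["W2"], parameters["b2"]   # (10, 25),  (10, 1)
        a1 = X                      # the input layer passes the pixel values through
        z2 = W1 @ a1 + b1           # (25, m) weighted sums at the hidden layer
        a2 = sigmoid(z2)            # (25, m) hidden-layer activations
        z3 = W2 @ a2 + b2           # (10, m) weighted sums at the output layer
        a3 = sigmoid(z3)            # (10, m) output activations, also called AL/al
        return {"a1": a1, "z2": z2, "a2": a2, "z3": z3, "a3": a3}

Note how the shapes in the comments line up with the table in Q4 below.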
Please answer the following questions
Q3. Why is there no "z1"? Please choose one (1 point):
A. Because layer 1 is the input layer, the output of layer 1 is the pixel values themselves;
B. Because the TA made a mistake; there should be a "z1";
C. "z1" should exist, but it is just not used in the calculation later, so this value is excluded.
Answer: A
Q4. Please fill in the dimensions below (4 points):
    X (input, which is train_x):                       400 × 5000
    y (true label, which is train_y):                  1 × 5000
    Y (converted label, which is reshape_Y(train_y)):  10 × 5000
    a1:                                                400 × 5000
    z2:                                                25 × 5000
    a2:                                                25 × 5000
    z3:                                                10 × 5000
    a3:                                                10 × 5000
Here you can see that the dimensions of Y and y are different. This is because the raw data
provides labels such as 1, 2, 3, 4, while the output of the neural net is a vector of length 10, with
index 0 corresponding to the probability of digit 0, index 1 corresponding to the probability of
digit 1, etc. (We will describe column vectors using ".T" to indicate the transpose.)
Please answer the following questions.
Q5. Given a column of Y: [0, 0, 1, 0, 0, 0, 0, 0, 0, 0].T. From 0 to 9, which digit does this
label correspond to (1 point)?
Answer: 2
Q6. Given a column of a3 from a trained neural network with accuracy 99.4%: [0.02, 0.06,
0.1, 0.0004, 0.9, 0.3, 0.1, 0.025, 0.062, 0.12].T. From 0 to 9, which digit does this image
most likely represent (1 point)?
Answer: 4
That’s all for feedforward. You don’t need to implement any part of the feedforward algorithm.
We won’t go into the details in this assignment, but if you are curious about how the difference
between the prediction and the true label is calculated, here is the formula:
J(w, b) = (1/m) * sum(-Y log(a3) - (1 - Y) log(1 - a3)),
where m is the number of samples.
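For the curious, this cost can be computed in essentially one numpy line. A minimal sketch,
assuming a3 and Y are the (10, m) arrays described above:

    import numpy as np

    def compute_cost(a3, Y):
        m = Y.shape[1]   # number of samples
        return np.sum(-Y * np.log(a3) - (1 - Y) * np.log(1 - a3)) / m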
Backpropagation
As we described in lecture, the backpropagation algorithm iteratively adjusts the weights in the
network so that the output of the network is a closer match to the desired output. We will start at
the output layer and move backward through the network to assign “blame” and change the
weights to better match our data. We denote the error term at each layer as dA*, where * is the
layer number, and the corresponding weight updates as dW* and bias updates as db*.
At the output layer, dA3 is calculated as:
    dA3 = a3 - Y
At the hidden layer, dA2, dW2 and db2 are calculated as (* is matrix multiplication, and .* is
element-wise multiplication):
    dA2 = (W2.T * dA3) .* sigmoid_gradient(z2)
    dW2 = dA3 * a2.T / m
    db2 = dA3 * (vector of 1s with length a2.shape[1]) / m
For the first layer, dW1 and db1 are calculated as:
    dW1 = dA2 * a1.T / m
    db1 = dA2 * (vector of 1s with length a1.shape[1]) / m
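Putting these formulas together, a straight transcription into numpy could look like the sketch
below. It follows the handout's naming; the backpropagation function in the provided code may
split this work across helper functions, so treat it only as a guide to the math.

    import numpy as np

    def sigmoid_gradient(z):
        s = 1.0 / (1.0 + np.exp(-z))
        return s * (1 - s)

    def backprop_sketch(caches, parameters, Y):
        a1, z2, a2, a3 = caches["a1"], caches["z2"], caches["a2"], caches["a3"]
        W2 = parameters["W2"]
        m = Y.shape[1]

        dA3 = a3 - Y                                  # (10, m)
        dW2 = dA3 @ a2.T / m                          # (10, 25)
        db2 = dA3 @ np.ones((a2.shape[1], 1)) / m     # (10, 1)
        dA2 = (W2.T @ dA3) * sigmoid_gradient(z2)     # (25, m)
        dW1 = dA2 @ a1.T / m                          # (25, 400)
        db1 = dA2 @ np.ones((a1.shape[1], 1)) / m     # (25, 1)
        return {"dW1": dW1, "db1": db1, "dW2": dW2, "db2": db2}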
Please answer the following question.
Q7. Why is there no dA1? Please choose one (1 point):
A. dA1 should exist, but it is just not used in the calculation later, so this value is excluded;
B. If we were to calculate dA1, it would be the difference between the desired value and the
actual value at the input layer. The actual values are the image pixels, and we cannot modify
the input, so there is no need to compute dA1;
C. Because the TA made a mistake; there should be a dA1.
Answer: B
The formulas provided above are specific to a 3-layer neural network. Please think about how to
generalize them to an n-layer neural network and answer the following questions.
Q8. For the output layer of an n-layer neural network, dAn is calculated the same way,
which is (1 point):
A. dAn = Y + an
B. dAn = Y - an
C. dAn = an - Y
Answer: C
Q9. For a hidden layer i of an n-layer neural network, dAi is calculated as (1 point):
A. dAi = (Wi.T * dA(i+1)) .* sigmoid_gradient(zi)
B. dAi = (W(i+1).T * dA(i+1)) .* sigmoid_gradient(z(i+1))
C. dAi = (Wi.T * dA(i)) .* sigmoid_gradient(zi)
Answer: A
Q10. For any layer other than the output layer of an n-layer neural network, dWi is
calculated as (1 point):
A. dWi = dAi * a(i+1).T / m
B. dWi = dA(i+1) * ai.T / m
C. dWi = dA(i+1) * ai.T
Answer: B
Q11. For any layer other than the output layer of an n-layer neural network, dbi is
calculated as (1 point):
A. dbi = dAi * (vector of 1s with the length of ai.shape[1]) / m
B. dbi = dA(i+2) * (vector of 1s with the length of ai.shape[1]) / m
C. dbi = dA(i+1) * (vector of 1s with the length of ai.shape[1]) / m
Answer: C
Now that you understand how the weights are updated, please implement the
corresponding part in the function backpropagation. You will write about 5 lines of code.
Now let's take a look at how the weights and biases are updated. The differences dWi and dbi
could be used to update these terms directly:
    Wi = Wi - dWi
    bi = bi - dbi
However, we would like to have some control over how the parameters are updated, so we add a
scalar coefficient called the learning rate. As the name indicates, the learning rate describes how
quickly the model learns, or updates itself. A higher learning rate will learn faster, but will also
be more sensitive to noise in the inputs and desired outputs. Our final update rules are:
    Wi = Wi - learning_rate * dWi
    bi = bi - learning_rate * dbi
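In numpy this update is a one-liner per parameter. A minimal sketch for the 3-layer case (the
provided update_parameters function may loop over the layers differently, and the dictionary
key names here follow the handout's convention):

    def update_sketch(parameters, grads, learning_rate):
        # One gradient-descent step using the update rules above.
        for i in (1, 2):
            parameters["W%d" % i] -= learning_rate * grads["dW%d" % i]
            parameters["b%d" % i] -= learning_rate * grads["db%d" % i]
        return parameters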
Using the formula above, please complete the update_parameters function. You will
implement about 3 lines of code.
Train the neural network
Congratulations! Now you have a completed fully-connected neural network. Let’s train the
neural network by running the existing code in the main section under “section 1”. Please do not
modify any of the hyper-parameters here like training_percentage, learning_rate,
num_iterations, etc. It will take a few minutes to run and you will see the cost of every iteration
printed to screen. At the end, it will print the accuracy of this model on the training set and a
figure with the cost of each iteration.
Q12. Please fill in the accuracy (4 points):
Answer: 0.9572
Q13. Please copy and paste your figure below (2 points)
Training Set and Test Set
Currently the entire dataset is used to train the model. To evaluate a neural network, we usually
separate the dataset into two parts, with the majority of the data allocated as the training set and
the rest as the test set. This ensures the test set shares some commonality with the training set
(e.g., samples collected under similar conditions and guidelines) while ensuring that the test
samples have never been used in training.
To separate the dataset into a training set and a test set, we will change the training_percentage
to 0.8, meaning 80% of the dataset is randomly selected as the training set and the remaining
20% makes up the test set. Please comment out section 1, uncomment section 2, and run the
code. You will see that the training set accuracy is 0.959 and the testing set accuracy is 0.932.
Please note that if you hard-coded the total sample size as 5000 anywhere in the code, you
might not get the correct result; if so, please go back and fix it.
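For intuition, a random 80/20 split like the one load_data performs can be written in a few lines
of numpy. This is only an illustrative sketch (the provided load_data already does this for you),
and it assumes the column-per-sample layout described earlier:

    import numpy as np

    def split_data(X, y, training_percentage=0.8, seed=0):
        # X: (400, n) images, y: (1, n) labels; columns are shuffled and split.
        n = X.shape[1]
        np.random.seed(seed)
        order = np.random.permutation(n)
        cut = int(training_percentage * n)
        train_idx, test_idx = order[:cut], order[cut:]
        return X[:, train_idx], y[:, train_idx], X[:, test_idx], y[:, test_idx]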
Q14. Why is the testing set accuracy lower than the training set accuracy? Is it by chance?
(2 points)
Answer: No, it is not by chance. The parameters of the NN are updated using only the training
set, i.e., they are optimized to minimize the error between the predictions and the true labels of
the training set, not the testing set. Since the NN has not seen the testing set during training, the
testing set accuracy is lower than the training set accuracy.
Learning Rate
Now let's play around with the learning rate and set it to a large number, 10. In theory, this
neural net should learn very fast. To see whether this is true, please comment out sections 1 and
2, uncomment section 3, and run the code. You will see that the cost stays at 3.250830 and the
accuracy is 0.0934, which is at chance level. This is actually not surprising; let's take a look
at why.
In backpropagation, we attempt to find the weights that minimize the cost. Imagine for a moment
that our cost function was a quadratic curve where the x-axis represents one of our weights and
the y-axis represents the associated cost function for that particular weight:
Backpropagation uses a process called gradient descent to find the x-value that results in the
smallest y-value. Suppose the black dot is where we are; gradient descent attempts to move
down the slope (in the direction of the negative gradient) toward the nearest minimum of the
function. If our learning rate is small, we take small steps down toward the minimum. If we
slightly increase our step size, we will get to the minimum faster. However, if the steps we take
are too large, we might "overshoot" the minimum and end up with a value that produces worse
and worse costs as the function "explodes":
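As a toy illustration of this overshoot (not part of the assignment code), consider gradient
descent on f(x) = x^2, whose gradient is 2x. With a small step size the iterate shrinks toward the
minimum at 0; with a large step size each update flips the sign and grows the value:

    def descend(learning_rate, steps=10, x=1.0):
        for _ in range(steps):
            x = x - learning_rate * 2 * x   # step against the gradient of x^2
        return x

    print(descend(0.1))   # shrinks toward the minimum at x = 0
    print(descend(1.5))   # overshoots every step; x grows as (-2)^steps and "explodes"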
This is what happens when we set the learning rate to the thoroughly unreasonable value of 10.
However, unlike the situation above, the cost we obtained stayed at 3.25 and did not "explode"
like the diagram suggests.
Q15. When we set the learning rate to 10, why did the cost stay at 3.25 instead of increasing
much more significantly? (2 points)
Answer: The learning rate is large enough to overshoot the minimum, but not large enough to
make the cost explode as in the diagram above. In fact, if we set the learning rate to, e.g., 100,
300, or 500, we will see that the cost does explode (and oscillates). A learning rate of 10 is
somewhere in between: the updates to the parameters do not decrease the cost, which in turn
prevents the parameters from being updated further, so the cost neither decreases nor oscillates.
Another possible explanation is that the NN gets stuck in a local minimum: a learning rate of 10
is enough for the NN to get into the local minimum, but not enough for it to get out of it.
Number of Iterations and More layers
Now let’s change the number of iterations to a large number, 30000, and train a neural network
with 4 layers. Please comment out sections 1, 2 and 3, uncomment section 4, and run the code.
It will take longer to complete than the previous few sections, so be patient! You will see that
the training set accuracy has improved to 1.0, but the test set accuracy is worse than before
(now 0.931). This is called overfitting.
Q16. Based on your reading, explain why overfitting happens. (2 points)
Answer: Overfitting happens when the model (the NN) is excessively complicated (e.g., it has
too many parameters relative to the size of the training set) and therefore learns the noise of the
training set rather than the underlying distribution. From the perspective of the bias-variance
tradeoff, overfitting corresponds to low bias and high variance.
Your own experiment with the hyperparameters
You may have more questions about how the hyperparameters influence the neural network, and
now is the time for you to experiment with the hyperparameters on your own. Please comment
out sections 1 to 4, and uncomment section 5. Please feel free to experiment with
training_percentage, learning_rate, layers_dims (which defines the architecture of the neural
network), and num_iterations. Please document your experiment below.
Q17. What is the purpose of your experiment, that is, what question are you trying to
address? (2 points)
Answer: To experiment with the learning rate and see how it affects (a) convergence of the NN,
(b) the time needed for the NN to converge, and (c) the training and testing accuracy.
Q18. What are the values of the hyperparameters in your experiment? (2 points)
    training_percentage:  0.8
    learning_rate:        0.01, 0.1, 0.5, 5, 10
    layers_dims:          [400, 25, 10]
    num_iterations:       2000
Q19. Please describe the empirical results of your experiment. Only list empirical facts, not
your conclusions or explanation… that comes next. (2 points)
Answer:
    learning_rate = 0.01:  training accuracy = 0.107,   test accuracy = 0.072
    learning_rate = 0.1:   training accuracy = 0.88275, test accuracy = 0.892
    learning_rate = 0.5:   training accuracy = 0.959,   test accuracy = 0.932
    learning_rate = 5:     training accuracy = 0.107,   test accuracy = 0.072
    learning_rate = 10:    training accuracy = 0.107,   test accuracy = 0.072
Q20. What conclusion do you draw from these results? Why do these results happen, and
are there any limitations to your results? (2 points)
Answer: When the learning rate is too small (0.01), the parameters of the NN are barely
updated, so the NN learns essentially nothing; both the training and testing accuracies are poor.
Increasing the learning rate to 0.1 allows the parameters to be updated meaningfully, improving
both accuracies; however, with 2000 iterations the accuracies are not as good as with
learning_rate = 0.5, because a small learning rate makes convergence slower. When the
learning rate is moderate (0.5), the parameters converge to a local minimum quickly within
2000 iterations, so both the training and testing accuracies are high. When the learning rate is
large (5 or 10), the "overshoot" problem occurs, the NN again learns essentially nothing, and
both accuracies are poor.
One limitation of the experiment is that, for a given number of iterations (e.g., 2000), learning
rates within a certain range are equally effective, in the sense that they all lead to convergence
of the NN and good training and testing accuracy, even though some learning rates let the NN
converge faster than others. In situations where the speed of convergence matters, this
experiment is not effective for finding the optimal (or nearly optimal) learning rate; a grid
search over the learning rate is better suited to such a scenario.
Submission
We encourage you to type your answers directly on this handout. Please put answers only in the
designated areas, as we will ignore anything outside those designated areas.
Also, please do not alter the format of this file; otherwise we may not be able to find your
answers in the right place and you may lose points.
Please submit this file to Problem Set 6 on canvas.
And please submit your code to Problem Set 6 - Programming.