CSE474/574 Introduction to Machine Learning
Programming Assignment 1
Handwritten Digits Classification
1 Introduction
In this assignment, your task is to implement a Multilayer Perceptron Neural Network and evaluate its
performance in classifying handwritten digits. For CSE574 students only: You will use the same network to
analyze a more challenging face dataset and compare the performance of the neural network against a deep
neural network using the TensorFlow library.
After completing this assignment, you should be able to understand:
• How a Neural Network works, and how to use Feed Forward and Back Propagation to implement one.
• How to set up a Machine Learning experiment on real data.
• How regularization plays a role in the bias-variance tradeoff.
• For CSE574 students only: How to use the TensorFlow library to deploy deep neural networks, and how having multiple hidden layers can improve the performance of the neural network.
To get started with the exercise, you will need to download the supporting files and unzip their contents into
the directory where you want to complete this assignment.
Warning: In this project, you will have to handle many compute-intensive tasks such as training
a neural network. Our suggestion is to use the CSE server Metallica (this server is dedicated to intensive
computing tasks) and the CSE server Springsteen (this boss server is dedicated to running TensorFlow) to
run your computation. YOU MUST USE PYTHON 3 FOR IMPLEMENTATION. In addition, training
on such a big dataset will take a very long time, possibly many hours or even days to complete. Therefore,
we suggest that you start this project as soon as possible so that the computer has time to do the heavy
computational jobs.
1.1 Files included in this exercise
• mnist_all.mat: the original dataset from MNIST. In this file, there are 10 matrices for the testing set and 10
matrices for the training set, corresponding to the 10 digits. You will have to split the training data
into training and validation data.
• face_all.pickle: a sample of face images from the CelebA data set. In this file there is one data matrix
and one corresponding labels vector. The preprocess routines in the script files will split the data into
training and testing data.
• nnScript.py: Python script for this programming project. Contains function definitions -
– preprocess(): performs some preprocessing tasks and outputs the preprocessed train, validation, and
test data with their corresponding labels. You need to make changes to this function.
– sigmoid(): computes the sigmoid function. The input can be a scalar value, a vector, or a matrix. You
need to make changes to this function.
– nnObjFunction(): computes the error function of the Neural Network. You need to make changes to
this function.
– nnPredict(): predicts the labels of data given the parameters of the Neural Network. You need to make
changes to this function.
– initializeWeights(): returns random initial weights for the Neural Network given the number of units in
the input layer and output layer.
• facennScript.py: Python script for running your neural network implementation on the CelebA dataset.
This script will call your implementations of the functions sigmoid(), nnObjFunction() and nnPredict(),
which you will have to copy from your nnScript.py (For CSE574 students only). You need to make
changes to this file.
• deepnnScript.py: Python script that calls the TensorFlow library to run the deep neural network
(For CSE574 students only). You need to make changes to this file.
1.2 Datasets
Two data sets will be provided. Both consist of images. See the notebook available here - http://nbviewer.
jupyter.org/github/ubdsgroup/ubmlcourse/blob/master/notebooks/ProgrammingAssignment1.ipynb,
for pointers about how to handle the data.
1.2.1 MNIST Dataset
The MNIST dataset [1] consists of a training set of 60,000 examples and a test set of 10,000 examples. All
digits have been size-normalized and centered in a fixed-size image of 28 × 28 pixels. In the original dataset, each pixel
in the image is represented by an integer between 0 and 255, where 0 is black, 255 is white, and anything
in between is a different shade of gray.
You will need to split the training set of 60,000 examples into two sets. The first set, of 50,000 randomly sampled
examples, will be used for training the neural network. The remaining 10,000 examples will be used as a
validation set to estimate the hyper-parameters of the network (regularization constant λ, number of hidden
units).
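A minimal numpy sketch of one way to do this split (assuming the stacked training examples are in a matrix train_data, one example per row, with a matching label vector train_label; these names are illustrative, not part of the provided script):

    import numpy as np

    def split_train_validation(train_data, train_label, n_train=50000):
        # Randomly permute the 60000 examples, then take the first 50000
        # for training and the remaining 10000 for validation.
        perm = np.random.permutation(train_data.shape[0])
        train_idx, valid_idx = perm[:n_train], perm[n_train:]
        return (train_data[train_idx], train_label[train_idx],
                train_data[valid_idx], train_label[valid_idx])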
1.2.2 CelebFaces Attributes Dataset (CelebA)
CelebFaces Attributes Dataset (CelebA) [3] is a large-scale face attributes dataset with more than 200K
celebrity images. CelebA has large diversities, large quantities, and rich annotations, including:
• 10,177 identities,
• 202,599 face images, and
• 5 landmark locations and 40 binary attribute annotations per image.
For this programming assignment, we have provided a subset of the images. The subset consists of
data for 26,407 face images, split into two classes. One class will be images in which the individual is wearing
glasses, and the other class will be images in which the individual is not wearing glasses. Each image is a
54 × 44 matrix, flattened into a vector of length 2376.
2 Your tasks
• Implement Neural Network (forward pass and back propagation)
• Incorporate regularization on the weights (λ)
Figure 1: Neural network
• Use the validation set to tune hyper-parameters for the Neural Network (number of units in the hidden layer
and λ).
• For CSE574 students only: Run the deep neural network code we provide and compare the results
with the normal neural network. The code will be released by Feb 20th, 2017.
• Write a report to explain the experimental results.
3 Some practical tips in implementation
3.1 Feature selection
In the dataset, one can observe that there are many features whose values are exactly the same for all data
points in the training set. With those features, the classification models cannot gain any more information
about the difference (or variation) between data points. Therefore, we can ignore those features in the
pre-processing step.
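A minimal numpy sketch of one way to drop such constant features (variable names are illustrative; the same column selection must be applied to the training, validation, and test splits so they share an identical feature set):

    import numpy as np

    # train_data: n x d matrix, one example per row.
    # Keep only the columns whose values vary across the training set.
    varying = np.where(train_data.max(axis=0) != train_data.min(axis=0))[0]
    train_data = train_data[:, varying]
    validation_data = validation_data[:, varying]
    test_data = test_data[:, varying]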
Later on in this course, you will learn more sophisticated models to reduce the dimension of the dataset (but
not for this assignment).
3.2 Neural Network
3.2.1 Neural Network Representation
A neural network can be graphically represented as in Figure 1.
As observed in Figure 1, there are 3 layers in total in the neural network:
• The first layer comprises (d + 1) units, each representing a feature of the image (there is one extra unit
representing the bias).
• The second layer of the neural network is the hidden layer. In this document, we denote by m the
number of hidden units; with the additional bias node at the hidden layer, there are m + 1 units in total.
Hidden units can be considered as learned features extracted from the original data set.
Since the number of hidden units determines the dimension of the learned features, it is
our choice to pick an appropriate number of hidden units. Too many hidden units may lead to a
slow training phase, while too few hidden units may cause under-fitting.
• The third layer is also called the output layer. The value of the l-th unit in the output layer represents
the probability that a certain hand-written image belongs to digit l. Since we have 10 possible digits,
there are 10 units in the output layer. In this document, we denote by k the number of output units
in the output layer.
The parameters of the Neural Network model are the weights associated with the hidden layer units and the
output layer units. In our standard Neural Network with 3 layers (input, hidden, output), we use 2 matrices
to represent the model parameters:
• W(1) ∈ R^{m×(d+1)} is the weight matrix of connections from the input layer to the hidden layer. Each row of
this matrix corresponds to the weight vector of one hidden layer unit.
• W(2) ∈ R^{k×(m+1)} is the weight matrix of connections from the hidden layer to the output layer. Each row of
this matrix corresponds to the weight vector of one output layer unit.
We further assume that there are n training samples when performing the learning task of the Neural Network.
In the next section, we explain how to perform learning in the Neural Network.
3.2.2 Feedforward Propagation
In Feedforward Propagation, given the parameters of the Neural Network and a feature vector x, we want to compute
the probability that this feature vector belongs to a particular digit.
Suppose that we have m hidden units in total. Let a_j for 1 ≤ j ≤ m be the linear combination of the input
data and let z_j be the output of hidden unit j after applying an activation function (in this exercise,
we use the sigmoid as the activation function). For each hidden unit j (j = 1, 2, · · · , m), we can compute its
value as follows:
a_j = \sum_{p=1}^{d+1} w^{(1)}_{jp} x_p    (1)

z_j = \sigma(a_j) = \frac{1}{1 + \exp(-a_j)}    (2)
where w^{(1)}_{jp} = W(1)[j][p] is the weight of the connection from the p-th input feature to unit j in the hidden layer. Note
that we do not compute an output for the bias hidden node (m + 1); z_{m+1} is directly set to 1.
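Since the sigmoid in (2) is applied elementwise, numpy broadcasting lets a single line handle scalar, vector, and matrix inputs alike — a minimal sketch of the sigmoid function you will complete in nnScript.py:

    import numpy as np

    def sigmoid(z):
        # Elementwise 1 / (1 + exp(-z)), as in equation (2); works for
        # scalars, vectors, and matrices thanks to broadcasting.
        return 1.0 / (1.0 + np.exp(-z))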
The third layer of the neural network is the output layer, where the learned features in the hidden units
are linearly combined and a sigmoid function is applied to produce the output. Since in this assignment
we want to classify a hand-written digit image into its corresponding class, we use one-vs-all binary
classification, in which each output unit l (l = 1, 2, · · · , 10) represents the probability of an
image belonging to a particular digit. For this reason, the total number of output units is k = 10. Concretely,
for each output unit l (l = 1, 2, · · · , 10), we can compute its value as follows:
b_l = \sum_{j=1}^{m+1} w^{(2)}_{lj} z_j    (3)

o_l = \sigma(b_l) = \frac{1}{1 + \exp(-b_l)}    (4)
Now we have finished the Feedforward pass.
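Putting (1)-(4) together, the whole feedforward pass can be vectorized over all n examples at once. Below is a minimal sketch reusing the sigmoid sketch above; the function name and the convention of appending the bias column last are illustrative choices, and only have to match the layout of your weight matrices:

    import numpy as np

    def feedforward(data, w1, w2):
        # data: n x d inputs; w1: m x (d+1) is W(1); w2: k x (m+1) is W(2).
        n = data.shape[0]
        # Append the bias term (a column of ones) to the input: n x (d+1).
        x = np.hstack((data, np.ones((n, 1))))
        z = sigmoid(x.dot(w1.T))              # hidden outputs, n x m; (1)-(2)
        # Append the bias term at the hidden layer: n x (m+1).
        z = np.hstack((z, np.ones((n, 1))))
        o = sigmoid(z.dot(w2.T))              # outputs, n x k; (3)-(4)
        return x, z, o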
3.2.3 Error function and Backpropagation
The error function in this case is the negative log-likelihood error function, which can be written as follows:
J(W^{(1)}, W^{(2)}) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{l=1}^{k} \left( y_{il} \ln o_{il} + (1 - y_{il}) \ln(1 - o_{il}) \right)    (5)
where y_{il} is the l-th target value in the 1-of-K coding scheme for input data i, and o_{il} is the output at the l-th
output node for the i-th data example (see (4)).
Because of the form of the error function in equation (5), we can separate it into an error term
for each input data point x_i:
J(W^{(1)}, W^{(2)}) = \frac{1}{n} \sum_{i=1}^{n} J_i(W^{(1)}, W^{(2)})    (6)
where

J_i(W^{(1)}, W^{(2)}) = -\sum_{l=1}^{k} \left( y_{il} \ln o_{il} + (1 - y_{il}) \ln(1 - o_{il}) \right)    (7)
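For concreteness, the 1-of-K target matrix y used in (5) and (7) can be built from an integer label vector with numpy — a minimal sketch (names are illustrative):

    import numpy as np

    def one_of_k(labels, k=10):
        # labels: length-n vector of integer class labels in {0, ..., k-1}.
        y = np.zeros((labels.shape[0], k))
        y[np.arange(labels.shape[0]), labels.astype(int)] = 1.0
        return y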
One way to learn the model parameters of a neural network is to initialize the weights to some random
numbers, compute the output values (feed-forward), compute the error in prediction, transmit this
error backward, and update the weights accordingly (error backpropagation).
The feed-forward step can be computed directly using formulas (1), (2), (3), and (4).
The error backpropagation step, on the other hand, requires computing the derivative of the error function
with respect to each weight.
Consider the derivative of the error function with respect to the weight from hidden unit j to output
unit l, where j = 1, 2, · · · , m + 1 and l = 1, · · · , 10:
\frac{\partial J_i}{\partial w^{(2)}_{lj}} = \frac{\partial J_i}{\partial o_l} \frac{\partial o_l}{\partial b_l} \frac{\partial b_l}{\partial w^{(2)}_{lj}}    (8)

= \delta_l z_j    (9)

where

\delta_l = \frac{\partial J_i}{\partial o_l} \frac{\partial o_l}{\partial b_l} = -\left( \frac{y_l}{o_l} - \frac{1 - y_l}{1 - o_l} \right) (1 - o_l) o_l = o_l - y_l
Note that we are dropping the subscript i for simplicity. The error function (log loss) that we are using
in (5) is different from the squared loss error function that we discussed in class. Note how the
choice of the error function has "simplified" the expression for the error!
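In vectorized form, summing (9) over all n training examples (as in (13) below) reduces to a single matrix product. A minimal sketch continuing the notation of the feedforward sketch above:

    def grad_output_layer(z, o, y):
        # z: n x (m+1) hidden outputs (with bias column); o: n x k outputs;
        # y: n x k 1-of-K targets. Implements (9), averaged over n examples.
        n = y.shape[0]
        delta = o - y                # n x k; the delta_l = o_l - y_l terms
        return delta.T.dot(z) / n    # k x (m+1) gradient for W(2)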
Similarly, the derivative of the error function with respect to the weight from the p-th input feature
to hidden unit j, where p = 1, 2, · · · , d + 1 and j = 1, · · · , m, can be computed as follows:
\frac{\partial J_i}{\partial w^{(1)}_{jp}} = \sum_{l=1}^{k} \frac{\partial J_i}{\partial o_l} \frac{\partial o_l}{\partial b_l} \frac{\partial b_l}{\partial z_j} \frac{\partial z_j}{\partial a_j} \frac{\partial a_j}{\partial w^{(1)}_{jp}}    (10)

= \sum_{l=1}^{k} \delta_l w^{(2)}_{lj} (1 - z_j) z_j x_p    (11)

= (1 - z_j) z_j \left( \sum_{l=1}^{k} \delta_l w^{(2)}_{lj} \right) x_p    (12)
Note that we do not compute the gradient for the weights at the bias hidden node.
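Equation (12), vectorized over all examples and skipping the bias node as noted above — a minimal sketch continuing the previous snippets (the bias is assumed to be the last column of z and of W(2)):

    def grad_hidden_layer(x, z, delta, w2):
        # x: n x (d+1) inputs (with bias column); z: n x (m+1) hidden
        # outputs; delta = o - y from (9), n x k; w2: k x (m+1).
        n = x.shape[0]
        zj = z[:, :-1]                                # drop the bias column: n x m
        back = delta.dot(w2[:, :-1])                  # sum over l of delta_l * w2_lj: n x m
        return ((1.0 - zj) * zj * back).T.dot(x) / n  # m x (d+1) gradient for W(1)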
After computing the derivative of the error function with respect to the weight of each connection in the neural
network, we can now write the formula for the gradient of the error function:
\nabla J(W^{(1)}, W^{(2)}) = \frac{1}{n} \sum_{i=1}^{n} \nabla J_i(W^{(1)}, W^{(2)})    (13)
We can again use gradient descent to update each weight (denoted in general as w) with the following
rule:
w^{new} = w^{old} - \gamma \nabla J(w^{old})    (14)
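As a toy, self-contained illustration of the update rule (14) (in the assignment itself you will instead use scipy's conjugate gradient routine, described in Section 3.2.5):

    import numpy as np

    # Minimize the toy objective J(w) = ||w||^2 / 2, whose gradient is w.
    def obj_and_grad(w):
        return 0.5 * w.dot(w), w

    w = np.array([1.0, -2.0, 0.5])   # arbitrary starting point
    gamma = 0.1                      # learning rate
    for iteration in range(100):
        value, grad = obj_and_grad(w)
        w = w - gamma * grad         # the update rule (14)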
3.2.4 Regularization in Neural Network
In order to avoid the overfitting problem (the learned model fits the training data well but generalizes
poorly to validation data), we can add a regularization term to our error function to
control the magnitude of the parameters of the Neural Network. Our objective function can therefore be rewritten
as follows:
\widetilde{J}(W^{(1)}, W^{(2)}) = J(W^{(1)}, W^{(2)}) + \frac{\lambda}{2n} \left( \sum_{j=1}^{m} \sum_{p=1}^{d+1} \left( w^{(1)}_{jp} \right)^2 + \sum_{l=1}^{k} \sum_{j=1}^{m+1} \left( w^{(2)}_{lj} \right)^2 \right)    (15)
where λ is the regularization coefficient.
With this new objective function, the partial derivative with respect to a weight from the hidden layer
to the output layer can be calculated as follows:
\frac{\partial \widetilde{J}}{\partial w^{(2)}_{lj}} = \frac{1}{n} \left( \sum_{i=1}^{n} \frac{\partial J_i}{\partial w^{(2)}_{lj}} + \lambda w^{(2)}_{lj} \right)    (16)
Similarly, the partial derivative of the new objective function with respect to a weight from the input layer to the hidden
layer can be calculated as follows:
\frac{\partial \widetilde{J}}{\partial w^{(1)}_{jp}} = \frac{1}{n} \left( \sum_{i=1}^{n} \frac{\partial J_i}{\partial w^{(1)}_{jp}} + \lambda w^{(1)}_{jp} \right)    (17)
With these new formulas for computing the objective function (15) and its partial derivatives with respect to
the weights, (16) and (17), we can again use gradient descent to find the minimum of the objective function.
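In code, the change from (6) and (13) to (15)-(17) is small: add the penalty of (15) to the objective value and a λw/n term to each gradient. A minimal sketch, assuming obj_val, grad_w1, and grad_w2 already hold the unregularized objective and gradients averaged over the n examples:

    def regularize(obj_val, grad_w1, grad_w2, w1, w2, lambdaval, n):
        # Penalty term of (15): (lambda / 2n) times the sum of all squared weights.
        penalty = (lambdaval / (2.0 * n)) * ((w1 ** 2).sum() + (w2 ** 2).sum())
        # Extra lambda * w / n terms from (16) and (17).
        return (obj_val + penalty,
                grad_w1 + (lambdaval / n) * w1,
                grad_w2 + (lambdaval / n) * w2)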
3.2.5 Python implementation of Neural Network
In the supporting files, we have provided the base code for you to complete. In particular, you have to
complete the following functions in Python:
• sigmoid: computes the sigmoid function. The input can be a scalar value, a vector, or a matrix.
• nnObjFunction: computes the objective function of the Neural Network with regularization, and the gradient
of the objective function.
• nnPredict: predicts the labels of data given the parameters of the Neural Network.
Details of how to implement the required functions are explained in the Python code.
Optimization: In general, the learning phase of a Neural Network consists of 2 tasks. The first task is
to compute the value and gradient of the error function given the Neural Network parameters. The second task
is to optimize the error function given that value and gradient. As explained
earlier, we can use gradient descent for the optimization. In this assignment, you
have to use the Python scipy function scipy.optimize.minimize (with the option method='CG'
for conjugate gradient descent) to perform the optimization. In principle, conjugate gradient descent is similar to gradient descent but chooses
a more sophisticated learning rate γ in each iteration, so it converges faster than gradient
descent. Details of how to use minimize are provided here: http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.optimize.minimize.html.
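A minimal sketch of the call (the variable names and the argument tuple are illustrative and must match whatever signature your nnObjFunction uses; jac=True tells minimize that the objective function returns both the value and the gradient):

    import numpy as np
    from scipy.optimize import minimize

    # Flatten both weight matrices into a single parameter vector.
    initial_weights = np.concatenate((w1.flatten(), w2.flatten()))
    args = (n_input, n_hidden, n_class, train_data, train_label, lambdaval)
    opts = {'maxiter': 50}   # iteration budget; raise it for a better fit
    result = minimize(nnObjFunction, initial_weights, jac=True, args=args,
                      method='CG', options=opts)
    trained_weights = result.x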
We use regularization in the Neural Network to avoid the overfitting problem (more about this will be discussed
in class). You are expected to try different values of λ and observe their effect on prediction accuracy on the validation
set. Your report should include diagrams explaining the relation between λ and the performance of the Neural
Network. Moreover, by plotting the value of λ against the accuracy of the Neural Network, you should
explain in your report how to choose an appropriate hyper-parameter λ that avoids both the underfitting and the
overfitting problem. You can vary λ from 0 (no regularization) to 60 in increments of 5 or 10.
You are also expected to try different numbers of hidden units and observe their effect on the performance of the Neural
Network. Training a Neural Network is very slow, especially when the number of hidden units is large,
so you should start with a small number of hidden units and gradually increase the size to see how it
affects the training time. Your report should include diagrams explaining the relation between the number of
hidden units and the training time. Recommended values: 4, 8, 12, 16, 20.
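A minimal sketch of such a sweep (train_and_evaluate is a hypothetical helper that trains the network with the given settings and returns accuracy on the validation set):

    import time

    results = []
    for n_hidden in [4, 8, 12, 16, 20]:
        for lambdaval in range(0, 65, 10):
            start = time.time()
            accuracy = train_and_evaluate(n_hidden, lambdaval)
            results.append((n_hidden, lambdaval, accuracy, time.time() - start))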
4 TensorFlow Library
In this assignment you will implement only a single hidden layer Neural Network. You will realize that implementing
multiple layers by hand can be a very cumbersome coding task. However, additional layers can provide better
modeling of the data set. The analysis of the challenging CelebA data set will show how adding more layers
can improve the performance of the Neural Network. To experiment with Neural Networks with multiple
layers, we will use Google's TensorFlow library (https://www.tensorflow.org/).
Your experiments should include the following:
• Evaluate the accuracy of the single hidden layer Neural Network on the CelebA data set (test data only),
to distinguish between two classes: wearing glasses and not wearing glasses. Use facennScript.py to
obtain these results.
• Evaluate the accuracy of the deep Neural Network (try 3, 5, and 7 hidden layers) on the CelebA data set (test
data only). Use deepnnScript.py to obtain these results.
• Compare the performance of single vs. deep Neural Networks in terms of accuracy on test data and
learning time.
5 Submission
You are required to submit a single file called proj1.zip using UBLearns.
File proj1.zip must contain 2 folders: report and code.
• Folder report contains your report file (in pdf format). Please indicate the team members and your
course number on the top of the report.
• Folder code must contain the following updated files: nnScript.py and params.pickle (see footnote 1). File params.pickle
contains the learned parameters of the Neural Network. Concretely, file params.pickle must contain the
following variables: optimal n hidden (number of units in the hidden layer), w1 (weight matrix W(1)
as mentioned in section 3.2.1), w2 (weight matrix W(2) as mentioned in section 3.2.1), and optimal λ
(regularization coefficient λ as mentioned in section 3.2.4) (see footnote 2).
Using UBLearns Submission: In the groups page of the UBLearns website you will see groups
called “4/574 Project Group x”. Please choose any available group number for your group and join the
group. All project group members must join the same group. Please do not join any other group on
UBLearns that you are not part of. You should submit one solution per group through the groups page.
Project report: The hard copy of the report will be collected in class on the due date. Your report should include
the following:
• Explanation of how to choose the hyper-parameters for Neural Network (number of hidden units,
regularization term λ).
Footnote 1: Check this to learn how to pickle objects in Python: https://wiki.python.org/moin/UsingPickle
Footnote 2: If you want to write more supporting functions to complete the required functions, you should include these supporting
functions and a README file which explains your supporting functions.
• For CSE574 students only: Compare the results of deep neural network and neural network with one
hidden layer on the CelebA data set.
6 Grading scheme
The TAs will deploy a testing script that will test the functionality of the individual functions that you submit
within the nnScript.py file. Full points will be awarded if the output of a function exactly matches the
expected output. A second grading script will load the params.pickle file that you submit and evaluate it on a
small testing data set. You get full points (10) if the accuracy using your model parameters is within ±5%
of the accuracy reported by our code. Note that this data set will not be made available to you.
• For CSE474 students [Total 100 points]:
– Successfully implement Neural Network: 60 points (preprocess() [10 points], sigmoid() [10 points],
nnObjFunction() [30 points], nnPredict() [10 points]).
– Project report: 40 points
∗ Explanation with supporting figures of how to choose the hyper-parameter for Neural Network: 30 points
∗ Accuracy of classification method on the handwritten digits test data: 10 points
• For CSE574 students [Total 120 points]:
– Successfully implement Neural Network: 60 points (preprocess() [10 points], sigmoid() [10 points],
nnObjFunction() [30 points], nnPredict() [10 points]).
– Project report: 60 points
∗ Explanation with supporting figures of how to choose the hyper-parameter for Neural Network: 30 points
∗ Accuracy of classification method on the handwritten digits test data: 10 points
∗ Accuracy of classification method on the CelebA data set: 10 points
∗ Comparison of your neural network with a deep neural network (using TensorFlow) in terms
of accuracy and training time: 10 points
• Students in the CSE474 section may attempt the CSE574 requirements (CelebA data analysis, comparison
with deep neural networks) for extra credit.
7 Computing Resources
You are allowed to implement the project on your personal computer using Python 3.4 or above. You
will need the numpy and scipy libraries. If you need to use departmental resources, you will need to use
metallica.cse.buffalo.edu, which has Python 3.4.3 and the required libraries installed.
Students attempting to use the TensorFlow library have two options:
1. Install TensorFlow on personal machines. Detailed installation information is here - https://www.tensorflow.org/. Note that, since TensorFlow is a relatively new library, you might encounter installation issues depending on your OS and other library versions. We will not be providing any detailed
support regarding TensorFlow installation. If issues persist, we recommend using option 2.
2. Use springsteen.cse.buffalo.edu. If you are registered in the class, you should have an account on that server. The server already has Python 3.4.3 and TensorFlow 0.12.1 installed. Please use
/util/bin/python for Python 3. Note that TensorFlow will not work on metallica.cse.buffalo.edu.
References
[1] LeCun, Yann; Cortes, Corinna; Burges, Christopher J.C. "MNIST handwritten digit database".
[2] Bishop, Christopher M. "Pattern Recognition and Machine Learning (Information Science and Statistics)"
(2007).
[3] Liu, Ziwei; Luo, Ping; Wang, Xiaogang; Tang, Xiaoou. "Deep Learning Face Attributes in the Wild",
Proceedings of the International Conference on Computer Vision (ICCV) (2015).