Programming Exercise 3:
Multi-class Classification and Neural Networks
1 Multi-class Classification
For this exercise, you will use logistic regression and neural networks to
recognize handwritten digits (from 0 to 9). Automated handwritten digit
recognition is widely used today - from recognizing zip codes (postal codes)
on mail envelopes to recognizing amounts written on bank checks. This
exercise will show you how the methods you've learned can be used for this
classification task.
In the first part of the exercise, you will extend your previous implementation
of logistic regression and apply it to one-vs-all classification.
1.1 Dataset
You are given a data set in ex3data1.mat that contains 5000 training
examples of handwritten digits.[1] The .mat format means that the data
has been saved in a native Octave/Matlab matrix format, instead of a text
(ASCII) format like a csv-file. These matrices can be read directly into your
program by using the load command. After loading, matrices of the correct
dimensions and values will appear in your program's memory. The matrices
will already be named, so you do not need to assign names to them.
[1] This is a subset of the MNIST handwritten digit dataset
(http://yann.lecun.com/exdb/mnist/).
% Load saved matrices from file
load('ex3data1.mat');
% The matrices X and y will now be in your Octave environment
There are 5000 training examples in ex3data1.mat, where each training
example is a 20 pixel by 20 pixel grayscale image of the digit. Each pixel is
represented by a floating point number indicating the grayscale intensity at
that location. The 20 by 20 grid of pixels is "unrolled" into a 400-dimensional
vector. Each of these training examples becomes a single row in our data
matrix X. This gives us a 5000 by 400 matrix X where every row is a training
example for a handwritten digit image.
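$$
X = \begin{bmatrix}
\text{---}\ (x^{(1)})^T\ \text{---} \\
\text{---}\ (x^{(2)})^T\ \text{---} \\
\vdots \\
\text{---}\ (x^{(m)})^T\ \text{---}
\end{bmatrix}
$$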
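The unrolling described above can be sketched on a synthetic image. This is only an illustration: the variable names img, row and restored are hypothetical, the pixel values are made up rather than taken from ex3data1.mat, and the sketch assumes the column-by-column order that Octave's reshape uses.

```octave
% Hypothetical 20 x 20 "image" with synthetic values (not from ex3data1.mat)
img = reshape(1:400, 20, 20) / 400;

% Unroll column by column into a 1 x 400 row, the shape of one row of X
row = img(:)';

% Reverse the unrolling to recover the 20 x 20 grid
restored = reshape(row, 20, 20);

disp(size(row));                % dimensions of the unrolled example
disp(isequal(restored, img));   % displays 1 when the round trip is lossless
```

Stacking m such rows on top of each other gives an m by 400 matrix with the same layout as X.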
The second part of the training set is a 5000-dimensional vector y that
contains labels for the training set. To make things more compatible with
Octave/Matlab indexing, where there is no zero index, we have mapped the
digit zero to the value ten. Therefore, a "0" digit is labeled as "10", while
the digits "1" to "9" are labeled as "1" to "9" in their natural order.
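If you ever need the actual digit values back, for example when printing a prediction, one possible way to undo this mapping is mod. The label vector below is a made-up example and digit_values is a hypothetical name, not something defined by the exercise files.

```octave
% Hypothetical labels in the exercise's convention (10 stands for digit 0)
y_example = [10; 1; 2; 9; 10];

% mod maps 10 -> 0 and leaves the labels 1..9 unchanged
digit_values = mod(y_example, 10);

disp(digit_values');
```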
1.2 Visualizing the data
You will begin by visualizing a subset of the training set. In Part 1 of ex3.m,
the code randomly selects 100 rows from X and passes those rows
to the displayData function. This function maps each row to a 20 pixel by
20 pixel grayscale image and displays the images together. We have provided
the displayData function, and you are encouraged to examine the code to
see how it works. After you run this step, you should see an image like Figure
1.
Figure 1: Examples from the dataset
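Random row selection of this kind can be sketched with randperm. This is an assumption about one reasonable way to do it, not a copy of ex3.m; X_demo is a synthetic stand-in for X, and the call to displayData is left out because that function ships with the exercise.

```octave
% Synthetic stand-in for the data matrix (500 examples, 400 pixels each)
X_demo = rand(500, 400);
m = size(X_demo, 1);

% Shuffle the row indices and keep the first 100, so no row is picked twice
rand_indices = randperm(m);
sel = X_demo(rand_indices(1:100), :);

disp(size(sel));   % dimensions of the selected subset
```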
1.3 Vectorizing Logistic Regression
You will be using multiple one-vs-all logistic regression models to build a
multi-class classifier. Since there are 10 classes, you will need to train 10
separate logistic regression classifiers. To make this training efficient, it is
important to ensure that your code is well vectorized. In this section, you
will implement a vectorized version of logistic regression that does not employ
any for loops. You can use your code from the last exercise as a starting point
for this exercise.
1.3.1 Vectorizing the cost function
We will begin by writing a vectorized version of the cost function. Recall
that in (unregularized) logistic regression, the cost function is
$$
J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ -y^{(i)} \log(h_\theta(x^{(i)})) - (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right].
$$
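To compute each element in the summation, we have to compute $h_\theta(x^{(i)})$
for every example $i$, where $h_\theta(x^{(i)}) = g(\theta^T x^{(i)})$ and
$g(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function. It turns out that we can
compute this quickly for all our examples by using matrix multiplication. Let us
define $X$ and $\theta$ as
$$
X = \begin{bmatrix}
\text{---}\ (x^{(1)})^T\ \text{---} \\
\text{---}\ (x^{(2)})^T\ \text{---} \\
\vdots \\
\text{---}\ (x^{(m)})^T\ \text{---}
\end{bmatrix}
\quad \text{and} \quad
\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix}.
$$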
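Then, by computing the matrix product $X\theta$, we have
$$
X\theta = \begin{bmatrix}
\text{---}\ (x^{(1)})^T \theta\ \text{---} \\
\text{---}\ (x^{(2)})^T \theta\ \text{---} \\
\vdots \\
\text{---}\ (x^{(m)})^T \theta\ \text{---}
\end{bmatrix}
= \begin{bmatrix}
\text{---}\ \theta^T (x^{(1)})\ \text{---} \\
\text{---}\ \theta^T (x^{(2)})\ \text{---} \\
\vdots \\
\text{---}\ \theta^T (x^{(m)})\ \text{---}
\end{bmatrix}.
$$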
In the last equality, we used the fact that $a^T b = b^T a$ if $a$ and $b$ are
vectors. This allows us to compute the products $\theta^T x^{(i)}$ for all our
examples $i$ in one line of code.
Your job is to write the unregularized cost function in the file
lrCostFunction.m. Your implementation should use the strategy we presented
above to calculate $\theta^T x^{(i)}$. You should also use a vectorized approach
for the rest of the cost function. A fully vectorized version of
lrCostFunction.m should not contain any loops.
(Hint: You might want to use the element-wise multiplication operation
(.*) and the sum operation sum when writing this function.)
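As a rough illustration of the strategy, the sketch below computes the hypothesis for all examples with one matrix product and then the unregularized cost with .* and sum. It is one possible approach under stated assumptions, not the reference solution: the data, the anonymous function g (standing in for the exercise's sigmoid function) and the variable names are all made up for the example.

```octave
% Element-wise sigmoid; a stand-in for the sigmoid function from the exercise
g = @(z) 1 ./ (1 + exp(-z));

% Hypothetical toy data: 3 examples, a bias column plus 1 feature
X_toy = [1 0.5; 1 -1.5; 1 2.0];
y_toy = [1; 0; 1];
theta = [0; 0];
m = length(y_toy);

% One matrix product gives theta' * x^(i) for every example at once
h = g(X_toy * theta);

% Vectorized cost: element-wise products followed by a single sum
J = (1 / m) * sum(-y_toy .* log(h) - (1 - y_toy) .* log(1 - h));

disp(J);
```

With theta set to all zeros every hypothesis value is 0.5, so the cost equals log(2) regardless of the data, which makes a convenient sanity check.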
1.3.2 Vectorizing the gradient
Recall that the gradient of the (unregularized) logistic regression cost is a
vector where the $j$th element is defined as
$$
\frac{\partial J}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}.
$$
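To vectorize this operation over the dataset, we start by writing out all
the partial derivatives explicitly for all $\theta_j$,
$$
\begin{aligned}
\begin{bmatrix}
\frac{\partial J}{\partial \theta_0} \\
\frac{\partial J}{\partial \theta_1} \\
\frac{\partial J}{\partial \theta_2} \\
\vdots \\
\frac{\partial J}{\partial \theta_n}
\end{bmatrix}
&= \frac{1}{m}
\begin{bmatrix}
\sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_0^{(i)} \\
\sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_1^{(i)} \\
\sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_2^{(i)} \\
\vdots \\
\sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_n^{(i)}
\end{bmatrix} \\
&= \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x^{(i)} \\
&= \frac{1}{m} X^T (h_\theta(x) - y),
\end{aligned}
\tag{1}
$$
where
$$
h_\theta(x) - y = \begin{bmatrix}
h_\theta(x^{(1)}) - y^{(1)} \\
h_\theta(x^{(2)}) - y^{(2)} \\
\vdots \\
h_\theta(x^{(m)}) - y^{(m)}
\end{bmatrix}.
$$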
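Note that $x^{(i)}$ is a vector, while $(h_\theta(x^{(i)}) - y^{(i)})$ is a
scalar (single number). To understand the last step of the derivation, let
$\beta_i = (h_\theta(x^{(i)}) - y^{(i)})$ and observe that:
$$
\sum_i \beta_i x^{(i)} =
\begin{bmatrix}
| & | & & | \\
x^{(1)} & x^{(2)} & \cdots & x^{(m)} \\
| & | & & |
\end{bmatrix}
\begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_m \end{bmatrix}
= X^T \beta,
$$
where the values $\beta_i = (h_\theta(x^{(i)}) - y^{(i)})$.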
The expression above allows us to compute all the partial derivatives
without any loops. If you are comfortable with linear algebra, we encourage
you to work through the matrix multiplications above to convince yourself
that the vectorized version does the same computations. You should now
implement Equation 1 to compute the correct vectorized gradient. Once you
are done, complete the function lrCostFunction.m by implementing the
gradient.
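One way to convince yourself that Equation 1 matches the element-by-element definition is to compute both on a small example. The sketch below is an illustration only: the toy data and the names g, X_toy, y_toy, grad and grad_loop are assumptions, and the loop version exists purely for comparison.

```octave
% Element-wise sigmoid; a stand-in for the sigmoid function from the exercise
g = @(z) 1 ./ (1 + exp(-z));

% Hypothetical toy data: 3 examples, a bias column plus 1 feature
X_toy = [1 0.5; 1 -1.5; 1 2.0];
y_toy = [1; 0; 1];
theta = [0.1; -0.2];
m = length(y_toy);

% Vectorized gradient, following Equation 1
grad = (1 / m) * X_toy' * (g(X_toy * theta) - y_toy);

% Loop version: accumulate beta_i * x^(i) one example at a time
grad_loop = zeros(size(theta));
for i = 1:m
  xi = X_toy(i, :)';                              % x^(i) as a column vector
  grad_loop = grad_loop + (g(theta' * xi) - y_toy(i)) * xi;
end
grad_loop = grad_loop / m;

disp(max(abs(grad - grad_loop)));   % difference between the two versions
```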
Debugging Tip: Vectorizing code can sometimes be tricky. One common
strategy for debugging is to print out the sizes of the matrices you
are working with using the size function. For example, given a data matrix
$X$ of size 100 × 20 (100 examples, 20 features) and $\theta$, a vector with
dimensions 20 × 1, you can observe that $X\theta$ is a valid multiplication
operation, while $\theta X$ is not. Furthermore, if you have a non-vectorized
version of your code, you can compare the output of your vectorized code and
non-vectorized code to make sure that they produce the same outputs.
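The size check from the tip can be tried directly. The matrices below are placeholders filled with ones and zeros, and the flag reversed_is_valid is a hypothetical name used only to record whether the reversed product raises an error.

```octave
X_demo = ones(100, 20);      % 100 examples, 20 features (placeholder values)
theta_demo = zeros(20, 1);   % 20 x 1 parameter vector

disp(size(X_demo));
disp(size(theta_demo));

% Inner dimensions agree (20 and 20), so this product is valid
p = X_demo * theta_demo;
disp(size(p));

% Inner dimensions disagree (1 and 100), so this product raises an error
reversed_is_valid = true;
try
  q = theta_demo * X_demo;
catch
  reversed_is_valid = false;
end
disp(reversed_is_valid);
```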
1.3.3 Vectorizing regularized logistic regression
After you have implemented vectorization for logistic regression, you will now
add regularization to the cost function. Recall that for regularized logistic
regression, the cost function is defined as
$$
J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ -y^{(i)} \log(h_\theta(x^{(i)})) - (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2.
$$
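Note that you should not be regularizing $\theta_0$, which is used for the bias
term.
Correspondingly, the partial derivative of the regularized logistic regression
cost for $\theta_j$ is defined as
$$
\frac{\partial J(\theta)}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \quad \text{for } j = 0
$$
$$
\frac{\partial J(\theta)}{\partial \theta_j} = \left( \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \right) + \frac{\lambda}{m} \theta_j \quad \text{for } j \geq 1
$$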
Now modify your code in lrCostFunction to account for regularization.
Once again, you should not put any loops into your code.
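A loop-free way to leave the bias term out of the penalty is to work with a copy of the parameter vector whose first entry is zero. The sketch below shows that idea on toy data. It is one possible approach rather than the expected solution, and the names g, X_toy, y_toy, theta_reg and grad_unreg are assumptions made for the example.

```octave
% Element-wise sigmoid; a stand-in for the sigmoid function from the exercise
g = @(z) 1 ./ (1 + exp(-z));

% Hypothetical toy data: 3 examples, a bias column plus 1 feature
X_toy = [1 0.5; 1 -1.5; 1 2.0];
y_toy = [1; 0; 1];
theta = [0.1; -0.2];
lambda = 1;
m = length(y_toy);

h = g(X_toy * theta);

% Copy of theta with the bias entry zeroed, so theta_0 is never penalized
theta_reg = [0; theta(2:end)];

% Regularized cost and gradient, both without loops
J = (1 / m) * sum(-y_toy .* log(h) - (1 - y_toy) .* log(1 - h)) ...
    + (lambda / (2 * m)) * sum(theta_reg .^ 2);
grad_unreg = (1 / m) * X_toy' * (h - y_toy);
grad = grad_unreg + (lambda / m) * theta_reg;

disp(J);
disp(grad');
```

Because theta_reg(1) is zero, the first gradient entry is identical to the unregularized one, which matches the j = 0 case above.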
Octave Tip: When implementing the vectorization for regularized logistic
regression, you might often want to only sum and update certain
elements of $\theta$. In Octave, you can index into the matrices to access and
update only certain elements. For example, A(:, 3:5) = B(:, 1:3)
replaces the columns 3 to 5 of A with the columns 1 to 3 from B. One
special keyword you can use in indexing is the end keyword.
This allows us to select columns (or rows) until the end of the matrix.
For example, A(:, 2:end) will only return elements from the 2nd to last
column of A. Thus, you could use this together with the sum and .^
operations to compute the sum of only the elements you are interested in
(e.g., sum(z(2:end).^2)). In the starter code, lrCostFunction.m, we
have also provided hints on yet another possible method for computing the
regularized gradient.
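The indexing operations from the tip can be tried on small made-up matrices. The values of A, B and z below are arbitrary and chosen only so the results are easy to check by hand.

```octave
A = zeros(2, 5);
B = [1 2 3; 4 5 6];

% Overwrite columns 3 to 5 of A with columns 1 to 3 of B
A(:, 3:5) = B(:, 1:3);
disp(A);

% Columns 2 through the last column of A
tail = A(:, 2:end);
disp(size(tail));

% Sum of squares that skips the first element of z
z = [10; 1; 2];
s = sum(z(2:end) .^ 2);
disp(s);
```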
You should now submit your vectorized logistic regression cost function.
1.4 One-vs-all Classification
In this part of the exercise, you will implement one-vs-all classification by
training multiple regularized logistic regression classifiers, one for each of
the K classes in our dataset (Figure 1). In the handwritten digits dataset,
K = 10, but your code should work for any value of K.
You should now complete the code in oneVsAll.m to train one classifier for
each class. In particular, your code should return all the classifier parameters
in a matrix $\Theta \in \mathbb{R}^{K \times (N+1)}$, where each row of $\Theta$
corresponds to the learned logistic regression parameters for one class. You can
do this with a "for"-loop from 1 to K, training each classifier independently.
Note that the y argument to this function is a vector of labels from 1 to
10, where we have mapped the digit "0" to the label 10 (to avoid confusion
with indexing).
When training the classifier for class $k \in \{1, \ldots, K\}$, you will want an
m-dimensional vector of labels $y$, where $y_j \in \{0, 1\}$ indicates whether the
j-th training instance belongs to class k ($y_j = 1$), or if it belongs to a
different class ($y_j = 0$). You may find logical arrays helpful for this task.
Octave Tip: Logical arrays in Octave are arrays which contain binary (0
or 1) elements. In Octave, evaluating the expression a == b for a vector a
(of size m × 1) and scalar b will return a vector of the same size as a with
ones at positions where the elements of a are equal to b and zeroes where
they are different. To see how this works for yourself, try the following
code in Octave:
a = 1:10; % Create a and b
b = 3;
a == b % You should try different values of b here
Furthermore, you will be using fmincg for this exercise (instead of fminunc).
fmincg works similarly to fminunc, but is more efficient for dealing with
a large number of parameters.
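The overall shape of the one-vs-all training loop can be sketched as follows. To keep the example self-contained it does not call fmincg, which is supplied with the exercise; a fixed number of plain gradient descent steps is swapped in as a stand-in optimizer. The toy data, step size, iteration count and the names Xb, yk and all_theta are assumptions for illustration only.

```octave
% Element-wise sigmoid; a stand-in for the sigmoid function from the exercise
g = @(z) 1 ./ (1 + exp(-z));

% Hypothetical toy data: 6 examples, 1 feature, 3 classes
X_toy = [-2; -1.8; 0; 0.2; 2; 2.1];
y_toy = [1; 1; 2; 2; 3; 3];
K = 3;
lambda = 0.01;

[m, n] = size(X_toy);
Xb = [ones(m, 1) X_toy];          % add the bias column
all_theta = zeros(K, n + 1);      % one row of parameters per class

for k = 1:K
  yk = double(y_toy == k);        % 1 where the example is class k, else 0
  theta = zeros(n + 1, 1);
  for iter = 1:500                % gradient descent stand-in, NOT fmincg
    h = g(Xb * theta);
    grad = (1 / m) * Xb' * (h - yk) + (lambda / m) * [0; theta(2:end)];
    theta = theta - 0.5 * grad;
  end
  all_theta(k, :) = theta';       % store the classifier for class k
end

disp(all_theta);
```

The only loop that matters for the exercise is the outer one over classes; the inner loop would be replaced by a single call to the optimizer.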
After you have correctly completed the code for oneVsAll.m, the script
ex3.m will continue to use your oneVsAll function to train a multi-class
classifier.
You should now submit the training function for one-vs-all classification.
1.4.1 One-vs-all Prediction
After training your one-vs-all classifier, you can now use it to predict the
digit contained in a given image. For each input, you should compute the
"probability" that it belongs to each class using the trained logistic regression
classifiers. Your one-vs-all prediction function will pick the class for which the
corresponding logistic regression classifier outputs the highest probability and
return the class label (1, 2,..., or K) as the prediction for the input example.
You should now complete the code in predictOneVsAll.m to use the
one-vs-all classifier to make predictions.
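Picking the most probable class for every example can be done with a single matrix product followed by max along the rows. The sketch below uses a hand-made parameter matrix rather than trained values, and the names all_theta, X_toy, probs and p are assumptions for the example.

```octave
% Element-wise sigmoid; a stand-in for the sigmoid function from the exercise
g = @(z) 1 ./ (1 + exp(-z));

% Hypothetical parameters: K = 3 classes, n = 2 features, one row per class
all_theta = [0 -2  0;
             0  2 -2;
             0  0  2];

% Hypothetical inputs: 3 examples with 2 features each
X_toy = [-1 -1; 1 -1; 0.5 2];
m = size(X_toy, 1);

% m x K matrix: entry (i, k) is classifier k's output for example i
probs = g([ones(m, 1) X_toy] * all_theta');

% Column index of the largest entry in each row is the predicted label
[~, p] = max(probs, [], 2);

disp(p');
```

Since the sigmoid is increasing, taking max over X * all_theta' directly would select the same labels.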
Once you are done, ex3.m will call your predictOneVsAll function using
the learned value of $\Theta$. You should see that the training set accuracy is
about 94.9% (i.e., it classifies 94.9% of the examples in the training set
correctly).
You should now submit the prediction function for one-vs-all classification.
2 Neural Networks
In the previous part of this exercise, you implemented multi-class logistic
regression to recognize handwritten digits. However, logistic regression cannot
form more complex hypotheses as it is only a linear classifier.[2]
In this part of the exercise, you will implement a neural network to rec-
ognize handwritten digits using the same training set as before. The neural
network will be able to represent complex models that form non-linear hy-
potheses. For this week, you will be using parameters from a neural network
that we have already trained. Your goal is to implement the feedforward
propagation algorithm to use our weights for prediction. In next week's ex-
ercise, you will write the backpropagation algorithm for learning the neural
network parameters.
The provided script, ex3_nn.m, will help you step through this exercise.
2.1 Model representation
Our neural network is shown in Figure 2. It has 3 layers: an input layer, a
hidden layer and an output layer. Recall that our inputs are pixel values of
digit images. Since the images are of size 20 × 20, this gives us 400 input layer
units (excluding the extra bias unit which always outputs +1). As before,
the training data will be loaded into the variables X and y.
You have been provided with a set of network parameters
$(\Theta^{(1)}, \Theta^{(2)})$ already trained by us. These are stored in
ex3weights.mat and will be loaded by ex3_nn.m into Theta1 and Theta2. The
parameters have dimensions that are sized for a neural network with 25 units in
the second layer and 10 output units (corresponding to the 10 digit classes).
% Load saved matrices from file
load('ex3weights.mat');
% The matrices Theta1 and Theta2 will now be in your Octave
% environment
% Theta1 has size 25 x 401
% Theta2 has size 10 x 26
[2] You could add more features (such as polynomial features) to logistic
regression, but that can be very expensive to train.
Figure 2: Neural network model.
2.2 Feedforward Propagation and Prediction
Now you will implement feedforward propagation for the neural network. You
will need to complete the code in predict.m to return the neural network's
prediction.
You should implement the feedforward computation that computes $h_\theta(x^{(i)})$
for every example $i$ and returns the associated predictions. Similar to the
one-vs-all classification strategy, the prediction from the neural network will
be the label that has the largest output $(h_\theta(x))_k$.
Implementation Note: The matrix X contains the examples in rows.
When you complete the code in predict.m, you will need to add the
column of 1's to the matrix. The matrices Theta1 and Theta2 contain
the parameters for each unit in rows. Specifically, the first row of Theta1
corresponds to the first hidden unit in the second layer. In Octave, when
you compute $z^{(2)} = \Theta^{(1)} a^{(1)}$, be sure that you index (and if
necessary, transpose) X correctly so that you get $a^{(l)}$ as a column vector.
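A feedforward pass of this kind can be sketched on a tiny hypothetical network. The sketch processes all examples at once with the examples kept in rows, which is the transposed equivalent of the column-vector formulation in the note, so each layer multiplies by the transpose of its weight matrix. The weights here are hand-picked and much smaller than the 25 x 401 and 10 x 26 matrices in ex3weights.mat; the names a1, a2, a3 and p are assumptions.

```octave
% Element-wise sigmoid; a stand-in for the sigmoid function from the exercise
g = @(z) 1 ./ (1 + exp(-z));

% Hypothetical tiny network: 2 inputs, 2 hidden units, 3 output classes
Theta1 = [0  5 -5;
          0 -5  5];               % 2 x 3, one row per hidden unit
Theta2 = [ 0  4 -4;
           0 -4  4;
          -4  0  0];              % 3 x 3, one row per output unit

X_toy = [1 0; 0 1];               % 2 examples in rows
m = size(X_toy, 1);

a1 = [ones(m, 1) X_toy];              % input layer plus bias column
a2 = [ones(m, 1) g(a1 * Theta1')];    % hidden activations plus bias column
a3 = g(a2 * Theta2');                 % output layer, one column per class

% Predicted label is the index of the largest output in each row
[~, p] = max(a3, [], 2);

disp(p');
```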
Once you are done, ex3_nn.m will call your predict function using the
loaded set of parameters for Theta1 and Theta2. You should see that the
accuracy is about 97.5%. After that, an interactive sequence will launch
displaying images from the training set one at a time, while the console prints
out the predicted label for the displayed image. To stop the image sequence,
press Ctrl-C.
You should now submit the neural network prediction function.