Machine Learning Homework 3
Comp540 The code base hw3.zip for the
assignment is an attachment to Assignment 3 on Canvas. You will add your code at the
indicated spots in the files there. Place your typeset answers to Problems 1, 2, 3, 4, and 5 in a
file called writeup.pdf. Please follow the new submission instructions. Set up a group for
yourself if you haven't already done so, or if your group composition has changed. When you submit,
please submit the following THREE items as separate attachments before the due date and
time:
• Your writeup pdf, named writeup.pdf.
• Your Jupyter notebooks, saved in HTML format. If there are multiple notebooks, submit
each one separately.
• The zip file containing your work (code, etc.). Be sure that the datasets you use are
NOT part of the zip, as this will increase the size considerably.
1 MAP and MLE parameter estimation (10 points)
Consider a data set D = {x^(i) | 1 ≤ i ≤ m} where each x^(i) is drawn from a Bernoulli distribution
with parameter θ. The elements of the data set are the results of flips of a coin, where
x^(i) = 1 represents heads and x^(i) = 0 represents tails. We will estimate the parameter θ,
the probability of the coin coming up heads, using the data set D.
• (5 points) Use the method of maximum likelihood estimation to derive an estimate for
θ from the coin flip results in D.
• (5 points) Assume a Beta prior distribution on θ with hyperparameters a and b. The
Beta distribution is chosen because it has the same form as the likelihood function for
D derived under the Bernoulli model (such a prior is called a conjugate prior):

Beta(θ | a, b) ∝ θ^(a−1) (1 − θ)^(b−1)

Derive the MAP estimate for θ using D and this prior distribution. Show that under
a uniform prior (a Beta distribution with a = b = 1), the MAP and MLE estimates of θ
are equal.
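For reference, the likelihood of D under the Bernoulli model (a standard fact, stated here only as a common starting point for both parts) is

L(θ) = ∏_{i=1}^{m} θ^{x^(i)} (1 − θ)^{1 − x^(i)},   so   log L(θ) = Σ_{i=1}^{m} [ x^(i) log θ + (1 − x^(i)) log(1 − θ) ].

The MLE maximizes log L(θ); the MAP estimate maximizes log L(θ) + log Beta(θ | a, b).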
2 Logistic regression and Gaussian Naive Bayes (15 points)
Consider a binary classification problem with dataset D = {(x^(i), y^(i)) | 1 ≤ i ≤ m; x^(i) ∈ ℝ^d,
y^(i) ∈ {0, 1}}. You will derive a connection between logistic regression and Gaussian Naive
Bayes for this classification problem. For logistic regression, we use the sigmoid function
g(θᵀx), where θ ∈ ℝ^(d+1) and we augment x with a 1 in front to account for the intercept
term θ_0. For the Gaussian Naive Bayes model, assume that the y's are drawn from a
Bernoulli distribution with parameter γ, and that each x_j from class 1 is drawn from a
univariate Gaussian distribution with mean µ_j^1 and variance σ_j^2, and each x_j from class 0
is drawn from a univariate Gaussian distribution with mean µ_j^0 and variance σ_j^2. Note that
the variance is the same for both classes; only the means are different.
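Written out explicitly, the model just described is (this merely restates the assumptions above; it adds nothing new):

P(y = 1) = γ,   P(x_j | y = c) = (1 / √(2π σ_j^2)) exp( −(x_j − µ_j^c)^2 / (2σ_j^2) )   for c ∈ {0, 1},

and the Naive Bayes assumption states that P(x_1, . . . , x_d | y = c) = ∏_{j=1}^{d} P(x_j | y = c).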
• (2 points) For logistic regression, what is the posterior probability for each class, i.e.,
P(y = 1|x) and P(y = 0|x)? Write the expression in terms of the parameter θ and the
sigmoid function.
• (5 points) Derive the posterior probabilities for each class, P(y = 1|x) and P(y = 0|x),
for the Gaussian Naive Bayes model, using Bayes rule, the (Gaussian) distribution on
the x_j's, j = 1, . . . , d, and the Naive Bayes assumption.
• (8 points) Assuming that class 1 and class 0 are equally likely (uniform class priors),
simplify the expression for P(y = 1|x) for Gaussian Naive Bayes. Show that with
appropriate parameterization, P(y = 1|x) for Gaussian Naive Bayes with uniform
priors is equivalent to P(y = 1|x) for logistic regression.
3 Reject option in classifiers (10 points)
In many classification problems one has the option either of assigning x to class j or, if one
is too uncertain, of choosing the reject option. If the cost of rejecting is less than the cost of
falsely classifying the object, rejecting may be the optimal action. Let αi mean you choose action
i, for i = 1, . . . , C + 1, where C is the number of classes and C + 1 is the reject action. Let
y = j be the true (but unknown) label of x. Define the loss function as follows
L(αi | y = j) =
    0    if i = j and i, j ∈ {1, . . . , C}
    λr   if i = C + 1
    λs   otherwise
In other words, you incur 0 loss if you correctly classify, you incur λr loss (cost) if you choose
the reject option, and you incur λs loss (cost) if you make a misclassification error.
• (5 points) Show that the minimum risk is obtained if we decide y = j if p(y = j | x) ≥
p(y = k | x) for all k (i.e., j is the most probable class) and if p(y = j | x) ≥ 1 − λr/λs;
otherwise we decide to reject.
• (5 points) Describe qualitatively what happens as λr/λs is increased from 0 to 1 (i.e.,
the relative cost of rejection increases).
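As a reference for this problem, recall the standard definition of conditional risk (a textbook fact, not an extra assumption): the risk of taking action αi given x is the expected loss

R(αi | x) = Σ_{j=1}^{C} L(αi | y = j) p(y = j | x),

and the minimum-risk (Bayes) decision rule chooses the action with the smallest conditional risk.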
4 Kernelizing k-nearest neighbors (5 points)
The nearest neighbor classifier assigns a new vector x ∈ ℝ^d to the same class as that of the
nearest neighbor x^(i) in the training set D = {(x^(i), y^(i)) | x^(i) ∈ ℝ^d, y^(i) ∈ {−1, 1}}, where the
distance between two points in ℝ^d is the Euclidean distance. Re-express this classification
rule in terms of dot products and then make use of the kernel trick to formulate the
k-nearest neighbor algorithm for a general non-linear kernel function K satisfying the Mercer
condition.
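A standard identity that may help as a first step (a well-known fact, included only for reference): the squared Euclidean distance can be written entirely in terms of dot products,

‖x − x^(i)‖^2 = x · x − 2 x · x^(i) + x^(i) · x^(i).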
5 Constructing kernels (10 points)
Show that given valid kernels k1(x, x′) and k2(x, x′) defined over x, x′ ∈ ℝ^d, the following
new kernels will also be valid.
• (2 points) k(x, x′) = c k1(x, x′), where c > 0 is a constant
• (4 points) k(x, x′) = f(x) k1(x, x′) f(x′), where f(x) is any function on x ∈ ℝ^d
• (4 points) k(x, x′) = k1(x, x′) + k2(x, x′)
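One practical way to sanity-check a candidate kernel (not required by the problem, purely illustrative) is to build its Gram matrix on random points and confirm that the eigenvalues are non-negative, since a Mercer kernel must yield a symmetric positive semidefinite Gram matrix. The sketch below assumes an RBF kernel for k1 and a positive constant c; both choices are examples, not part of the assignment.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=0.5):
    # Gram matrix of k1(x, x') = exp(-gamma * ||x - x'||^2)
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))        # 20 random points in R^5

c = 3.0
K = c * rbf_kernel(X, X)            # candidate kernel k = c * k1

# A valid (Mercer) kernel gives a symmetric PSD Gram matrix.
eigvals = np.linalg.eigvalsh(K)
print("min eigenvalue:", eigvals.min())   # should be >= 0 up to round-off
```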
6 One vs all logistic regression (15 points)
In this problem, you will implement a one-vs-all (OVA) logistic classifier and apply it to
a version of the CIFAR-10 object recognition dataset. The material for this problem is in
hw3.zip and Table 1 contains the contents of the folder created by unzipping it.
Download the data

Open up a terminal window and navigate to the datasets folder inside the hw3 folder.
Run the get_datasets.sh script. On my Mac, I just type in ./get_datasets.sh at the
shell prompt. A new folder called cifar_10_batches_py will be created and it will contain
50,000 labeled images for training and 10,000 labeled images for testing. We have provided
a function to read this data in data_utils.py; it further partitions the 50,000 training images
into a train set and a validation set for the selection of hyperparameters. Each image
is a 32 × 32 array of RGB triples. It is preprocessed by subtracting the mean image from
all images. We flatten each image into a 1-dimensional array of size 3072 (i.e., 32 × 32 × 3).
Then a 1 is appended to the front of that vector to handle the intercept term. So the training
set is a numpy matrix of size 49000 × 3073, the validation set is a matrix of size 1000 × 3073,
and the set-aside test set is of size 10000 × 3073.
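For concreteness, here is a minimal sketch of the preprocessing just described (illustrative only: the actual loading code is provided in data_utils.py, and the function and array names below are assumptions, not names from the code base).

```python
import numpy as np

def preprocess(X_train, X_val, X_test):
    """Flatten images, subtract the training mean image, and add a bias column.

    Each input is assumed to be an array of raw images of shape (N, 32, 32, 3).
    """
    # Flatten each 32 x 32 x 3 image into a vector of length 3072.
    X_train = X_train.reshape(X_train.shape[0], -1).astype(np.float64)
    X_val = X_val.reshape(X_val.shape[0], -1).astype(np.float64)
    X_test = X_test.reshape(X_test.shape[0], -1).astype(np.float64)

    # Subtract the mean image computed on the training set.
    mean_image = X_train.mean(axis=0)
    X_train -= mean_image
    X_val -= mean_image
    X_test -= mean_image

    # Prepend a column of ones for the intercept term, giving width 3073.
    add_bias = lambda X: np.hstack([np.ones((X.shape[0], 1)), X])
    return add_bias(X_train), add_bias(X_val), add_bias(X_test)
```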
Problem 6.1. Implementing a one-vs-all classifier for the CIFAR-10 dataset (10 points)

In this part of the exercise, you will implement one-vs-all classification by training multiple
regularized logistic regression classifiers, one for each of the ten classes in our dataset. You
should now complete the code in one_vs_all.py to train one classifier for each class. In
particular, your code should return all the classifier parameters in a matrix Θ ∈ ℝ^(3073×K),
where each column of Θ corresponds to the learned logistic regression parameters for one
class. You can do this with a for-loop from 0 to K − 1, training each classifier independently.
When training the classifier for class k ∈ {0, . . . , K − 1}, you should build a new label for
each example x as follows: label x as 1 if x belongs to class k and 0 otherwise. You can
use sklearn's logistic regression function to learn each classifier; a minimal sketch appears
after Table 1 below.

Name                  Edit?  Read?  Description
softmax_cifar.ipynb   Yes    Yes    Python notebook to run softmax regression
                                    on the CIFAR-10 dataset
ova_cifar.ipynb       Yes    Yes    Python notebook to run OVA logistic
                                    regression on the CIFAR-10 dataset
utils.py              No     Yes    Contains functions for loading data and
                                    for visualizing CIFAR-10 data
one_vs_all.py         Yes    Yes    Class and methods defining the
                                    one-vs-all classifier
softmax.py            Yes    Yes    Class and methods defining the softmax
                                    classifier
linear_classifier.py  Yes    Yes    Class and methods for defining
                                    mini-batch gradient descent
gradient_check.py     No     Yes    Functions for checking the analytical
                                    gradient numerically
datasets              No     Yes    Directory containing the shell script to
                                    download the CIFAR-10 data
hw3.pdf               No     Yes    This document

Table 1: Contents of hw3
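Since the handout allows sklearn's logistic regression for this part, here is a minimal sketch of the training loop under that choice (the function name, the use of fit_intercept=False, and the mapping from regularization strength to C are assumptions; your real implementation belongs in one_vs_all.py).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_one_vs_all(X, y, num_classes=10, reg_strength=1.0):
    """Train one regularized logistic regression classifier per class.

    X: (m, 3073) data matrix with a leading column of ones.
    y: (m,) integer labels in {0, ..., num_classes - 1}.
    Returns Theta of shape (3073, num_classes); column k holds the parameters
    of the binary classifier for class k.
    """
    Theta = np.zeros((X.shape[1], num_classes))
    for k in range(num_classes):
        binary_labels = (y == k).astype(int)          # 1 for class k, 0 otherwise
        # fit_intercept=False because X already carries a bias column.
        clf = LogisticRegression(C=1.0 / reg_strength, fit_intercept=False)
        clf.fit(X, binary_labels)
        Theta[:, k] = clf.coef_.ravel()
    return Theta
```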
Problem 6.2: Predicting with a one-vs-all classifier (5 points)
After training your one-vs-all classifier, you can use it to predict the labels of the
set-aside test images. For each image in the test set, you should compute the probability
that it belongs to each class using the trained logistic regression classifiers. Your one-vs-all
prediction function will pick the class for which the corresponding logistic regression classifier
outputs the highest probability and return that class label as the prediction for that input
example. You should now complete the code in the predict method of one_vs_all.py. Once you
are done, our notebook ova_cifar.ipynb will make predictions on the set-aside test set
and print out the confusion matrix. The next cell in the notebook will visualize the learned
coefficients for each of the 10 one-vs-rest classifiers.
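For reference, prediction then reduces to taking, for each test image, the class whose classifier outputs the highest probability. The sketch below is illustrative; the real version belongs in the predict method of one_vs_all.py, and the function names here are mine.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_one_vs_all(Theta, X):
    """Return the highest-probability class for each row of X.

    Theta: (3073, K) matrix of per-class parameters.
    X: (m, 3073) data matrix with a leading bias column.
    """
    probs = sigmoid(X @ Theta)         # (m, K) per-class probabilities
    return np.argmax(probs, axis=1)    # index of the most confident classifier
```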
Comparing your OVA classifier with sklearn's classifier

The last two cells in the ova_cifar.ipynb notebook set up sklearn's OVA classifier and
visualize the coefficients learned by that classifier. Compare what you obtain with your
version of the OVA classifier against sklearn's implementation.
7 Softmax regression (45 points)
In this problem, you will implement a softmax classifier and apply it to a version of the
CIFAR-10 object recognition dataset. You will need to compare the performance of this
classifier with the OVA classifier and comment on which approach is better for the CIFAR-10
dataset and why. There are 10 extra credit points in this problem for experimenting with
hyperparameters and the optimization method.
Problem 7.1: Implementing the loss function for softmax regression (naive version) (5 points)
Softmax regression generalizes logistic regression to classification problems where the class
label y can take on more than two possible values. This is useful for such problems as
music genre classification and object recognition, where the goal is to distinguish between
more than two different music genres or more than two different object categories. Softmax
regression is a supervised learning algorithm, but we will later be using it in conjunction
with deep learning and unsupervised feature learning methods. Recall that we are given a
data set

D = {(x^(i), y^(i)) | 1 ≤ i ≤ m; x^(i) ∈ ℝ^(d+1); x_0^(i) = 1; y^(i) ∈ {1, . . . , K}},   K ≥ 2.

Our probabilistic model h_θ(x) is defined as

h_θ(x) = [ P(y = 1|x; θ), P(y = 2|x; θ), . . . , P(y = K|x; θ) ]ᵀ

where

P(y = k|x; θ) = exp(θ^(k)ᵀ x) / Σ_{j=1}^{K} exp(θ^(j)ᵀ x).

The parameter θ is a (d + 1) × K matrix, where each column represents the parameter vector
for class k = 1, . . . , K:

θ = [ θ^(1)  θ^(2)  · · ·  θ^(K) ].
Hint: numerical stability issues can come up in the computation of P(y = k|x; θ). Consider
K = 3 and θᵀx = [123, 456, 789]. To compute P(y = k|x; θ) from these scores, we
need to calculate exp(123), exp(456) and exp(789), and sum them. These are very large
numbers. However, we can get the same probabilities by subtracting the maximum (789)
from every element in θᵀx. Then we have the vector [−666, −333, 0], and we can calculate
exp(−666), exp(−333) and exp(0), sum them (call the sum S), and then calculate
exp(−666)/S, exp(−333)/S and exp(0)/S.
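A minimal numpy sketch of this max-subtraction trick (illustrative only; your actual implementation goes in softmax.py):

```python
import numpy as np

def stable_softmax(scores):
    """Softmax of a 1-D score vector, using the max-subtraction trick."""
    shifted = scores - np.max(scores)   # e.g. [123, 456, 789] -> [-666, -333, 0]
    exps = np.exp(shifted)              # no overflow: every entry is <= 1
    return exps / exps.sum()

print(stable_softmax(np.array([123.0, 456.0, 789.0])))
```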
The cost function J(θ) for softmax regression is derived from the negative log likelihood of
the data D, assuming that P(y|x; θ) = hθ(x) as defined above.
J(θ) = −(1/m) Σ_{i=1}^{m} Σ_{k=1}^{K} I{y^(i) = k} log [ exp(θ^(k)ᵀ x^(i)) / Σ_{j=1}^{K} exp(θ^(j)ᵀ x^(i)) ] + (λ / (2m)) Σ_{j=0}^{d} Σ_{k=1}^{K} ( θ_j^(k) )²
where I{c} is the indicator function which evaluates to 1 when c is a true statement and
to 0 otherwise. The second term is a regularization term, where λ is the regularization
strength. While it is customary to exclude the bias term in L2 regularization, we include it
here because it does not make a huge difference in the final result. You can check this for
yourself on the CIFAR-10 dataset. You should implement this loss function using for loops
for the summations in the function softmax_loss_naive in softmax.py. Once you have the
loss function implemented, a cell in the notebook softmax_cifar.ipynb will run your loss
function for a randomly initialized θ matrix with 49000 training images and labels with λ
set to 0. You should expect to see a value of about −log_e(0.1) (Why?).
Problem 7.2: Implementing the gradient of loss function for softmax regression
(naive version) (5 points)
The derivative of the loss function J(θ) with respect to θ^(k) is

∇_{θ^(k)} J(θ) = −(1/m) Σ_{i=1}^{m} [ x^(i) ( I{y^(i) = k} − P(y^(i) = k | x^(i); θ) ) ] + (λ/m) θ^(k)
Implement the analytical derivative computation in softmax_loss_naive in softmax.py.
Checking your gradient implementation with numerical gradients

In the previous problem you have implemented the analytical gradient computed using calculus.
We check your implementation of the gradient using the method of finite differences.
The functions in gradient_check.py compute the numerical gradient of a function f as
follows:

∂f(x)/∂x = ( f(x + h) − f(x − h) ) / (2h)

for a very small h. A cell in the softmax_cifar.ipynb notebook will check your gradient
against the numerically approximated gradient – you should expect to see differences between
the two gradients of the order of 10^−7 or less.

Figure 1: A sample of the CIFAR-10 dataset (classes: plane, car, bird, cat, deer, dog, frog, horse, ship, truck).
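For intuition, here is a minimal centered-difference check of the kind gradient_check.py performs, applied to a toy function (this is a sketch, not the file's actual code):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Centered finite-difference estimate of the gradient of f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + h
        fxph = f(x)                     # f(x + h) along coordinate i
        x.flat[i] = old - h
        fxmh = f(x)                     # f(x - h) along coordinate i
        x.flat[i] = old                 # restore the original value
        grad.flat[i] = (fxph - fxmh) / (2 * h)
    return grad

# Toy check: for f(x) = ||x||^2 the analytical gradient is 2x.
x = np.random.randn(5)
num = numerical_gradient(lambda v: np.sum(v ** 2), x)
print(np.max(np.abs(num - 2 * x)))      # should be around 1e-7 or smaller
```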
Problem 7.3: Implementing the loss function for softmax regression (vectorized
version) (10 points)
Now complete the function softmax_loss_vectorized in softmax.py to implement the loss
function J(θ) without using any for loops. Re-express the computation in terms of matrix
operations on X, y and θ.
Problem 7.4: Implementing the gradient of loss for softmax regression (vectorized version) (5 points)
Now vectorize the gradient computation in softmax_loss_vectorized in softmax.py. Once
you complete this, a cell in softmax_cifar.ipynb will run and time your naive and vectorized
implementations – you should expect to see at least one order of magnitude difference in run
time between the two implementations.
Problem 7.5: Implementing mini-batch gradient descent (5 points)
In large-scale applications, the training data can have millions of examples. Hence, it seems
wasteful to compute the loss function over the entire training set in order to perform only a
single parameter update. A very common approach to addressing this challenge is to compute
the gradient over batches of the training data. For example, a typical batch contains 256
examples from a training set of over 1.2 million. This batch is then used to perform a
parameter update:

θ^(k) → θ^(k) − α ∇_{θ^(k)} J(θ)

where α is the step size or learning rate for gradient descent.

Implement mini-batch gradient descent in the method train in softmax.py using the description
provided in the documentation of the method. You can set the verbose argument
of train to be True and observe how the loss function varies with iteration number.
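As a rough sketch of a single training iteration (illustrative only: the real loop, argument names, and stopping logic are given by the documentation of train in softmax.py, and the (loss, gradient) signature assumed for loss_and_grad is an assumption about your own loss function):

```python
import numpy as np

def minibatch_step(theta, X, y, loss_and_grad, learning_rate=1e-6,
                   batch_size=400, reg=1e5):
    """Perform one mini-batch gradient descent update on theta.

    loss_and_grad(theta, X_batch, y_batch, reg) is assumed to return a
    (loss, gradient) pair, with the gradient shaped like theta.
    """
    m = X.shape[0]
    # Sample a random mini-batch (with replacement, for simplicity).
    idx = np.random.choice(m, batch_size, replace=True)
    X_batch, y_batch = X[idx], y[idx]

    loss, grad = loss_and_grad(theta, X_batch, y_batch, reg)

    # Gradient descent update: theta^(k) <- theta^(k) - alpha * gradient.
    theta -= learning_rate * grad
    return theta, loss
```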
Problem 7.6: Using a validation set to select regularization lambda and learning
rate for gradient descent (5 points)
There are many hyperparameters to be selected for mini-batch gradient descent – the batch
size, the number of iterations, and the learning rate. For the loss function, we also need to
select λ, the regularization strength. In this exercise, we have pre-selected a batch size of 400
and an iteration count of 4000. Now, use the validation set provided to sweep the learning
rate and the λ parameter space, using the suggested values in softmax_cifar.ipynb, to find
the best combination of these two hyperparameters. Fill in the code for the TODO in the marked
cell in softmax_cifar.ipynb.

Figure 2: Visualization of θ for the CIFAR-10 dataset (one panel per class: plane, car, bird,
cat, deer, dog, frog, horse, ship, truck). The color scale is from red (low values) to blue
(high values).
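A sketch of the kind of sweep this asks for (the helper name train_fn and its signature are assumptions; plug in your own training call and the value grids suggested in softmax_cifar.ipynb):

```python
import numpy as np

def sweep_hyperparameters(train_fn, X_train, y_train, X_val, y_val,
                          learning_rates, regularization_strengths):
    """Grid-search the learning rate and regularization strength.

    train_fn(X, y, lr, reg) is assumed to return a (3073, K) parameter matrix
    trained with the pre-selected batch size (400) and iteration count (4000).
    """
    best_val_acc, best_combo, best_theta = -1.0, None, None
    for lr in learning_rates:
        for reg in regularization_strengths:
            theta = train_fn(X_train, y_train, lr, reg)
            val_acc = np.mean(np.argmax(X_val @ theta, axis=1) == y_val)
            if val_acc > best_val_acc:
                best_val_acc, best_combo, best_theta = val_acc, (lr, reg), theta
    return best_theta, best_combo, best_val_acc
```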
Problem 7.7: Training a softmax classifier with the best hyperparameters (5
points)
Once you find the best values of λ and the learning rate, insert code in the notebook softmax_cifar.ipynb
to train a softmax classifier on the training data with the best hyperparameters and save
this classifier in the variable best_softmax. The classifier will be evaluated on the set-aside
test set and you should expect to see an overall accuracy of over 35%.
Visualizing the learned parameter matrix
We can remove the bias term from the θ matrix and reshape each column of the matrix,
which is a parameter vector of size 3072, back into an array of size 32 × 32 × 3 and visualize
the result as an image. softmax_cifar.ipynb constructs this plot and displays it inline.
You should see a plot like the one shown in Figure 2. Compute the confusion matrix (you
can use the sklearn function; see the sketch below) on the test set for your predictor and
interpret the visualized coefficients in light of the errors made by the classifier.
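A minimal example of computing the confusion matrix with sklearn (this sketch works directly from a learned parameter matrix rather than the best_softmax object, and the function name is mine; adapt it to the notebook's variables):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def report_confusion(theta, X_test, y_test):
    """Confusion matrix of the softmax predictor on the set-aside test set.

    theta: (3073, 10) learned parameters. In the returned matrix, rows are
    true classes and columns are predicted classes.
    """
    test_pred = np.argmax(X_test @ theta, axis=1)
    return confusion_matrix(y_test, test_pred)
```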
Extra credit: Problem 7.8: Experimenting with other hyperparameters and
optimization method (10 points)
We chose a batch size of 400 and 4000 iterations for our previous experiments. Explore
larger and smaller batch sizes, choosing an appropriate number of iterations (by specifying a
tolerance on differences in values of the loss function or its gradient in successive iterations)
with the validation set. Produce plots that show the variation of test set accuracy as a
function of batch size/number of iterations. You will have to determine the right settings
for regularization strength λ and learning rate for each batch size/number of iterations
combination. What is the best batch size/number of iterations/learning rate/regularization
strength combination for this problem? What is the best test set accuracy that can be
achieved by this combination?
Problem 7.9. Comparing OVA binary logistic regression with softmax regression
(5 points)
Compare the performance results from your OVA and softmax regression classifiers. Provide
a table with the classification performance of each classifier on each CIFAR-10 category. Explain
performance differences, if any, or explain why their performance is similar, if it is. Which
approach would you recommend for the CIFAR-10 classification problem? Place answers to
these questions in writeup.pdf.