Machine Learning
Homework #2
Reminder: While you are encouraged to discuss problems in a small group (up to 5 people), you should
write your own solutions and source code independently. In case you worked in a group for the homework,
please include the names of your collaborators in the homework submission. Your answers should be as concise and clear as possible. Please address all questions to http://piazza.com/class#winter2020/eecs545
with a reference to the specific question in the subject line (e.g., RE: Homework 1, Q2(c)).
Submission Instruction: For homework 2, you should submit both your solution and your source code. We may
inspect your source code submission visually and run the code to check the results. Please be aware that
points may be deducted if you don't follow the instructions listed below:
• Submit solution to Gradescope
• Solution should contain your answers to all questions (typed or hand-written).
• Submit source code to Canvas
• Source code should contain your code for the programming questions.
• Source code should be 1 zip file named 'HW2_yourfirstname_yourlastname.zip'
• The zip file should contain 1 ipynb file per question, named 'qx.ipynb' (x
is the question number). We do not allow .py files for HW2.
• The zip file should also contain the data folder we provide. Please preserve the directory
structure we set up.
In summary, your source code submission for HW2 should be 1 zip file which includes the 3 ipynb files and the 3
data folders we provided: q1.ipynb, q2.ipynb, q4.ipynb, q1_data, q2_data, and q4_data, as the programming
questions in HW2 are q1(a,b,d), q2(b), and q4(a,b,c).
Source Code Instruction: Your source code should run under an environment with the following libraries:
• Python 3.6+
• Numpy (for implementations of algorithms)
• Matplotlib (for plots)
Please do not use any other library unless otherwise instructed. You should be able to load the data and
execute your code on someone else's computer with the environment described above. In other words, work
with the setup we provide and do not change the directory structure. Use relative paths instead of absolute
paths to load the data. Note that the outputs of your source code must match your solution.
Credits
Stanford CS229 and Bishop PRML.
1 [22+4 points] k-Nearest Neighbors (kNN)
Given a labeled dataset $D = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(N)}, y^{(N)})\}$, a kNN algorithm predicts the class
$y^{(\text{test})}$ of an unseen test example $x^{(\text{test})}$ by:

1. Finding the $k$ closest examples to $x^{(\text{test})}$ from the dataset, i.e. $\text{kNN}(x^{(\text{test})}) = \{(x^{(1)'}, y^{(1)'}), (x^{(2)'}, y^{(2)'}), \ldots, (x^{(k)'}, y^{(k)'})\}$, such that $x^{(1)'}, x^{(2)'}, \ldots, x^{(k)'}$ are the best $k$ points among all $x^{(i)}$ in the training dataset at minimizing $d(x^{(i)'}, x^{(\text{test})})$, where $d(x^{(i)}, x^{(j)})$ is a distance measure between $x^{(i)}$ and $x^{(j)}$.
2. The predicted class label $y^{(\text{test})}$ is the result of a majority vote (break ties either arbitrarily or randomly):

$$y^{(\text{test})} = \arg\max_{y} \sum_{(x', y') \in \text{kNN}(x^{(\text{test})})} \mathbb{1}[y' = y]$$
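To make the two steps concrete, here is a minimal NumPy sketch of a kNN prediction for a single query point. The array names (X_train, y_train, x_test) and the use of Euclidean distance are assumptions for illustration only; they are not part of the provided starter code.

import numpy as np

def knn_predict(X_train, y_train, x_test, k=8):
    """Predict the label of one query point by majority vote over its k nearest neighbors."""
    # Step 1: Euclidean distance from the query to every training example.
    dists = np.linalg.norm(X_train - x_test, axis=1)
    nn_idx = np.argsort(dists)[:k]              # indices of the k closest training points
    # Step 2: majority vote over the neighbors' labels (np.argmax breaks ties arbitrarily).
    votes = np.bincount(y_train[nn_idx])
    return nn_idx, int(np.argmax(votes))

# Example usage with random stand-in data (shapes mimic flattened 28x28 digits):
rng = np.random.default_rng(0)
X_train = rng.random((100, 784))                # (N, 784) training images
y_train = rng.integers(0, 10, size=100)         # (N,) labels in {0, ..., 9}
neighbors, label = knn_predict(X_train, y_train, rng.random(784), k=8)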
Here, you will work with a small portion of the MNIST hand-written digits dataset. Given a test example
which is an image of a digit, your end goal is to assign it to the correct class among the 10 classes of MNIST
(0 to 9). Load the file q1_digits.npz, which is a dictionary that contains the keys digits_train, labels_train,
digits_test, and labels_test. We provide q1.ipynb, a starter code for data loading and visualization. All
the implementations of your algorithms should be done in q1.ipynb.
(a) [11 points] For the first 5 examples in digits_test, find the 8 nearest neighbors. Assume that
$d(x^{(i)}, x^{(j)})$ is the Euclidean distance between the two examples $x^{(i)}$ and $x^{(j)}$ (calculated on the 28
by 28 = 784 pixels). For each of the 5 test digits, write down the indices of the 8 nearest neighbors and
their class labels (the index starts from 0). In addition, submit the images of each of the 5 test digits and
their 8 nearest neighbors. Implement your code in q1.ipynb.
(b) [8 points] Classify all the test images (1000 examples) into the 10 classes using k = 10. You are only
required to report the classification accuracy in your solution. Implement your code in q1.ipynb.
(c) [3 points] Based on your implementation of kNN, what are the advantages and disadvantages of kNN? [hint:
what happens if we have over 100,000 training examples?] How does the value of k affect classification
accuracy? [hint: run your code using many different values of k and observe how the accuracy changes.]
(d) [extra 4 points] What are some ways to improve the accuracy of this kNN classifier? For full credit,
implement your ideas in q1.ipynb and report your improved accuracy.
[hint: One thing you could do is come up with a different/better distance function. It could also help to
view the examples that are incorrectly classified.]
2 [27 points] Softmax Regression via Gradient Ascent
Gradient Ascent is an algorithm used to find parameters that maximize a certain expression (contrary to
Gradient Descent, which is used to minimize an expression). For some function f(w), Gradient Ascent finds
$w^* = \arg\max_w f(w)$ according to the following pseudo-code:

Algorithm 1 Gradient Ascent
  $w^* \leftarrow$ random
  repeat
    $w^* \leftarrow w^* + \alpha \nabla_w f(w^*)$
  until convergence
  return $w^*$
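As a quick illustration of this pseudo-code, the sketch below runs gradient ascent on a simple concave function whose maximizer is known in closed form; the function, learning rate, and stopping tolerance are all choices made only for this example.

import numpy as np

# Maximize f(w) = -||w - b||^2, whose gradient is grad_w f(w) = -2 (w - b) and whose maximizer is b.
b = np.array([1.0, -2.0])
grad_f = lambda w: -2.0 * (w - b)

w = np.random.randn(2)                  # w* <- random
alpha = 0.1                             # learning rate
for _ in range(10000):                  # "repeat ... until convergence"
    step = alpha * grad_f(w)
    w = w + step                        # w* <- w* + alpha * grad_w f(w*)
    if np.linalg.norm(step) < 1e-10:    # convergence test: the update is negligibly small
        break
# w is now (numerically) equal to the true maximizer b.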
Softmax regression is a multiclass classification algorithm. Given a labeled dataset $D = \{(x^{(1)}, y^{(1)}),
(x^{(2)}, y^{(2)}), \ldots, (x^{(N)}, y^{(N)})\}$, where $y^{(i)} \in \{1, 2, \ldots, K\}$ (total $K$ classes), Softmax regression computes the
probability that an example $x$ belongs to a class $k$:

$$p(y = k \mid x, w) = \frac{\exp(w_k^T \phi(x))}{\sum_{j=1}^{K} \exp(w_j^T \phi(x))}$$

where $w_k$ is a weight vector for class $k$. The above expression is over-parametrized, meaning that there
is more than one unique $\{w_1, w_2, \ldots, w_K\}$ that gives identical probability measures for $p(y = k \mid x, w)$. A
unique solution can be obtained using only $K-1$ weight vectors $w = \{w_1, w_2, \ldots, w_{K-1}\}$ and fixing $w_K = 0$:

$$p(y = k \mid x, w) = \frac{\exp(w_k^T \phi(x))}{1 + \sum_{j=1}^{K-1} \exp(w_j^T \phi(x))}; \quad \forall k \in \{1, 2, \ldots, K-1\}$$

$$p(y = K \mid x, w) = \frac{1}{1 + \sum_{j=1}^{K-1} \exp(w_j^T \phi(x))}$$
We define the likelihood of the $i$th training example $p(y^{(i)} \mid x^{(i)}, w)$ as:

$$p(y^{(i)} \mid x^{(i)}, w) = \prod_{k=1}^{K} \left[ p(y^{(i)} = k \mid x^{(i)}, w) \right]^{\mathbb{I}(y^{(i)} = k)}$$

where $\mathbb{I}(\cdot)$ is the indicator function. We define the likelihood as:

$$L(w) = \prod_{i=1}^{N} p(y^{(i)} \mid x^{(i)}, w) = \prod_{i=1}^{N} \prod_{k=1}^{K} \left[ p(y^{(i)} = k \mid x^{(i)}, w) \right]^{\mathbb{I}(y^{(i)} = k)}$$

Finally, we define the log-likelihood as:

$$\ell(w) = \log L(w) = \sum_{i=1}^{N} \sum_{k=1}^{K} \log \left[ p(y^{(i)} = k \mid x^{(i)}, w) \right]^{\mathbb{I}(y^{(i)} = k)}$$
(a) [13 points] Derive the gradient ascent update rule for the log-likelihood of the training data. In other
words, derive the expression $\nabla_{w_m} \ell(w)$ for $m = 1, \ldots, K-1$. Show that:

$$\nabla_{w_m} \ell(w) = \sum_{i=1}^{N} \phi(x^{(i)}) \left[ \mathbb{I}(y^{(i)} = m) - \frac{\exp(w_m^T \phi(x^{(i)}))}{1 + \sum_{j=1}^{K-1} \exp(w_j^T \phi(x^{(i)}))} \right]$$

$$= \sum_{i=1}^{N} \phi(x^{(i)}) \left[ \mathbb{I}(y^{(i)} = m) - p(y^{(i)} = m \mid x^{(i)}, w) \right]$$

[Hints: $\log a^b = b \log(a)$. Further, it helps to consider the two cases separately: one case for $y^{(i)} = k = m$, and another for $y^{(i)} = k \neq m$, which is equivalent to using the Kronecker delta $\delta_{km}$.]
(b) [14 points] Using the gradient computed in part (a), implement Gradient Ascent for Softmax Regression
in q2.ipynb. Use a learning rate $\alpha = 0.0005$. Load q2_data.npz, which is a dictionary that contains
q2x_train, q2y_train, q2x_test, and q2y_test. Train your classifier on the training data and report
the accuracy on the test data in your solution. Implement your code in q2.ipynb. Recall that Softmax
Regression classifies an example $x$ as:

$$y = \arg\max_{y'} p(y' \mid x, w)$$

While you must implement your own Softmax Regression from scratch, you can use the logistic regression
function from sklearn (sklearn.linear_model.LogisticRegression) to validate your results. Your
results should not be much less than the accuracy of the predictions from sklearn's LogisticRegression.
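A minimal pattern for getting the reference accuracy from sklearn is shown below; the placeholder arrays stand in for the contents of q2_data.npz and should be replaced by the real data.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data with plausible shapes; substitute the arrays loaded from q2_data.npz.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(120, 2)), rng.integers(0, 3, size=120)
X_test, y_test = rng.normal(size=(60, 2)), rng.integers(0, 3, size=60)

clf = LogisticRegression(max_iter=1000)      # handles multiclass problems with the default solver
clf.fit(X_train, y_train)
print("sklearn reference accuracy:", clf.score(X_test, y_test))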
3 [22 points] Gaussian Discriminant Analysis
Suppose we are given a dataset $\{(x^{(i)}, y^{(i)}); i = 1, \ldots, N\}$ consisting of $N$ independent examples, where
$x^{(i)} \in \mathbb{R}^M$ are $M$-dimensional vectors, and $y^{(i)} \in \{0, 1\}$. We will model the joint distribution of $(x^{(i)}, y^{(i)})$
using:

$$p(y^{(i)}) = \phi^{y^{(i)}} (1 - \phi)^{1 - y^{(i)}}$$

$$p(x^{(i)} \mid y^{(i)} = 0) = \frac{1}{(2\pi)^{M/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x^{(i)} - \mu_0)^T \Sigma^{-1} (x^{(i)} - \mu_0)\right)$$

$$p(x^{(i)} \mid y^{(i)} = 1) = \frac{1}{(2\pi)^{M/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x^{(i)} - \mu_1)^T \Sigma^{-1} (x^{(i)} - \mu_1)\right)$$
Here, the parameters of our model are φ, Σ, µ0, and µ1. (Note that while there are two different mean
vectors µ0 and µ1, there is only one covariance matrix Σ.)
(a) [8 points] Suppose we have already fit $\phi$, $\Sigma$, $\mu_0$, and $\mu_1$, and now want to make a prediction at some
new query point $x$. Show that the posterior distribution of the label ($y$) at $x$ takes the form of a logistic
function, and can be written as

$$p(y = 1 \mid x; \phi, \Sigma, \mu_0, \mu_1) = \frac{1}{1 + \exp(-w^T x)}$$

where $w$ is a function of $\phi$, $\Sigma$, $\mu_0$, and $\mu_1$. [Hint: For part (a) only, you may want to redefine $x$ to be an
$(M+1)$-dimensional vector by adding an extra coordinate $x_0 = 1$.] [Hint 2: $w^T x$ is a scalar.]
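A natural starting point (not a full solution) is Bayes' rule applied to the class-conditional densities defined above:

$$p(y = 1 \mid x) = \frac{p(x \mid y = 1)\, p(y = 1)}{p(x \mid y = 1)\, p(y = 1) + p(x \mid y = 0)\, p(y = 0)} = \frac{1}{1 + \dfrac{p(x \mid y = 0)\,(1 - \phi)}{p(x \mid y = 1)\,\phi}}$$

What remains is to show that the ratio in the denominator simplifies to $\exp(-w^T x)$ for a suitable $w$ built from $\phi$, $\Sigma$, $\mu_0$, and $\mu_1$ (after absorbing the constant term via the extra coordinate $x_0 = 1$).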
(b) [8 points] When $M$ (the dimension of $x^{(i)}$) is 1, then $\Sigma = [\sigma^2]$ becomes a scalar, and its determinant
is $|\Sigma| = \sigma^2$. Accordingly, the maximum likelihood estimates of the parameters are given by

$$\phi = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\{y^{(i)} = 1\}$$

$$\mu_0 = \frac{\sum_{i=1}^{N} \mathbb{1}\{y^{(i)} = 0\}\, x^{(i)}}{\sum_{i=1}^{N} \mathbb{1}\{y^{(i)} = 0\}}$$

$$\mu_1 = \frac{\sum_{i=1}^{N} \mathbb{1}\{y^{(i)} = 1\}\, x^{(i)}}{\sum_{i=1}^{N} \mathbb{1}\{y^{(i)} = 1\}}$$

$$\Sigma = \frac{1}{N} \sum_{i=1}^{N} (x^{(i)} - \mu_{y^{(i)}})(x^{(i)} - \mu_{y^{(i)}})^T.$$
The log-likelihood of the data is

$$\ell(\phi, \mu_0, \mu_1, \Sigma) = \log \prod_{i=1}^{N} p(x^{(i)}, y^{(i)}; \phi, \mu_0, \mu_1, \Sigma) = \log \prod_{i=1}^{N} p(x^{(i)} \mid y^{(i)}; \phi, \mu_0, \mu_1, \Sigma)\, p(y^{(i)}; \phi)$$

By maximizing $\ell$ with respect to the four parameters, prove that the maximum likelihood estimates of
$\phi$, $\Sigma$, $\mu_0$, and $\mu_1$ are indeed the ones described above. (You may assume that there is at least one positive
and one negative example, so that the denominators in the definitions of $\mu_0$ and $\mu_1$ above are non-zero.)
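Although question 3 is a written question, the estimates in part (b) translate directly into NumPy; the sketch below (with synthetic X of shape (N, M) and binary labels y, both invented here) is included only to make the formulas concrete.

import numpy as np

rng = np.random.default_rng(0)
N, M = 200, 3
y = rng.integers(0, 2, size=N)                   # labels in {0, 1}
X = rng.normal(size=(N, M)) + y[:, None]         # synthetic features; class 1 is shifted by +1

phi = np.mean(y == 1)                            # fraction of positive examples
mu0 = X[y == 0].mean(axis=0)                     # mean of the y = 0 examples
mu1 = X[y == 1].mean(axis=0)                     # mean of the y = 1 examples

# Shared covariance: average outer product of each example's deviation from its own class mean.
mu_y = np.where((y == 1)[:, None], mu1, mu0)     # (N, M) matrix whose i-th row is mu_{y^(i)}
diff = X - mu_y
Sigma = diff.T @ diff / N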
(c) [6 points] Redo part (b) for $M \neq 1$, excluding the maximum likelihood estimate for $\Sigma$.
4 [29 points] Naive Bayes for classifying SPAM
In this problem, we will use the naive Bayes algorithm to build a SPAM classifier.
In recent years, email spam has been an increasing problem. Here, we’ll build a classifier to distinguish
between “real (non-spam)” emails, and spam emails. For this problem, we obtained a set of spam emails,
and a set of genuine newsgroup messages. Using only the subject line and body of each message, the classifier
will learn to distinguish between the two groups of emails.
In our data, the text emails have been pre-processed so that they can be used for naive Bayes. The pre-processing ensures that only the email body and subject remain in the dataset; email addresses (EMAILADDR),
web addresses (HTTPADDR), currency (DOLLAR), and numbers (NUMBER) were also replaced by special tokens so that they can be considered properly in the classification process. (In this problem, we're going
to call the features "tokens".) For those interested in the pre-processing, two examples of spam emails
with their pre-processed forms and one example of a non-spam email with its pre-processed form are in the
folder samples_FYI.
We have done the feature extraction work for you, so you can just load the data matrices (called
document-term matrices in text classification) which contain all the data. In a document-term matrix, the
$i$-th row represents the $i$-th document/email, and the $j$-th column represents the $j$-th distinct token. Thus, the
$(i, j)$ entry of this matrix represents the number of occurrences of the $j$-th token in the $i$-th document.
For this problem, we chose the set of tokens (vocabulary) to only contain medium-frequency tokens,
as the tokens that occur too often or too rarely do not have much classification value. (For example, tokens
that occur very often are terms like "the," "and," and "of," which occur in almost any email, spam or non-spam.)
Also, terms were stemmed using a standard stemming algorithm; basically, this means that "price," "prices"
and "priced" have all been replaced with "price," so that they can be treated as the same token. For a list
of the tokens used, see the file TOKENS_LIST.txt in the samples_FYI folder.
Since the document-term matrix is sparse (has lots of zero entries), we store it in an efficient format to
save space. We provide a starter code q4.ipynb which contains the readMatrix function that reads in the
document-term matrix, the correct class labels for all emails, and the full list of tokens.
(a) [12 points] Implement a naive Bayes classifier for spam classification, using the multinomial event
model and Laplace smoothing. You should use functions provided in q4.ipynb to train your parameters
with the training dataset MATRIX.TRAIN. Then, use these parameters to classify the test dataset
MATRIX.TEST and report the error using the evaluate function. Implement your code in q4.ipynb.
Remark. If you implement naive Bayes in the straightforward way, you'll note that the computed
$p(x \mid y) = \prod_j p(x_j \mid y)$ often equals zero. This is because $p(x \mid y)$, which is the product of many numbers
less than one, is a very small number. The standard computer representation of real numbers cannot
handle numbers that are too small, and instead rounds them off to zero. You'll have to find a way to
compute naive Bayes' predicted class labels without explicitly representing very small numbers such as
$p(x \mid y)$. [Hint: Think about using logarithms.]
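To make the hint concrete, here is a minimal sketch of the multinomial event model with add-one (Laplace) smoothing that scores documents entirely in log space. The names train_matrix, train_labels, and x are placeholders, and this outline is not a substitute for the structure expected in q4.ipynb.

import numpy as np

def fit_multinomial_nb(train_matrix, train_labels):
    """train_matrix: (num_docs, vocab_size) token counts; train_labels: 0 = non-spam, 1 = spam."""
    log_prior = np.log(np.array([np.mean(train_labels == 0), np.mean(train_labels == 1)]))
    vocab_size = train_matrix.shape[1]
    log_phi = np.zeros((2, vocab_size))
    for c in (0, 1):
        counts = train_matrix[train_labels == c].sum(axis=0)            # per-token counts in class c
        log_phi[c] = np.log((counts + 1) / (counts.sum() + vocab_size)) # Laplace smoothing
    return log_prior, log_phi

def predict(log_prior, log_phi, x):
    """x: (vocab_size,) token counts of one document; all arithmetic stays in log space."""
    scores = log_prior + log_phi @ x      # log p(y) + sum_j x_j * log p(token_j | y), for each class
    return int(np.argmax(scores))

# Tiny synthetic example: 4 documents over a 3-word vocabulary.
train_matrix = np.array([[3, 0, 1], [2, 1, 0], [0, 4, 1], [1, 3, 2]])
train_labels = np.array([0, 0, 1, 1])
log_prior, log_phi = fit_multinomial_nb(train_matrix, train_labels)
print(predict(log_prior, log_phi, np.array([0, 5, 1])))    # predicts class 1 for this toy document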
(b) [8 points] Some tokens may be particularly indicative of an email being spam or non-spam. One way
to measure how indicative a token $i$ is of the SPAM class is to look at:

$$\log\left(\frac{p(x_j = i \mid y = 1)}{p(x_j = i \mid y = 0)}\right) = \log\left(\frac{P(\text{token}_i \mid \text{email is SPAM})}{P(\text{token}_i \mid \text{email is NOTSPAM})}\right)$$

Using the parameters you obtained in part (a), find the 5 tokens that are most indicative of the SPAM
class (i.e., the 5 tokens that have the highest positive value of the measure above). Implement your code in
q4.ipynb.
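Given the smoothed log conditional probabilities from part (a), the measure above is just a per-token difference of logs, and the top tokens fall out of an argsort. In the sketch below, log_phi is a placeholder with the same shape as in the previous sketch (row 0 for non-spam, row 1 for spam).

import numpy as np

# Placeholder (2, vocab_size) array of log p(token | class); use the values estimated in part (a).
rng = np.random.default_rng(0)
log_phi = np.log(rng.dirichlet(np.ones(50), size=2))

log_ratio = log_phi[1] - log_phi[0]            # log p(token | spam) - log p(token | non-spam)
top5 = np.argsort(log_ratio)[::-1][:5]         # indices of the 5 most spam-indicative tokens
# Map these indices to strings using the token list returned by the starter code's readMatrix.
print(top5)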
(c) [9 points] Repeat part (a), but with different naive Bayes classifiers, each trained on a different training
set MATRIX.TRAIN.*. Evaluate the classifiers with MATRIX.TEST and report the classification
error for each classifier. Plot a graph of the test error with respect to the size of the training set. Which training-set size gives you the best classification error? Implement your code in q4.ipynb.
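A minimal matplotlib pattern for the requested plot is shown below; the training-set sizes and error values are placeholders to be replaced by the numbers measured from MATRIX.TRAIN.* and the evaluate function.

import matplotlib.pyplot as plt

# Placeholder results; replace with the actual training-set sizes and test errors you measure.
train_sizes = [50, 100, 200, 400, 800, 1400]
test_errors = [0.05, 0.04, 0.03, 0.02, 0.018, 0.016]

plt.plot(train_sizes, test_errors, marker='o')
plt.xlabel('Training-set size (number of documents)')
plt.ylabel('Test error')
plt.title('Naive Bayes test error vs. training-set size')
plt.show()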