Machine Learning
Homework 3, Version 4
This homework contains 2 questions. The last question requires programming. The maximum number
of points is 100 plus 10 bonus points.
1 Question 1 – Nearest Neighbor Classifiers (40 points)
1.1 1-NN with asymmetric loss (20 points)
Suppose we want to build a binary classifier, but the cost of a false positive (predicting positive for a negative case) is much higher than the cost of a false negative (predicting negative for a positive case). One can consider an asymmetric loss function, where the cost of a false negative is 1, while the cost of a false positive is α ≥ 1.
For a point x, let η(x) be the probability that x is positive.
1.1.1 (3 points)
Show that the optimal Bayes risk for data point x is r∗(x) = min{η(x), α(1 − η(x))}.
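For illustration (numbers chosen arbitrarily): with α = 2 and η(x) = 0.3, predicting positive has expected cost α(1 − η(x)) = 1.4 while predicting negative has expected cost η(x) = 0.3, so the Bayes-optimal decision is negative and r∗(x) = 0.3.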
1.1.2 (4 points)
Let r(x) be the asymptotic risk of the 1-NN classifier for data point x. Express r(x) in terms of α and η(x).
1.1.3 (10 points)
Prove that r(x) ≤ (1 + α)r∗(x)(1 − r∗(x)). Hint: use α ≥ 1.
1.1.4 (3 points)
Let R be the asymptotic risk of the 1-NN classifier and R∗ be the Bayes risk. Prove that R ≤ (1 + α)R∗(1 − R∗).
1.2 k-NN classifier (20 points)
Consider a k-NN classifier: a point is classified as positive if at least (k + 1)/2 of its k nearest neighbors are positive.
1.2.1 (3 points)
Consider drawing k points independently from a Bernoulli distribution with two outcomes, positive or negative, where the probability of a point being positive is η. Let g(η, k) be the probability that at least (k + 1)/2 out of the k points are positive. Express the asymptotic risk r(x) for a point x in terms of η(x) and the function g(·, ·).
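To make the definition of g concrete, here is a minimal Python sketch (illustrative only, not part of the required answer) that evaluates g(η, k) as a binomial tail probability, assuming k is odd:

from math import comb

def g(eta, k):
    # Probability that at least (k+1)/2 of k independent Bernoulli(eta) draws are positive
    threshold = (k + 1) // 2
    return sum(comb(k, j) * eta**j * (1 - eta)**(k - j) for j in range(threshold, k + 1))

print(g(0.3, 5))  # probability that a 5-NN majority vote comes out positive when eta = 0.3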
1.2.2 (10 points)
Prove that r(x) = r∗(x) + (1 − 2r∗(x))g(r∗(x), k).
1.2.3 (3 points)
Using Hoeffding’s Inequality (https://en.wikipedia.org/wiki/Hoeffding_inequality),
prove that:
g(r∗(x), k) ≤ exp(−2(0.5 − r∗(x))² k)   (1)
1.2.4 (4 points)
Prove that r(x) ≤ r∗(x) + √(1/(2k)). Hint: you should use the inequality in Eq. (1) above. Note that from this result, you can see that the asymptotic risk of the k-NN classifier converges to the Bayes risk as k goes to infinity.
2 Question 2 – Implementation of Logistic Regression Classifier for k classes
(60 points + 10 bonus)
In this Question, you will implement Logistic Regression using Stochastic Gradient Descent (SGD) optimization. Suppose the training data is {(X^1, Y^1), ..., (X^n, Y^n)}, where X^i is a column vector of d dimensions and Y^i is the target label. For a column vector X, let X̄ denote [X; 1], the vector obtained by appending 1 to the end of X. θ is the set of parameters θ_1, θ_2, ..., θ_{k−1}. Logistic regression for k classes assumes the following probability function:
P(Y = i | X; θ) = exp(θ_i^T X̄) / (1 + Σ_{j=1}^{k−1} exp(θ_j^T X̄)),   for i = 1, ..., k − 1,   (2)

P(Y = k | X; θ) = 1 / (1 + Σ_{j=1}^{k−1} exp(θ_j^T X̄)).   (3)
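As an illustration of Eqs. (2) and (3), here is a minimal Python sketch (not a reference implementation) that computes the class probabilities, assuming θ is stored as a (k − 1) × (d + 1) NumPy array with one row per class 1, ..., k − 1 and X is a length-d feature vector:

import numpy as np

def class_probabilities(theta, X):
    """Return [P(Y=1|X), ..., P(Y=k-1|X), P(Y=k|X)] per Eqs. (2) and (3)."""
    X_bar = np.append(X, 1.0)                  # X_bar = [X; 1]
    scores = theta @ X_bar                     # theta_i^T X_bar for i = 1, ..., k-1
    exp_scores = np.exp(scores)                # no numerical-stability tricks in this sketch
    denom = 1.0 + exp_scores.sum()
    return np.append(exp_scores, 1.0) / denom  # last entry is P(Y = k | X)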
Logistic regression minimizes the negative average conditional log-likelihood:

L(θ) = −(1/n) Σ_{i=1}^{n} log(P(Y^i | X̄^i; θ)).   (4)
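Continuing the sketch above (and assuming the labels Y^i take values in {1, ..., k}, X is an n × d array, and Y is a length-n array), Eq. (4) could be evaluated as:

def negative_log_likelihood(theta, X, Y):
    # L(theta) from Eq. (4): average negative log-probability of the true labels
    n = X.shape[0]
    log_probs = [np.log(class_probabilities(theta, X[i])[Y[i] - 1]) for i in range(n)]
    return -sum(log_probs) / n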
To minimize this loss function, we can use gradient descent:

θ_c ← θ_c − η ∂L/∂θ_c,   (5)
where

∂L/∂θ_c = −(1/n) Σ_{i=1}^{n} ∂ log(P(Y^i | X̄^i; θ)) / ∂θ_c.   (6)
This gradient is computed by enumerating over all the training data. It turns out that this gradient can be approximated using a batch of training data. Suppose B is a subset of {1, 2, ..., n}; then

∂L/∂θ_c ≈ −(1/card(B)) Σ_{i∈B} ∂ log(P(Y^i | X̄^i; θ)) / ∂θ_c.   (7)
This leads to the following stochastic gradient descent algorithm:
Algorithm 1 Stochastic gradient descent for Logistic Regression
1: Inputs: {(X^i, Y^i)}_{i=1}^{n} (data), m (batch size), η0, η1 (step size), max_epoch, δ (stopping criterion)
2: for epoch = 1, 2, ..., max_epoch do
3:     η ← η0/(η1 + epoch)
4:     (i1, ..., in) = permute(1, ..., n)
5:     Divide (i1, ..., in) into batches of size m or m + 1
6:     for each batch B do
7:         Update θ using Eqs. (5) and (7)
8:     Break if L(θ_new) > (1 − δ)L(θ_old)   // not much progress, terminate
9: Outputs: θ.
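The following Python sketch mirrors Algorithm 1 (illustrative only, not a reference solution). It reuses class_probabilities and negative_log_likelihood from the sketches above, assumes labels in {1, ..., k}, and uses the per-example gradient derived in Section 2.1 (Eq. (8)):

import numpy as np

def sgd_logistic_regression(X, Y, k, m=16, eta0=0.1, eta1=1.0, max_epoch=1000, delta=1e-5):
    n, d = X.shape
    theta = np.zeros((k - 1, d + 1))
    old_loss = np.inf
    for epoch in range(1, max_epoch + 1):
        eta = eta0 / (eta1 + epoch)                        # step 3: step-size schedule
        order = np.random.permutation(n)                   # step 4: shuffle the indices
        for B in np.array_split(order, max(1, n // m)):    # step 5: batches of size about m
            grad = np.zeros_like(theta)
            for i in B:                                    # Eq. (7): batch-averaged gradient
                x_bar = np.append(X[i], 1.0)
                p = class_probabilities(theta, X[i])[:k - 1]
                indicator = np.zeros(k - 1)
                if Y[i] <= k - 1:
                    indicator[Y[i] - 1] = 1.0
                grad -= np.outer(indicator - p, x_bar) / len(B)   # per-example term of Eq. (8)
            theta -= eta * grad                            # step 7: update via Eq. (5)
        new_loss = negative_log_likelihood(theta, X, Y)
        if new_loss > (1 - delta) * old_loss:              # step 8: not much progress, stop
            break
        old_loss = new_loss
    return theta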
2.1 Derivation (10 points)
Prove that:
∂ log(P(Y^i | X̄^i; θ)) / ∂θ_c = (δ(c = Y^i) − P(c | X̄^i; θ)) X̄^i,   (8)
where δ(c = Y^i) is the indicator function, which takes the value 1 if the class c equals the ground-truth label Y^i, and 0 otherwise. Use Equation (8) to derive the gradient of the loss with respect to the parameters θ_1, θ_2, ..., θ_{k−1}.
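One convenient way to sanity-check a derived gradient such as Eq. (8) is to compare it against a finite-difference approximation. The sketch below is illustrative only; it reuses class_probabilities from the sketch in Section 2, and the dimensions and label are arbitrary choices:

import numpy as np

def check_gradient(d=3, k=4, eps=1e-6):
    rng = np.random.default_rng(0)
    theta = rng.normal(size=(k - 1, d + 1))
    X, Y = rng.normal(size=d), 2                      # an arbitrary example with label Y = 2
    x_bar = np.append(X, 1.0)

    # Analytic gradient of log P(Y | X_bar; theta) w.r.t. each theta_c, per Eq. (8)
    p = class_probabilities(theta, X)[:k - 1]
    indicator = np.zeros(k - 1)
    indicator[Y - 1] = 1.0
    analytic = np.outer(indicator - p, x_bar)

    # Central finite differences over every entry of theta
    numeric = np.zeros_like(theta)
    for c in range(k - 1):
        for j in range(d + 1):
            t_plus, t_minus = theta.copy(), theta.copy()
            t_plus[c, j] += eps
            t_minus[c, j] -= eps
            lp_plus = np.log(class_probabilities(t_plus, X)[Y - 1])
            lp_minus = np.log(class_probabilities(t_minus, X)[Y - 1])
            numeric[c, j] = (lp_plus - lp_minus) / (2 * eps)

    return np.max(np.abs(analytic - numeric))         # should be tiny if Eq. (8) is correct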
2.2 Crowd Image Classification
In this question of the homework, you will work with image data. We are providing you with the features
extracted from the crowd images, so you do not need to extract features from the raw images. We are also
providing you with the raw images for error analysis and visualization purposes. Your task is to classify an
image into 4 different categories. The data has already been split into train, validation, and test sets.
Dataset details (obtain the data from the Kaggle competition page):
– train set (4000 images)
– val set (2000 images)
– test set (2000 images)
2.3 Implement Logistic Regression with SGD (50 points + 10 bonus)
Your task is to implement Logistic Regression with k = 4 classes using SGD. You should use Python or
Matlab for your implementation.
1. (15 points) Run your implementation on the provided training data with max_epoch = 1000, m = 16, η0 = 0.1, η1 = 1, δ = 0.00001.
(a) Report the number of epochs that your algorithm takes before exiting.
(b) Plot the curve showing L(θ) as a function of epoch.
(c) What is the final value of L(θ) after the optimization?
2. (10 points) Keep m = 16 and δ = 0.00001, and experiment with different values of η0 and η1. Can you find a pair of parameters (η0, η1) that leads to faster convergence?
(a) Report the values of (η0, η1). How many epochs does it take? What is the final value of L(θ)?
(b) Plot the curve showing L(θ) as a function of epoch.
3. (10 points) Evaluate the performance on validation data
(a) Plot L(θ) as a function of epoch. On the same plot, show two curves, one for training and one
for validation data.
(b) Plot the accuracy as a function of epoch. On the same plot, show two curves, one for training
and one for validation data.
4. (5 points) With the learned classifier:
(a) Report the confusion matrices on the validation and the training data.
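For the confusion matrices in item 4(a), one possible tabulation is sketched below, assuming the true and predicted labels are arrays with values in {1, ..., 4}; rows index the true class and columns the predicted class:

import numpy as np

def confusion_matrix(y_true, y_pred, k=4):
    cm = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t - 1, p - 1] += 1      # row = true class, column = predicted class
    return cm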
2.4 Submit the result in Kaggle (10 points + 10 bonus)
1. (10 points) Run your classifier on the test data and submit the result file on Kaggle (https://www.
kaggle.com/c/cse512hw3). Report the best accuracy you obtain on the test data, as returned by
the Kaggle website.
2. (10 bonus points) Optimize your classifier and try feature engineering to compete for the leading positions; the top 3 positions will receive the bonus points! You can make use of concepts taught in class, such as regularization and feature normalization, to achieve better performance. You can use the validation data to tune your classifier. Once you have identified the best combination of hyper-parameters, you can train your classifier on the validation data together with the training data to generate the test predictions for your Kaggle submission. You can be creative with the construction of feature vectors, but you cannot use deep learning (CNN) features of the images.
NOTE: Make sure the names displayed on the leader-board (your Kaggle user-names) can be used by
the graders to uniquely identify you (SBU ID is recommended).
3 What to submit?
3.1 Blackboard submission
You will need to submit both your code and your answers to questions on Blackboard. Put the answer file and your code in a folder named SUBID_FirstName_LastName (e.g., 10947XXXX_lionel_messi). Zip this folder and submit the zip file on Blackboard. Your submission must be a zip file, i.e., SUBID_FirstName_LastName.zip. The answer file should be named answers.pdf. The first page of answers.pdf should be the filled cover page at the end of this homework. The remainder of the answer file should contain:
1. Answers to Questions 1.1 and 1.2.
2. Answers to Question 2.1 and the requested plots, numbers, and confusion matrices for 2.3 and 2.4.
3. For 2.3, also submit separate Python/Matlab code for each of the 4 sub-parts.
4. For 2.4, report the accuracy from the Kaggle website.
3.2 Prediction submission
For Question 2.4, you must submit a .csv file to get the accuracy through Kaggle. A submission file should
contain two columns: Id and Class. The file should contain a header and have the following format:
Id, Class
1, 1
2, 10
... ...
You can only make 3 submissions per day. In the end, your top-2 entries (you can manually choose them) will be evaluated on a held-out portion of the test set. The final rankings will be the ones obtained on this private leader-board.
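A submission file in the format above could be written along these lines (a sketch assuming test_predictions is a list of predicted class labels ordered by image Id, starting from Id 1):

import csv

def write_submission(test_predictions, path="submission.csv"):
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Id", "Class"])                 # header row required by the format
        for idx, label in enumerate(test_predictions, start=1):
            writer.writerow([idx, label])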