Homework Assignment 6
COGS 118A: Supervised Machine Learning Algorithms
Instructions: Combine your answer to the questions below with your notebook to
create a PDF file; submit your file via Gradescope.
1 (8 points) Conceptual Questions
1. Which one below best describes what support vectors are:
A. The decision boundary.
B. The positive and the negative planes.
C. The training samples that determine the positive and the negative planes.
D. The test samples that determine the positive and the negative planes.
2. When tuning the hyper-parameters of a model, we find the hyper-parameters
that
A. minimize error on the training set
B. minimize error on the test set
C. maximize the margin of the classifier
D. minimize error on the validation set
3. Select all that apply. k-fold cross validation is
A. Not necessary when you have huge amounts of labelled data
B. A way to do model selection while minimizing over-fitting
C. The only way to do hyper-parameter tuning
D. Subject to the bias-variance trade-off
4. When you increase the k in cross-validation while keeping the dataset the same,
you
A. Get a decrease in both the generalization error and the validation error
B. Get an increase in both the generalization error and the validation error
C. Get a decrease in the variability of the validation error across folds
D. Get an increase in the variability of the validation error across folds
2 (10 points) Cross-Validation
Given a training dataset S_training = {(xi, yi), i = 1, . . . , 6} where xi ∈ R is the feature
scalar and yi ∈ {−1, +1} is the corresponding label. The data points in the dataset
S_training are given below:
(x1, y1) = (2, −1), (x2, y2) = (7, −1), (x3, y3) = (4, +1),
(x4, y4) = (1, −1), (x5, y5) = (3, +1), (x6, y6) = (6, +1).
Suppose you are training a linear classifier f(x; a, b) = sign(ax + b) with 2-fold cross-validation, where sign(z) is defined as:

sign(z) = 1 if z ≥ 0, and −1 if z < 0.
• You have split the dataset S_training into:

S1 = {(x1, y1), (x2, y2), (x3, y3)}
S2 = {(x4, y4), (x5, y5), (x6, y6)}
• After training the classifier f(x; a, b) on S1, you have obtained the parameters
a1 = −1, b1 = 5 and then try to validate the classifier on S2.
• After training the classifier f(x; a, b) on S2, you have obtained the parameters
a2 = 2, b2 = −3 and then try to validate the classifier on S1.
Please finish the tasks below:
1. Calculate the average training error in the 2-fold cross-validation.
Note: The definition of average training error is the mean of the error of
classifier f(x; a1, b1) on S1 and the error of classifier f(x; a2, b2) on S2.
2. Calculate the average validation error (i.e. the cross-validation error) in the
2-fold cross-validation.
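For reference, a minimal NumPy sketch of how such fold errors could be sanity-checked numerically; the helper name fold_error is made up, and the arrays simply restate the folds and fitted parameters listed above:

    import numpy as np

    def fold_error(a, b, X, y):
        """Fraction of points in (X, y) misclassified by f(x; a, b) = sign(a*x + b),
        with sign(0) treated as +1 as defined above."""
        preds = np.where(a * X + b >= 0, 1, -1)
        return np.mean(preds != y)

    # Folds and fitted parameters from the problem statement.
    X1, y1 = np.array([2, 7, 4]), np.array([-1, -1, 1])   # S1
    X2, y2 = np.array([1, 3, 6]), np.array([-1, 1, 1])    # S2
    a1, b1 = -1, 5   # trained on S1
    a2, b2 = 2, -3   # trained on S2

    avg_train_err = (fold_error(a1, b1, X1, y1) + fold_error(a2, b2, X2, y2)) / 2
    avg_val_err   = (fold_error(a1, b1, X2, y2) + fold_error(a2, b2, X1, y1)) / 2
    print(avg_train_err, avg_val_err)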
3 (12 points) Shattering
In this problem, consider a classifier f(x; a, b) = sign(a x^T x − b) where the feature
vector x = [x1, x2]^T ∈ R^2 and its prediction f(x; a, b) ∈ {−1, +1}. Besides, a, b ∈ R
are the model parameters and sign(z) is defined as:

sign(z) = 1 if z ≥ 0, and −1 if z < 0.
The classifier f(x; a, b) performs binary classification on an input feature vector x
under model parameters a, b ∈ R that can be learned.
In this question, please determine whether the classifier f(x; a, b) can shatter the
two points x1 = [1, 1]^T and x2 = [−2, −3]^T.
• If you think f(x; a, b) can shatter x1 and x2, you need to show that for each
possible label configuration (y1, y2) ∈ {(+1, +1), (−1, −1), (+1, −1), (−1, +1)}
of x1 and x2, there exists a classifier f(x; a, b) that classifies x1 and x2
correctly, and you should illustrate each classifier by the following steps:
– Draw the decision boundary of the classifier.
– Shade the area where the classifier makes the positive prediction.
You can use the figures of the coordinate system on the next page to show your
drawings.
• If you think the classifier f(x; a, b) cannot shatter x1 and x2, please explain the
reason.
[Four empty coordinate grids, one per label configuration: (y1, y2) = (+1, +1), (−1, −1), (+1, −1), (−1, +1).]
4 (10 points) SVM: Gradient
Given a training dataset S_training = {(xi, yi), i = 1, . . . , n}, we wish to optimize the
loss L(w, b) of a linear SVM classifier:

L(w, b) = (1/2) ||w||_2^2 + C Σ_{i=1}^{n} [1 − yi (w^T xi + b)]_+    (1)

where (z)_+ = max(0, z) is called the rectifier function and C is a scalar constant.
The optimal weight vector w* and the bias b* used to build the SVM classifier are
defined as follows:

(w*, b*) = arg min_{w,b} L(w, b)    (2)
In this problem, we attempt to obtain the optimal parameters w* and b* by using a
standard gradient descent algorithm.
Hint: To derive the derivative of L(w, b), please consider two cases:
(a) 1 − yi (w^T xi + b) ≥ 0,   (b) 1 − yi (w^T xi + b) < 0
1. Derive the derivative ∂L(w, b)/∂w.
2. Derive the derivative ∂L(w, b)/∂b.
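Once you have derived the two derivatives, one way to sanity-check them is to compare your expressions against finite differences of L(w, b). A minimal NumPy sketch; the toy data, step size, and value of C below are arbitrary assumptions for illustration:

    import numpy as np

    def svm_loss(w, b, X, y, C):
        """L(w, b) = 0.5 * ||w||_2^2 + C * sum_i max(0, 1 - y_i (w^T x_i + b))."""
        margins = 1 - y * (X @ w + b)
        return 0.5 * np.dot(w, w) + C * np.sum(np.maximum(0, margins))

    # Toy data (assumed, for illustration only).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 2))
    y = np.array([1, -1, 1, -1, 1])
    w, b, C = rng.normal(size=2), 0.1, 1.0

    # Finite-difference approximations of dL/dw and dL/db.
    eps = 1e-6
    grad_w = np.zeros_like(w)
    for j in range(len(w)):
        e = np.zeros_like(w)
        e[j] = eps
        grad_w[j] = (svm_loss(w + e, b, X, y, C) - svm_loss(w - e, b, X, y, C)) / (2 * eps)
    grad_b = (svm_loss(w, b + eps, X, y, C) - svm_loss(w, b - eps, X, y, C)) / (2 * eps)
    print(grad_w, grad_b)  # compare with your hand-derived expressions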
5 (10 points) SVM: Margin
As shown in Figure 3, we have the decision boundary (marked as a black line)
defined as Eq. 3 given w, b:

w^T x + b = 0    (3)
In parallel to the decision boundary, we have the positive plane (marked as a red line)
defined as Eq. 4 and the negative plane (marked as a blue line) defined as Eq. 5:

w^T x + b = +1    (4)
w^T x + b = −1    (5)
We pick an arbitrary point x^− on the negative plane, and draw a purple line that
passes through x^− and is perpendicular to the negative plane. The intersection between this
purple line and the positive plane can be denoted as x^+. Thus, we have the following
Eq. 6 that indicates the relation between x^+ and x^−:

x^+ = x^− + λw    (6)
where λ ∈ R is an undetermined scalar. The margin M is defined as the distance
between the positive and the negative planes, which can be calculated from Eq. 7:

M = ||x^+ − x^−||_2 = √⟨λw, λw⟩    (7)
Figure 3: The decision boundary, the positive plane and the negative plane (computing the margin width).
Please derive the following according to Eq. 7:

M = 2 / √⟨w, w⟩

Hint: You can first represent λ in terms of w by using Eqs. 4, 5, and 6.
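A quick numerical check of this relation (the particular w and b below are arbitrary assumptions) constructs x^+ from x^− as in Eq. 6 and compares the two expressions for M:

    import numpy as np

    w = np.array([3.0, 4.0])   # assumed weights, for illustration only
    b = -2.0

    # Pick an arbitrary point x_minus on the negative plane w^T x + b = -1.
    x_minus = np.array([0.0, (-1 - b) / w[1]])

    # Solve w^T (x_minus + lam * w) + b = +1 for lam, giving x_plus on the positive plane.
    lam = (1 - b - w @ x_minus) / (w @ w)
    x_plus = x_minus + lam * w

    M = np.linalg.norm(x_plus - x_minus)
    print(M, 2 / np.sqrt(w @ w))   # the two values should agree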
6 (10 points) K Nearest Neighbors
Consider a training dataset S_training = {(xi, yi), i = 1, 2, ..., 8} where each data point
(x, y) has a feature vector x = [x1, x2]^T and the corresponding label y ∈ {−1, +1}.
The points with their corresponding labels are shown in the figure below. In the
figure, the index i of each training example xi is given in bold near the point. You
are asked to predict the label of the point xpred = [1, 0]^T, shown in the figure as a
triangle, using the k nearest neighbors (k-NN) method under Euclidean distance.
1. List the indices of all the data points in S_training and their corresponding labels.
2. Determine the predicted label for xpred using the k-NN with different k.
(a) k = 1.
(b) k = 3.
(c) k = 5.
7 (20 points) Coding: K Nearest Neighbors
In this problem, you need to implement the k nearest neighbors (k-NN) algorithm
and apply it to binary classification. Here we use the modified Iris dataset S =
{(xi, yi)} where each feature vector x ∈ R^2 and label y ∈ {−1, +1}. You are not
allowed to use sklearn.neighbors.KNeighborsClassifier() in your code, but you
can use it to validate your implementation.
• Load the modified Iris dataset. The dataset S is split into three subsets: the
training set S_training, the validation set S_validation and the test set S_test. In the
code, we use X_train and Y_train for the feature vectors and labels of the training
set, respectively. Similar notations are also used for the validation and the test
sets.
• Implement the k-NN algorithm in 3 steps (a minimal sketch is given after this list).
1. For each feature vector x whose label you are predicting, you need to calculate
the distances between this feature vector x and all the feature vectors
in the training set S_training.
2. Then sort all distances in ascending order and pick the labels corresponding
to the k minimum distances.
3. Count the number of negative labels N_{y=−1} and the number of positive
labels N_{y=+1} among the k labels picked in step 2. Use the following decision rule
to predict the label ŷ for each feature vector x:

ŷ = +1 if N_{y=−1} < N_{y=+1}, and ŷ = −1 if N_{y=−1} ≥ N_{y=+1}.

Here we assume Euclidean distance as the distance metric. For more details,
please refer to the code and the corresponding part in the slides.
• Use the validation set to obtain the optimal k*. In k-NN, there is a hyper-parameter
k which adjusts the number of nearest neighbors. You would need to perform
a grid search on the following list of k:

k ∈ {1, 2, 3}

For each k, you need to form a k-NN classifier with the training set S_training.
Then, use the classifier to make predictions on the validation set S_validation and
calculate the error e_validation. We aim to obtain the best hyper-parameter k*
corresponding to the minimum validation error e*_validation among all ks.
• Use the obtained classifier corresponding to the best hyper-parameter k* to
calculate the test error e_test on the test set S_test.
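As referenced above, here is a minimal sketch of one possible k-NN implementation under these steps. It is not the skeleton code itself; the function name knn_predict and the validation-array names X_val, Y_val in the commented usage are assumptions, while X_train and Y_train follow the naming described above:

    import numpy as np

    def knn_predict(X_query, X_train, Y_train, k):
        """Predict labels in {-1, +1} for each row of X_query using k-NN
        with Euclidean distance and the tie-breaking rule given above."""
        preds = np.empty(len(X_query), dtype=int)
        for i, x in enumerate(X_query):
            # Step 1: distances from x to every training point.
            dists = np.linalg.norm(X_train - x, axis=1)
            # Step 2: labels of the k nearest training points.
            nearest_labels = Y_train[np.argsort(dists)[:k]]
            # Step 3: majority vote; ties (N_{y=-1} >= N_{y=+1}) go to -1.
            n_pos = np.sum(nearest_labels == +1)
            n_neg = np.sum(nearest_labels == -1)
            preds[i] = +1 if n_neg < n_pos else -1
        return preds

    # Hypothetical grid search over k using the validation split:
    # errors = {k: np.mean(knn_predict(X_val, X_train, Y_train, k) != Y_val)
    #           for k in (1, 2, 3)}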
Please download the notebook knn.ipynb from the course website and fill in the
missing blanks. Follow the instructions in the skeleton code and report:
1. Your code.
2. Plot of the validation set along with the decision boundary (implicitly shown in the
background) corresponding to each k.
3. Validation error corresponding to each k.
4. The best hyper-parameter k* corresponding to the minimum validation error e*_validation.
5. Test error e_test corresponding to the best hyper-parameter k*.
8 (20 points) Coding: Decision Tree
In this problem, you need to implement the decision tree algorithm and apply it to
binary classification. Here we use the Ionosphere dataset S = {(xi, yi)} where
each feature vector x ∈ R^34 and label y ∈ {−1, +1}. You are allowed to use the
functions from scikit-learn in this question.
• Load the Ionosphere dataset. The dataset S is split into two subsets: the training
set S_training and the test set S_test. In the code, we use X_train and Y_train for the
feature vectors and labels of the training set, respectively. Similar notations are
also used for the test set.
• Train the decision tree classifier with the entropy criterion (see the sketch after
this list). In the decision tree, there is a hyper-parameter D which controls the
maximum depth. You would need to perform a grid search on the following list of D:

D ∈ {1, 2, 3, 4, 5}

For each D, you need to form a decision tree classifier with the training set
S_training. Specifically, you need to conduct a 10-fold cross-validation on S_training
and calculate the cross-validation error ē (i.e. the average validation error over
the splits in cross-validation). We aim to obtain the best hyper-parameter D*
corresponding to the minimum cross-validation error ē* among all Ds.
• Use the obtained classifier corresponding to the best hyper-parameter D* to
calculate the test error e_test on the test set S_test.
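As referenced above, a minimal scikit-learn sketch of this grid search; it assumes the arrays X_train, Y_train and (by the naming convention described above) X_test, Y_test are already loaded, and the random_state is an arbitrary choice:

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score

    # Hypothetical grid search: 10-fold CV error on the training set for each depth D.
    cv_errors = {}
    for D in (1, 2, 3, 4, 5):
        clf = DecisionTreeClassifier(criterion="entropy", max_depth=D, random_state=0)
        acc = cross_val_score(clf, X_train, Y_train, cv=10)   # accuracy per fold
        cv_errors[D] = 1.0 - acc.mean()                       # cross-validation error

    D_star = min(cv_errors, key=cv_errors.get)
    best = DecisionTreeClassifier(criterion="entropy", max_depth=D_star, random_state=0)
    best.fit(X_train, Y_train)
    test_error = 1.0 - best.score(X_test, Y_test)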
Please download the notebook decision-tree.ipynb and the data file ionosphere.npy
from the course website and fill in the missing blanks. Follow the instructions in the
skeleton code and report:
1. Your code.
2. A heatmap of cross-validation errors corresponding to all Ds.
3. The best hyper-parameter D* corresponding to the minimum cross-validation error ē*.
4. Test error e_test corresponding to the best hyper-parameter D*.