Department of Electrical Engineering and Computer Science
EECS 445 Introduction to Machine Learning
Homework 2, Due: Friday, October 18 at 11:59pm
1 Support Vectors [10 pts]
1.1 SVM Primal [4 pts]
Suppose we are looking for a maximum hard-margin (no slack variables) linear classifier through the origin, i.e., $b = 0$, with binary labels in $\{-1, 1\}$. In other words, we minimize $\frac{\|\bar{\theta}\|^2}{2}$ subject to $y^{(i)}\,\bar{\theta} \cdot \bar{x}^{(i)} \ge 1$, $i = 1, \ldots, n$.
a) (2pts) Given a single training vector $\bar{x} = [a_1, a_2]^T$ with label $y = -1$, what is the $\bar{\theta}^*$ that satisfies the above constrained minimization?

Hint: Try thinking about this problem geometrically.
b) (1pt) Suppose we have two training examples, $\bar{x}^{(1)} = [-2, -3]^T$ and $\bar{x}^{(2)} = [-4, -3]^T$, with labels $y^{(1)} = 1$ and $y^{(2)} = -1$. What is $\bar{\theta}^*$ in this case, and what is the margin $\gamma$ to the support vector?
c) (1pt) How would the classifier and the margin in the previous question change if the offset parameter $b$ were allowed to be non-zero? What are $(\bar{\theta}^*, b^*)$ and $\gamma$ in this case?
1.2 SVM Dual [6 pts]
a) (2pts) The dual SVM optimization problem with an offset parameter can be written as:

$$\max_{\bar{\alpha}} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y^{(i)} y^{(j)} K(\bar{x}^{(i)}, \bar{x}^{(j)})$$

$$\text{subject to } \alpha_i \ge 0 \;\; \forall i, \qquad \sum_{i=1}^{n} \alpha_i y^{(i)} = 0,$$

where the kernel function is $K(\bar{x}, \bar{x}') = \bar{x} \cdot \bar{x}'$ (linear kernel).
Derive the above dual optimization formulation. Note that we include the offset parameter. Start from the following problem, in which the order of the maximum and the minimum is swapped relative to the primal problem:

$$\max_{\bar{\alpha}} \; \min_{\bar{\theta}, b} \; \frac{1}{2}\|\bar{\theta}\|^2 + \sum_{i=1}^{n} \alpha_i \left(1 - y^{(i)}(\bar{\theta} \cdot \bar{x}^{(i)} + b)\right)$$

$$\text{subject to } \alpha_i \ge 0 \;\; \forall i.$$
For the following questions, we will work with the simple case of only two 2-dimensional training examples $\bar{x}^{(1)}, \bar{x}^{(2)} \in \mathbb{R}^2$ with labels $y^{(1)} = 1$ and $y^{(2)} = -1$. We further assume that $\|\bar{x}^{(1)}\| = \|\bar{x}^{(2)}\| = 1$.
b) (1pt) What are the resulting optimal values of the Lagrange multipliers $\alpha_1^*$ and $\alpha_2^*$? (Hint: use the constraints first, and write the optimization problem in terms of $\alpha_1^*$ alone.) Make sure that you use the condition on the norm of the training examples.
c) (1pt) Using (a), write the optimal parameter value $\bar{\theta}^*$ in terms of the training examples only ($\bar{x}^{(1)}$ and $\bar{x}^{(2)}$).
d) (1pt) What is the offset parameter $b^*$?
e) (1pt) What is the distance (margin $\gamma$) from the decision boundary to the support vector?
2 Kernels [8 pts]
If a test vector $\bar{x}$ is similar to a training vector $\bar{x}^{(i)}$, then we would like to predict a label $y$ similar to the corresponding training label $y^{(i)}$. So, we would like a function (that is, a kernel) $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ that tells us how similar any two given vectors are.
For instance, the dot product, which underlies our linear models, measures how closely two vectors
align. But if another notion of similarity is more appropriate, we might substitute it for every dot product in
the linear model to get a new, nonlinear model. That process of substitution is called kernelization.
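To make the substitution idea concrete, here is a small, self-contained illustration (not part of the assignment) using the standard degree-2 polynomial kernel on $\mathbb{R}^2$; the function names below are ours, not from any skeleton code:

```python
import numpy as np

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel k(x, z) = (x . z)^2 on R^2."""
    return np.dot(x, z) ** 2

def poly2_features(x):
    """Explicit feature map satisfying poly2_features(x) . poly2_features(z) = (x . z)^2."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

# The kernel evaluates the feature-space dot product without building the features.
x, z = np.random.randn(2), np.random.randn(2)
print(np.isclose(poly2_kernel(x, z), np.dot(poly2_features(x), poly2_features(z))))
```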
In lecture, we kernelized SVMs. Now let’s kernelize the Perceptron.
2.1 From feature mapping to kernel [2 pts]
We have two kernels, $K_1(\bar{x}, \bar{z}) = \bar{x} \cdot \bar{z}$ and $K_2(\bar{x}, \bar{z}) = \,?$, where $\bar{x}, \bar{z} \in \mathbb{R}^2$. The first kernel is defined for us, but the second kernel is unknown. We can define a third kernel, which is the product of the first two: $K_3(\bar{x}, \bar{z}) = K_1(\bar{x}, \bar{z})\, K_2(\bar{x}, \bar{z})$. The feature mapping corresponding to $K_3(\bar{x}, \bar{z})$ is given:
$$\phi(\bar{x}) = \left[\, x_1^3,\; \sqrt{2}\, x_1^2 x_2,\; x_1 x_2^2,\; x_1^2 x_2,\; \sqrt{2}\, x_1 x_2^2,\; x_2^3,\; \sqrt{6}\, x_1^2,\; 2\sqrt{3}\, x_1 x_2,\; \sqrt{6}\, x_2^2,\; 3x_1,\; 3x_2 \,\right]$$
In other words, $K_3(\bar{x}, \bar{z}) = \phi(\bar{x}) \cdot \phi(\bar{z})$. What is $K_2(\bar{x}, \bar{z})$?
Hint: Group terms by degree and simplify from there.
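If you want to sanity-check a candidate answer numerically, a sketch along the following lines may help; phi simply copies the feature map above, and check_candidate is a hypothetical helper, not part of the assignment:

```python
import numpy as np

def phi(x):
    """The feature map given in the problem statement (11 components)."""
    x1, x2 = x
    return np.array([
        x1**3, np.sqrt(2) * x1**2 * x2, x1 * x2**2,
        x1**2 * x2, np.sqrt(2) * x1 * x2**2, x2**3,
        np.sqrt(6) * x1**2, 2 * np.sqrt(3) * x1 * x2, np.sqrt(6) * x2**2,
        3 * x1, 3 * x2,
    ])

def K1(x, z):
    """Linear kernel."""
    return np.dot(x, z)

def check_candidate(candidate_K2, trials=1000):
    """Return True if K1 * candidate_K2 matches phi(x) . phi(z) on random points."""
    for _ in range(trials):
        x, z = np.random.randn(2), np.random.randn(2)
        if not np.isclose(np.dot(phi(x), phi(z)), K1(x, z) * candidate_K2(x, z)):
            return False
    return True

# Example call once you have a guess (the lambda is a placeholder, not the answer):
# print(check_candidate(lambda x, z: ...))
```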
2.2 Kernelizing the Perceptron [4 pts]
Recall the algorithm for the linear perceptron without offset:

1  Initialization: $\bar{\theta} = \bar{0}$
2  Repeat until convergence:
3      for $i = 1, \ldots, N$:
4          if $y^{(i)}(\bar{\theta} \cdot \bar{x}^{(i)}) \le 0$ then:
5              $\bar{\theta} \leftarrow \bar{\theta} + y^{(i)}\bar{x}^{(i)}$
6  Classification function of a test point $\bar{x}$:
7      $h(\bar{x}) = \operatorname{sign}(\bar{\theta} \cdot \bar{x})$
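For reference, a direct Python translation of this pseudocode might look like the following sketch (convergence is taken to mean a full pass with no mistakes; the function names are ours, not from the assignment):

```python
import numpy as np

def perceptron_train(X, y, max_epochs=100):
    """Linear perceptron without offset. X is (N, d); labels y are in {-1, +1}."""
    theta = np.zeros(X.shape[1])                 # line 1: theta = 0
    for _ in range(max_epochs):                  # line 2: repeat until convergence
        mistakes = 0
        for i in range(X.shape[0]):              # line 3: loop over examples
            if y[i] * np.dot(theta, X[i]) <= 0:  # line 4: misclassified?
                theta = theta + y[i] * X[i]      # line 5: perceptron update
                mistakes += 1
        if mistakes == 0:
            break
    return theta

def perceptron_predict(theta, x):
    return np.sign(np.dot(theta, x))             # line 7: h(x) = sign(theta . x)
```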
We would like to kernelize this simple perceptron algorithm. This can be done by mapping each example $\bar{x}$ into a feature vector $\phi(\bar{x})$ and reformulating the algorithm so that it only uses the kernel function $k(\bar{x}, \bar{x}')$ corresponding to $\phi(\bar{x}) \cdot \phi(\bar{x}')$. This will require a number of changes to our original algorithm outlined above.
(a) (1pt) First, let's think about how we might update the conditional statement (i.e., line 4, the one that checks whether or not point $\bar{x}^{(i)}$ is misclassified). In the algorithm above, $\bar{\theta} \cdot \bar{x}^{(i)}$ is the classifier's continuous output. However, once we introduce a kernel, $\bar{\theta}$ may lie in a possibly infinite-dimensional space. So, we need to find a new way to calculate the classifier's continuous output, one that does not explicitly depend on $\bar{\theta}$.

Suppose each $\bar{x}^{(i)}$ is misclassified $\alpha_i$ times during training. Given $k(\bar{x}^{(i)}, \bar{x})$ and $\alpha_i$ for each $i$, and the training data $\{\bar{x}^{(i)}, y^{(i)}\}$, how would you classify the test point $\bar{x}$?
Hint: Review HW1 problem 4(a).
(b) (2pts) Now, let's consider the update (line 5 above). We seek to update the parameters of our model. Again, we can't update $\bar{\theta}$ explicitly since it may lie in an infinite-dimensional feature space. Based on your answer for part (a), how should we update the parameters?
(c) (1pt) Finally, complete the kernelized perceptron algorithm by updating steps (1), (4), (5), (7) appropriately.
1  Initialization: ________________________________
2  Repeat until convergence:
3      for $i = 1, \ldots, N$:
4          if ___________ then:
5              _____________________
6  Classification function of a test point $\bar{x}$:
7      _____________________
2.3 Multi-class Kernelized Perceptron [2 pts]
Let's consider a variant of the perceptron for a 3-class problem. Each label $y$ can take on a value of 1, 2, or 3. We have $n$ training examples $\{\bar{x}^{(i)}, y^{(i)}\}_{i=1}^{n}$, where $\bar{x}^{(i)} \in \mathbb{R}^d$ and $y^{(i)} \in \{1, 2, 3\}$. Our multi-class perceptron algorithm is as follows:
Initialization: For $y \in \{1, 2, 3\}$, set $\bar{\theta}_y = \bar{0}$
Repeat until convergence:
    for $i = 1, \ldots, n$:
        set $z = \arg\max_{y \in \{1,2,3\}} \bar{\theta}_y \cdot \bar{x}^{(i)}$
        if $z \ne y^{(i)}$:
            $\bar{\theta}_{y^{(i)}} = \bar{\theta}_{y^{(i)}} + \bar{x}^{(i)}$
            $\bar{\theta}_z = \bar{\theta}_z - \bar{x}^{(i)}$
Classification function of a test point $\bar{x}$:
    $h(\bar{x}) = \arg\max_{y \in \{1,2,3\}} \bar{\theta}_y \cdot \bar{x}$
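For reference, a direct (non-kernelized) Python translation of the algorithm above might look like the sketch below; it is not the kernelized version you are asked to derive, and the function names are ours:

```python
import numpy as np

def multiclass_perceptron_train(X, y, n_classes=3, max_epochs=100):
    """Multi-class perceptron as in the pseudocode above.
    X is (n, d); labels y take integer values in {1, ..., n_classes}."""
    thetas = np.zeros((n_classes, X.shape[1]))   # one theta_y per class
    for _ in range(max_epochs):                  # repeat until convergence
        mistakes = 0
        for i in range(X.shape[0]):
            z = np.argmax(thetas @ X[i]) + 1     # predicted class (1-indexed)
            if z != y[i]:
                thetas[y[i] - 1] += X[i]         # reward the correct class
                thetas[z - 1] -= X[i]            # penalize the predicted class
                mistakes += 1
        if mistakes == 0:
            break
    return thetas

def multiclass_perceptron_predict(thetas, x):
    return np.argmax(thetas @ x) + 1
```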
We would like to derive a kernelized version of this algorithm. Assume the kernel we will use is $k(\bar{x}^{(i)}, \bar{x}^{(j)})$. Complete the algorithm below to kernelize the multi-class perceptron algorithm above:
Initialization: For $y \in \{1, 2, 3\}$ and $i = 1, \ldots, n$, set $\alpha_{i,y} = 0$
Repeat until convergence:
    for $i = 1, \ldots, n$:
        set $z =$ _____________________________________ (1pt)
        if $z \ne y^{(i)}$:
            _____________________________________ (1pt)
            _____________________________________
Classification function of a test point $\bar{x}$:
    $h(\bar{x}) = \arg\max_{y \in \{1,2,3\}} \sum_{i=1}^{n} \alpha_{i,y}\, k(\bar{x}^{(i)}, \bar{x})$
3 Entropy [2 pts]
Feature                                                  Classification
Ave Speed    Size of Largest Drop    Length of Line      Thrilled by Roller Coaster
slow         medium                  decent              Yes
fast         medium                  long                No
slow         small                   decent              No
slow         large                   decent              Yes
fast         small                   short               No
slow         large                   long                Yes
fast         medium                  decent              Yes
fast         medium                  short               Yes
fast         small                   long                No
a) (1pt) Consider the table above, which maps variables pertaining to a customer's experience to whether or not they were thrilled by their experience with the roller coaster. We consider a categorical representation (as opposed to an ordinal representation). Which feature(s) result in the highest conditional entropy $H(Y \mid X)$, where $Y$ is the outcome and $X$ is one of the features (Ave Speed, Size of Largest Drop, and Length of Line)?
b) (1pt) Calculate the information gain $IG(X, Y)$ for each of the features and the outcome. Which feature(s) result in the highest information gain?
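For reference, assuming the standard definitions from lecture (with logarithms base 2), the conditional entropy and information gain are

$$H(Y \mid X) = \sum_{x} P(X = x)\, H(Y \mid X = x) = -\sum_{x} P(X = x) \sum_{y} P(Y = y \mid X = x) \log_2 P(Y = y \mid X = x),$$

$$IG(X, Y) = H(Y) - H(Y \mid X).$$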
4 Ensemble Methods [5 pts]
We will study the performance of two ensemble methods on the very popular MNIST dataset of handwritten digits. This dataset contains 70,000 samples (each a 28 × 28 grayscale image having 784 features) of handwritten digits, classified into 10 classes (0 through 9). Here, we will consider a subset of the data pertaining to four classes, which you can fetch using the provided load_mnist(classes) function. Please be aware that, due to the relatively large dataset size, the code will take a longer time to run.

Within HW2_ensemble.py, the following functions have been implemented for you:
• load_mnist(classes)
• get_median_performance(X, y, m_vals, nsplits=50)
• plot_data(bagging_scores, random_forest_scores, m_range)
It also contains the following function declarations that you will implement:
• bagging_ensemble(X_train, y_train, X_test, y_test, n_clf=10)
• random_forest(X_train, y_train, X_test, y_test, m, n_clf=10)
Note: If you get one of the following errors or something similar:

– TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
– ConnectionResetError: [Errno 54] Connection reset by peer

it is likely that either the server hosting the data is down or the API has been removed. The workaround summarized below should work:

1. Find out where sklearn keeps its downloaded data with the following:
   from sklearn.datasets import get_data_home
   print(get_data_home())
2. Download the data file from this link to the directory found in step 1:
   https://github.com/amplab/datascience-sp14/blob/master/lab7/mldata/mnist-original.mat
a) (2pts) Implement random_forest(X_train, y_train, X_test, y_test, m, n_clf=10) based on the specification provided in the skeleton code. A random forest consists of n_clf decision trees, where each decision tree is trained independently on a bootstrap sample of the training data (for this problem, n_clf=10). At each node, we randomly select m features as candidates for splitting.

Here, the final prediction of the ensemble is determined by a majority vote of these n_clf decision trees. In the case of ties, randomly sample among the plurality classes (i.e., the classes that are tied) to choose a label.

You should use the sklearn.tree.DecisionTreeClassifier class. Set criterion='entropy' to avoid the default setting. Also, see the max_features parameter within this class.

Note: Do not set the max_depth parameter. Remember that you are free to use helper functions to keep your code organized.
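As a rough, non-authoritative illustration of the pieces named above (bootstrap sampling, criterion='entropy', and max_features), fitting a single randomized tree might look like the sketch below; the helper name fit_bootstrap_tree is ours, not from the skeleton code:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_bootstrap_tree(X_train, y_train, m):
    """Fit one decision tree on a bootstrap sample, considering m random
    features per split. A sketch only; adapt it to the skeleton's specification."""
    n = X_train.shape[0]
    # Draw n indices with replacement to form the bootstrap sample.
    idx = np.random.choice(n, size=n, replace=True)
    clf = DecisionTreeClassifier(criterion='entropy', max_features=m)
    clf.fit(X_train[idx], y_train[idx])
    return clf
```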
b) (1pt) Implement bagging_ensemble(X_train, y_train, X_test, y_test, n_clf=10) based on the specification provided in the skeleton code. Like the random forest, a bagging ensemble classifier consists of n_clf decision trees, where each decision tree is trained independently on a bootstrap sample of the training data. However, all features are considered at every split.

Again, the final prediction is determined by a majority vote of these n_clf decision trees.
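A minimal sketch of a majority vote with random tie-breaking among the plurality classes (as described in part (a)) is shown below; the helper name majority_vote and the array layout are assumptions, not from the skeleton code:

```python
import numpy as np

def majority_vote(predictions):
    """Majority vote over an (n_clf, n_samples) array of per-tree predictions,
    breaking ties uniformly at random among the tied (plurality) classes."""
    n_samples = predictions.shape[1]
    y_pred = np.empty(n_samples, dtype=predictions.dtype)
    for j in range(n_samples):
        labels, counts = np.unique(predictions[:, j], return_counts=True)
        tied = labels[counts == counts.max()]   # plurality classes
        y_pred[j] = np.random.choice(tied)      # random tie-break
    return y_pred
```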
c) (2pts) Now, we will compare the performance of these ensemble classifiers using get_median_performance(). Measure the median performance across 50 random splits of the digits dataset into training (80%) and test (20%) sets, and plot the returned performance for each algorithm over the range of m values specified in the skeleton code (use the plot_data() function). See Figure 1 for an example of the plot (yours may differ due to randomness, but the general trend should be the same). How does the average test performance of the two methods compare as we vary m (the size of the randomly selected subset of features)?
Figure 1: Accuracy of the random forest and bagging ensemble methods for m = 1, 30, 59, ..., 784
REMEMBER: Submit your completed assignment to Gradescope by 11:59pm on October 18, 2019.