CM146, Problem Set 4: Kernel, SVM, K-Means, GMM
Submission Instructions
• Submit your solutions electronically on the course Gradescope site as PDF files.
• If you plan to typeset your solutions, please use the LaTeX solution template. If you must
submit scanned handwritten solutions, please use a black pen on blank white paper and a
high-quality scanner app.
• For questions involving math and derivation, please provide important intermediate steps and
box the final answer clearly.
• Please export the notebook to a .py file by clicking “File” → “Download .py” and upload it
to CCLE.
Your code should be commented appropriately. The most important things:
– Your name and the assignment number should be at the top of each file.
– Each class and method should have an appropriate docstring.
– If anything is complicated, it should include some comments.
There are many possible ways to approach the programming portion of this assignment, which
makes code style and comments very important so that staff can understand what you did.
For this reason, you will lose points for poorly commented or poorly organized code.
• Even though this assignment mainly involves calling sklearn packages, please copy-paste your
code as part of the solution in your Gradescope submission, along with all the plots and the
rest of the solutions (math derivations, justifications of the plots).
1 Kernels [30 pts]
One way to construct complex kernels is to build from simple ones. In the following, we will prove
a list of properties of kernels (Rule 1 – Rule 5) and then use these rules to show that the Gaussian
kernel k(x, z) = exp(−‖x − z‖^2 / σ^2) is a valid kernel.
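As a quick illustration of what "valid kernel" means here (this is not one of the rules you are asked to prove), consider the ordinary linear kernel, which admits the identity feature map:

k(x, z) = x^T z = Φ(x)^T Φ(z), with Φ(x) = x.

Because an explicit feature map exists, the linear kernel is valid; the rules below let you build more complex valid kernels, such as the Gaussian kernel, out of simple pieces like this.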
(a) (5 pts) Rule 1: Suppose we have x ∈ X, z ∈ X, g : X → R. Prove that k(x, z) = g(x) × g(z)
is a valid kernel by constructing a feature map Φ(·) and show that k(x, z) = Φ(x)^T Φ(z).
(b) (5 pts) Rule 2: Suppose we have a valid kernel k_1(x, z) = Φ_1(x)^T Φ_1(z). Prove that
k(x, z) = α · k_1(x, z), ∀α ≥ 0, is also a valid kernel by constructing a new feature map
Φ(·) using Φ_1(·) and show that k(x, z) = Φ(x)^T Φ(z).
(c) (5 pts) Rule 3: Suppose we have two valid kernels k_1(x, z) = Φ_1(x)^T Φ_1(z) and
k_2(x, z) = Φ_2(x)^T Φ_2(z). Prove that k(x, z) = k_1(x, z) + k_2(x, z) is also a valid kernel
by constructing a new feature map Φ(·) using Φ_1(·) and Φ_2(·) and show that
k(x, z) = Φ(x)^T Φ(z).
(d) (5 pts) Rule 4: Suppose we have two valid kernels k_1(x, z) = Φ_1(x)^T Φ_1(z) and
k_2(x, z) = Φ_2(x)^T Φ_2(z). Prove that k(x, z) = k_1(x, z) × k_2(x, z) is a valid kernel
by constructing a new feature map Φ(·) using Φ_1(·) and Φ_2(·) and show that
k(x, z) = Φ(x)^T Φ(z).
(e) (5 pts) Rule 5: Suppose we have a valid kernel k_1(x, z) = Φ_1(x)^T Φ_1(z). Prove that
exp(k_1(x, z)) is also a valid kernel by applying Rules 1–4 in problems 1(a)–1(d).
Hint:

exp(k_1(x, z)) = lim_{i→∞} [1 + k_1(x, z) + ... + k_1^i(x, z)/i!] = 1 + ∑_{i=1}^{∞} k_1^i(x, z)/i!,

where k_1^i(x, z) is k_1(x, z) raised to the power of i.
(f) (5 pts) Prove the Gaussian kernel k(x, z) = exp(−‖x − z‖^2 / σ^2) is a valid kernel by applying
Rules 1–5 in problems 1(a)–1(e).
2 SVM [25 pts]
Suppose we have the following six training examples. x1, x2, x3 are positive instances and x4, x5, x6
are negative instances. Note: we expect you to use a simple geometric argument to narrow down
the search and derive the same solution that SVM optimization would produce for the following two
questions. You do not need to write a program to solve this problem.
Example feature1 feature2 y
x1 1 7 1
x2 3 2 1
x3 4 10 1
x4 −1 −7 −1
x5 1 −1 −1
x6 3 −6 −1
(a) (5 pts) Suppose we are looking for a hard-SVM decision boundary w^T x_n + b = 0 passing
through the origin (i.e., b = 0). In other words, we minimize ‖w‖_2 subject to y_n w^T x_n ≥ 1,
n = 1, . . . , N. Identify the support vectors (data points that are actually used in the calculation
of w and the margin) in this training dataset.
(b) (5 pts) Following part (a), what is w* ∈ R^2 in this case, and what is the margin 1/‖w*‖_2?
(c) (15 pts) Suppose we now allow the offset parameter b to be non-zero. In other words, we
minimize ‖w‖_2 subject to y_n(w^T x_n + b) ≥ 1, n = 1, . . . , N. How would the classifier and
the actual margin change from the previous question? What are w*, b*, and 1/‖w*‖_2? Compare
your solutions with problem 2(b).
3 K-means [10 pts]
In this problem, we will first work through a numerical K-Means example on one-dimensional data
and show that K-Means does not guarantee a globally optimal solution.
(a) (5 pts) Consider the case where K = 3 and we have 4 data points x1 = 1, x2 = 2, x3 = 5,
x4 = 7. What is the optimal clustering for this data? What is the corresponding value of
the objective ∑_{n=1}^{N} ∑_{k=1}^{K} γ_{nk} ‖x_n − μ_k‖_2^2? (x_n denotes the n-th data point, μ_k denotes the
cluster mean of the k-th cluster, and γ_{nk} is a Boolean indicator variable that is set to 1 only
if the n-th data point belongs to the k-th cluster.)
(b) (5 pts) In K-Means, if we initialize the cluster centers as
c1 = 1, c2 = 2, c3 = 6,
what is the corresponding cluster assignment and the objective ∑_{n=1}^{N} ∑_{k=1}^{K} γ_{nk} ‖x_n − μ_k‖_2^2 at
the first iteration? Does the K-Means algorithm improve the clusters at the second iteration?
4 Twitter analysis using SVMs, KMeans, GMM [35 pts]
In this project, you will be working with Twitter data. Specifically, we have supplied you with a
number of tweets that are reviews/reactions to 4 different movies¹,
e.g., “@nickjfrost just saw The Boat That Rocked/Pirate Radio and I thought it was brilliant! You
and the rest of the cast were fantastic! <3”.
The original data can be found here if you are interested in skimming through the tweets to get a
sense of the data. We have preprocessed them for you and exported them to a tweet_df.txt file.
Specifically, we use a bag-of-words model to convert each tweet into a feature vector. A bag-of-words
model treats a text file as a collection of words, disregarding word order. The first step in
building a bag-of-words model involves building a “dictionary”. A dictionary contains all of the
unique words in the text file. For this project, we will be including punctuation in the dictionary
too. For example, a text file containing “John likes movies. Mary likes movies2!!” will have
the dictionary {'John':0, 'Mary':1, 'likes':2, 'movies':3, 'movies2':4, '.':5, '!':6}.
Note that the (key, value) pairs are (word, index), where the index keeps track of the number
of unique words (the size of the dictionary).
Given a dictionary containing d unique words, we can transform the n variable-length tweets into
n feature vectors of length d by setting the i-th element of the j-th feature vector to 1 if the i-th
dictionary word is in the j-th tweet, and 0 otherwise.
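For concreteness, here is a minimal sketch of how such binary bag-of-words features could be computed. It is illustrative only: the provided tweet_df.txt was produced by the course's own preprocessing, not by this snippet, and the whitespace tokenizer below would also need to split off punctuation such as '.' and '!' to match the example dictionary above.

import numpy as np

def build_dictionary(texts):
    """Map each unique token (including punctuation tokens) to an index."""
    word_to_index = {}
    for text in texts:
        for token in text.split():  # a real tokenizer would also separate punctuation
            if token not in word_to_index:
                word_to_index[token] = len(word_to_index)
    return word_to_index

def to_feature_vectors(texts, word_to_index):
    """Binary matrix: entry (j, i) is 1 if dictionary word i appears in tweet j."""
    X = np.zeros((len(texts), len(word_to_index)))
    for j, text in enumerate(texts):
        for token in text.split():
            if token in word_to_index:
                X[j, word_to_index[token]] = 1
    return X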
The preprocessed data contains 628 tweets on 4 different movies. Each line in the file contains
exactly one tweet, so there are 628 lines in total. If a tweet praises or recommends a movie, it is
classified as a positive review and labeled +1; otherwise it is classified as a negative review and
labeled −1. The labels for positive or negative reviews, as well as the labels for which movie each
tweet refers to, are also included in the file.
You will learn to automatically classify such tweets as either positive or negative reviews. To do
this, you will employ Support Vector Machines (SVMs), a popular choice for a large number of
classification problems. You will also learn to automatically cluster such tweets in a low-dimensional
space with Gaussian Mixture Models (GMM) and K-Means.
Next, please download the preprocessed version of the tweets data, tweet_df.txt, and upload it to
your Google Drive. For all the coding, please refer to the following Colab notebook: Fall2020-CS146-
HW4.ipynb. Please save a local copy to your own drive first and only edit the TODO blocks in the
notebook.
Documentation
• Support Vector Machine: link
• F1 score: link
• PCA: link
• KMeans: link
¹Please note that these data were selected at random, and thus the content of these tweets does not reflect the
views of the course staff. :-)
• Gaussian Mixture Model: link
• Adjusted Rand Index: link
4.1 Hyper-parameter Selection for a Linear-Kernel SVM [15 pts]
We will learn a classifier to separate the training data into positive and negative tweets. For the
classifier, we will use SVMs with the linear kernel. We will use the sklearn.svm.SVC class and
explicitly set only two of the initialization parameters: kernel set to 'linear', and C. As usual, we
will use SVC.fit(X, y) to train our SVM, and SVC.predict(X) to make predictions.
SVMs have hyperparameters that must be set by the user. For the linear-kernel SVM, we will select
the hyper-parameter using a simple train set and a development set. In the notebook, the train,
dev, test split is already done for you. Please do NOT change that split on your own, as we will
evaluate all answers under the split we provided in the notebook. We will train several different
models with different hyper-parameters using the train set and select the hyper-parameter that
leads to the 'best' performance on the dev set.
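As a rough sketch of the intended API usage (the variable names X_train, y_train, X_dev, y_dev are placeholders for the split already provided in the notebook, and plotting is omitted):

from sklearn.svm import SVC
from sklearn.metrics import f1_score

# Sketch: train one linear-kernel SVM per candidate C and score it on the dev set.
for C in [1e-3, 1e-2, 1e-1, 1, 10, 100, 1000]:
    clf = SVC(kernel='linear', C=C)
    clf.fit(X_train, y_train)                       # train on the train split
    train_f1 = f1_score(y_train, clf.predict(X_train))
    dev_f1 = f1_score(y_dev, clf.predict(X_dev))    # compare across C to pick the 'best' one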
(a) (10 pts) Please use the F1-score as the performance measure. Train the linear SVM with
different values of C: C = 10^{-3}, 10^{-2}, 10^{-1}, 1, 10, 10^2, 10^3. Plot the train F1 score and the dev
F1 score against the different choices of C. Include this plot in your writeup, and provide a 1-2
sentence description of your observations. What is the effect of C? What is the best value of
C? What is the dev set F1-score when using the best C? Please also copy-paste your code
as part of the solution for this problem.
(b) (5 pts) Retrain the model with the best C on the train and dev sets together. Report the
F1-score on the test set. Please also copy-paste your code as part of the solution for
this problem.
4.2 Low Dimensional Embedding Space Clustering [20 pts]
In this section, we will explore two unsupervised clustering algorithms: KMeans and GMM. The
preprocessed Twitter data lives in a high (1811) dimensional space, but it is hard for us to visualize
more than three-dimensional objects. Let's project the data into a low-dimensional embedding
space with a commonly used algorithm: Principal Component Analysis (PCA). We have already
provided the code to do that. Please run it and from now on work with the X_embedding variable,
which is 628 by 2 instead of 628 by 1811. If you are interested in the math behind PCA, here are
some good reading materials: A One-Stop Shop for Principal Component Analysis, and the Wikipedia
page for PCA. However, for this project you are not required to know the math behind it. Intuitively,
this algorithm extracts the most useful linear combinations of the features so as to reduce the
information loss incurred when projecting from high (1811) dimensions to low (2) dimensions.
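The notebook already contains this projection step; the following is only a sketch of what it roughly looks like, assuming X is the 628-by-1811 bag-of-words matrix:

from sklearn.decomposition import PCA

# Sketch: project the 628 x 1811 bag-of-words matrix down to 628 x 2.
pca = PCA(n_components=2)
X_embedding = pca.fit_transform(X)   # keep working with X_embedding from here on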
We will also evaluate the purity of the clusters returned by any unsupervised algorithm against
the ground-truth labels from the original tweet_df.txt.
Here we will use sklearn.metrics.adjusted_rand_score. At a high level, this metric measures
what fraction of the examples are labeled correctly, but normalized so that if the cluster
assignment is near random, the adjusted Rand score will be close to 0. If the assignment is
near perfect, the adjusted Rand score should be close to 1.
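Its usage is a one-liner; for example (labels_true and labels_pred are placeholder names for arrays of ground-truth and predicted cluster labels):

from sklearn.metrics import adjusted_rand_score

# The score ignores the actual label values and only compares how points are grouped;
# near 0 for a random assignment, near 1 for a perfect one.
score = adjusted_rand_score(labels_true, labels_pred)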
(a) (5 pts) Visualize the low-dimensional data by calling the function plot_scatter. First
label/color the scatter plot by positive or negative reviews. Then label/color each tweet/dot by
which movie that review refers to. Do positive reviews and negative reviews form two
nice clusters? Do the 4 different movies form nice clusters? Include both plots in your writeup.
Please also copy-paste your code as part of the solution for this problem.
(b) (7.5 pts) Use the sklearn.cluster.KMeans class on X_embedding with n_clusters set
to 4, init set to 'random', n_init set to 1, and random_state set to 2. Visualize the low-dimensional
data by labeling/coloring each tweet with the KMeans results. Calculate the adjusted
Rand score.
Use the sklearn.mixture.GaussianMixture class on X_embedding with n_components set
to 4, random_state set to 0, and init_params set to 'random'. Visualize the low-dimensional
data by labeling/coloring each tweet with the GMM results. Calculate the adjusted Rand score.
Visually, do you think they perform poorly or great? What are the adjusted Rand scores for
both the KMeans labels and the GMM labels? Why might that be the case? Include both plots,
the performance results, and a 1-2 sentence description/justification of what you saw in your
writeup. Please also copy-paste your code as part of the solution for this problem.
(c) (7.5 pts) Use the sklearn.cluster.KMeans class on X_embedding with n_clusters set to
4, init set to 'random', n_init set to 100, and random_state set to 2. Visualize the low-dimensional
data by labeling/coloring each tweet with the results of KMeans over 100 different starting
points. Calculate the adjusted Rand score.
Use the sklearn.mixture.GaussianMixture class on X_embedding with n_components set
to 4, random_state set to 0, init_params set to 'random', and n_init set to 100. Visualize
the low-dimensional data by labeling/coloring each tweet with the results of this randomly
initialized GMM with 100 different starting points. Calculate the adjusted Rand score.
Do you see a huge performance gain over what we got from part (b)? Include both plots,
the performance results, and a 1-2 sentence description/justification of what you saw in your
writeup. Please also copy-paste your code as part of the solution for this problem. (A bare-bones
sketch of the relevant constructor calls is given below.)
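For parts (b) and (c), the constructor calls are sketched below with the parameter settings stated above; y_movie is a placeholder name for the ground-truth movie labels from tweet_df.txt, and the coloring/plotting via the provided plot_scatter function is omitted.

from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

# Part (b) settings; part (c) only changes n_init to 100 for both models.
kmeans_labels = KMeans(n_clusters=4, init='random', n_init=1,
                       random_state=2).fit_predict(X_embedding)
gmm_labels = GaussianMixture(n_components=4, random_state=0,
                             init_params='random').fit_predict(X_embedding)

# Compare each clustering against the ground-truth movie labels.
print(adjusted_rand_score(y_movie, kmeans_labels))
print(adjusted_rand_score(y_movie, gmm_labels))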