$30
Machine Learning
Homework 5
Version 1
This homework contains 3 questions. The last two questions require programming. The maximum
number of points is 100 plus 10 bonus points.
1 Question 1 – Boosting (30 points)
We learned about boosting in lecture and the topic is covered in Murphy 16.4. On page 555 Murphy claims
that “it was proved that one could boost the performance (on the training set) of any weak learner arbitrarily
high, provided the weak learned could always perform slightly better than chance.” We will now verify this
in the AdaBoost framework.
1. (7 points) Given a set of N observations (x
j
, yj
) where y
j
is the label y
j ∈ {−1, 1}, let ht(x) be the
weak classifier at step t and let αt be its weight. First we note that the final classifier after T steps is
defined as:
H(x) = sgn (X
T
t=1
αtht(x)
)
= sgn{f(x)}
Where
f(x) = X
T
t=1
αtht(x)
Show that:
?Training =
1
N
X
N
j=1
δ(H(x
j
) 6= y
j
) ≤
1
N
X
N
j=1
exp(−f(x
j
)y
j
)
Where δ(H(x
j
) 6= y
j
) is 1 if H(x
j
) 6= y
j
and 0 otherwise.
2. (8 points) The weight for each data point j at step t + 1 can be defined recursively by:
w
(t+1)
j =
w
(t)
i
exp(−αty
jht(x
j
))
Zt
Where Zt
is a normalizing constant ensuring the weights sum to 1:
Zt =
X
N
j=1
w
t
j
exp(−αty
jht(x
j
))
Show that:
1
N
X
N
j=1
exp(−f(x
j
)y
j
) = Y
T
t=1
Zt
1
3. (15 points) We showed above that training error is bounded above by QT
t=1 Zt
. At step t the values
Z1, Z2, . . . , Zt−1 are already fixed therefore at step t we can choose αt
to minimize Zt
. Let
?t =
X
N
j=1
w
t
j
δ(ht(x
j
) 6= y
j
)
be the weighted training error for weak classifier ht(x) then we can re-write the formula for Zt as:
Zt = (1 − ?t) exp(−αt) + ?t exp(αt)
(a) First find the value of αt
that minimizes Zt
then show that
Z
opt
t = 2p
?t(1 − ?t)
(b) Assume we choose Zt
this way. Then re-write ?t =
1
2 − γt where γt 0 implies better than
random and γt < 0 implies worse than random. Then show that:
Zt ≤ exp(−2γ
2
t
)
You may want to use the fact that log(1 − x) ≤ −x for 0 ≤ x < 1
Thus we have:
?training ≤
Y
T
t=1
Zt ≤ exp(−2
X
T
t=1
γ
2
t
)
(c) Finally, show that if each classifier is better than random (e.g. γt ≥ γ for all t and γ 0) that:
?training ≤ exp(−2T γ2
)
Which shows that the training error can be made arbitrarily small with enough steps.
2 Programming Question (clustering with K-means) [30 points]
In class we discussed the K-means clustering algorithm. Your programming assignment this week is to
implement the K-means algorithm on digit data.
2.1 The Data
There are two files with the data. The first digit.txt contains the 1000 observations of 157 pixels
(a subset of the original 785) from images containing handwritten digits. The second file labels.txt
contains the true digit label (either 1, 3, 5, or 7). You can read both data files in with
X = load(‘../hw5data/digit/digit.txt’);
Y = load(‘../hw5data/digit/labels.txt’);
Please note that there aren’t IDs for the digits. Please assume the first line is ID 1, the second line is ID
2, and so on. The labels correspond to the digit file, so the first line of labels.txt is the label for the digit in
the first line of digit.txt.
2
2.2 The algorithm
Your algorithm should be implemented as follows:
1. Select k starting centers that are points from your data set. You should be able to select these centers
randomly or have them given as a parameter.
2. Assign each data point to the cluster associated with the nearest of the k center points.
3. Re-calculate the centers as the mean vector of each cluster from (2).
4. Repeat steps (2) and (3) until convergence or iteration limit.
Define convergence as no change in label assignment from one step to another or you have iterated 20
times (whichever comes first). Please count your iterations as follows: after 20 iterations, you should have
assigned the points 20 times.
2.3 Within group sum of squares
The goal of clustering can be thought of as minimizing the variation within groups and consequently maximizing the variation between groups. A good model has low sum of squares within each group.
We define sum of squares in the traditional way. Let Ck be the k
th cluster and let µk be the mean of the
observations in cluster Ck. Then the within group sum of squares for cluster Ck is defined as:
SS(k) = X
i∈Ck
||xi − µk
||2
Please note that the term ||xi − µk
||2
is the euclidean distance between xi and µk
.
If there are K clusters total then the “total within group sum of squares” is just the sum of all K of these
individual SS(k) terms.
2.4 Pair-counting measures
Given that we know the actual assignment labels for each data point we can attempt to analyze how well the
clustering recovered this. WE will performance criteria which are based on two principles: i) two data points
that belong to the same class should be assigned to the same cluster; and ii) two data points that belong to
different classes should be assigned to different clusters. Formally speaking, consider all pairs of same-class
data points, let p1 be the percentage of pairs of which both data points were assigned to the same cluster.
Consider all pairs of different-class data points, let p2 be the percentage of pairs of which two data points are
assigned to different clusters. Let p3 be the average of these two values p3 = (p1+p2)/2, which summarizes
the clustering performance.
2.5 Questions
When you have implemented the algorithm please report the following:
1. [8pts] The values of the total within group sum of squares and pair-counting measures (p1, p2, p3) for
k = 2, k = 4 and k = 6. Start your centers with the first k points in the dataset. So, if k = 2, your
initial centroids will be ID 1 and ID 2, which correspond to the first two lines in the file.
2. [7pts] The number of iterations that k-means ran for k = 6, starting the centers as in the previous item.
Make sure you count the iterations correctly. If you start with iteration i = 1 and at i = 4 the cluster
assignments don’t change, the number of iterations was 4, as you had to do step 2 four times to figure
this out.
3
3. [7pts] A plot of the total within group sum of squares versus k for k = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. Start
your centers randomly (choose k points from the dataset at random).
4. [8pts] A plot of p1, p2, p3 versus k for k = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. Start your centers randomly
(choose k points from the dataset at random).
For the last two items, you should run k-means algorithm several times (e.g., 10 times) and average the
results. For each question, submit a single plot, which is the average of the runs.
3 Programming Question (scene classification) [40 points + 10 bonus]
This question gives you the opportunities to learn LibSVM, which is one of the best software tool for classification problem. You will train SVMs with different kernels and use them to classify scenes from The
Big Bang Theory, your favorite TV series. To classify an image, you will use Bag-of-Word representation,
which is one of the most popular image representation. This representation views an images as the histogram of image features, or visual words. You MUST use LibSVM and your K-means implementation
from Question 3.
3.1 LibSVM installation
First download LibSVM https://www.csie.ntu.edu.tw/˜cjlin/libsvm/. Follow the instruction in README to install LibSVM for Matlab or Python. Two main functions of LibSVM that you should
pay attention to are: svmtrain and svmpredict. Note that Matlab also has a machine learning toolbox
that comes with these two functions with exactly the same names. However, Matlab’s SVM implementation is not as good as LibSVM, so you need to make sure that you are using svmtrain and svmpredict from
LibSVM. To check if you have installed the program correctly, in Matlab do:
which svmtrain
which svmpredict
Matlab should return the paths to the svmtrain and svmpredict of LibSVM. To learn how to use these
functions, type the names of the function in Matlab:
svmtrain
svmpredict
3.2 Data
Training and test images are provided in the subdirectory bigbangtheory. The training image ids and
labels are given in train.mat. This file contains two variables: imgIds and lbs. imgIds is a column
vector and each row has a name of image in the training set. lbs is a matrix denoting the label for the
image with the corresponding index. There are total 8 classes for the dataset: living room (1), kitchen (2),
hallway (3), Penny’s living room (4), cafeteria (5), Cheesecake factory (6), laundry room (7), and comic
bookstore (8).
Validation set is not provided for this question. You have to do cross validation to find the parameter for
the best performance. You can implement cross validation by yourself, or you can use LibSVM functionality.
Image ids for test set are given in test.mat.
3.3 Helper functions
A number of Matlab helper functions are provided. Also, some functions are left unfinished and you have to
complete them.
1. Use HW5 BoW.learnDictionary() to learn visual vocabulary for scene representation. You
have to fill your K-means implementation from Question 3 in this function.
4
2. Use HW5 BoW.cmpFeatVecs() to compute feature vectors. This function will return the histogram
of visual words for each image, which is our feature vector.
3.4 What to implement?
1. Complete the function HW5 BoW.learnDictionary(). This function learns a visual vocabulary
by running k-means on random image patches. Learn a visual dictionary with K = 1000 clusters.
2. (10 points) Using SVM with RBF kernel, report the 5-fold cross-validation accuracy for the default
kernel parameters for C and γ.
3. (10 points) Tune C and γ for SVM with RBF kernel using 5-fold cross validation. Report the values
of C, γ, and the 5-fold cross-validation accuracy.
4. Unfortunately, LibSVM does not support exponential X
2 kernel, so you will need to implement it.
Implement a function:
[trainK, testK] = cmpExpX2Kernel(trainD, testD, gamma)
that takes train and test data and the kernel parameter gamma and return the train and test kernels.
Recall that the exponential X
2 kernel value for two d-dimensional feature vector x and y is:
k(x, y) = exp
−
1
γ
X
d
i=1
(xi − yi)
2
xi + yi + ?
!
(1)
The ? term is added to avoid division by 0.
5. (10 points) LibSVM allows using pre-computed kernels. Train an SVM with exponential X
2 kernel.
Report the 5-fold cross validation accuracy with the best tuned values of C and γ. Note that a good
default value of γ is the average of X
2 distance between training data points, as discussed in the
lecture.
6. (10 points) Use your model to predict on the test set. Save the predicted labels on a CSV file
predTestLabels.csv (see Section 4.3 for the format). The order of predicted labels should
correspond to the order of the reviews in test.mat. Submit this predTestLabels.csv file on
Kaggle and report classification accuracy in your answers.pdf file. You must use Stony Brook NetID
email to register for Kaggle.
Note: Marks for this question will be scaled according to the ranking on the Private Leaderboard.
7. (10 bonus points) We will maintain a leader board on Kaggle, and the top three entries at the end of the
competition (due date) will receive 10 bonus points. You must use your real names and Stony Brook
Net ID on Kaggle to compete for the bonus points.
To prevent exploiting test data, you are allowed to make a maximum of 3 submissions per 24 hours.
Your submission will be evaluated immediately and the leader board will be updated.
4 What to submit?
4.1 Blackboard submission
4.2 Blackboard submission
You will need to submit both your code and your answers to questions on Blackboard. Put the answer file and
your code in a folder named: SUBID FirstName LastName (e.g., 10947XXXX lionel messi). Zip this folder
and submit the zip file on Blackboard. Your submission must be a zip file, i.e, SUBID FirstName LastName.zip.
The answer file should be named: answers.pdf. The first page of the answers.pdf should be the filled cover
page at the end of this homework. The remaining of the answer file should contain: answers, derivations,
and plots for Questions 1, 2 and 3. You can use Latex if you wish, but it is not compulsory.
5
4.3 Kaggle submission
For Questions 3.4.6 and 3.4.7, you must submit a CSV file to get the classification accuracy from the competition site https://www.kaggle.com/c/cse512-hw5/overview. A submission file should contain two columns: ImgId and Prediction. The file should contain a header and have the following format.
ImgId, P rediction
1778, 2
1779, 5
... ...
A sample submission file is available from the competition site and our handout.