CPSC 340 Assignment 2
Instructions
Rubric: {mechanics:5}
IMPORTANT!!! Before proceeding, please carefully read the general homework instructions
at https://www.cs.ubc.ca/~fwood/CS340/homework/. The above 5 points are for following the submission
instructions. You can ignore the words “mechanics”, “reasoning”, etc.
We use blue to highlight the deliverables that you must answer/do/submit with the assignment.
1 Training and Testing
If you run python main.py -q 1, it will load the citiesSmall.pkl data set from Assignment 1. Note that
this file contains not only training data, but also test data, X_test and y_test. After training a depth-2
decision tree with the information gain splitting rule, it will evaluate the performance of the classifier on
the test data. With a depth-2 decision tree, the training and test error are fairly close, so the model hasn’t
overfit much.
1.1 Training and Testing Error Curves
Rubric: {reasoning:2}
Make a plot that contains the training error and testing error as you vary the depth from 1 through 15. How
do each of these errors change with the decision tree depth?
Answer: The training error decreases monotonically as the depth increases, whereas the test error first decreases and then flattens out and fluctuates (even increasing slightly) at larger depths. The gap between the test and training error, i.e. the approximation error, grows with depth. Once the depth is large enough, the training error reaches 0 because the tree overfits, and the test error is then essentially equal to the approximation error.
[Figure: training error and test error (classification error) versus decision tree depth, for depths 1 through 15.]
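For reference, a minimal sketch of how such a curve could be produced; the module names (decision_tree, decision_stump), the constructor arguments, and the pickle keys X_test/y_test are assumptions based on the course code rather than verified file contents:

import pickle
import numpy as np
import matplotlib.pyplot as plt
from decision_tree import DecisionTree              # assumed course module
from decision_stump import DecisionStumpInfoGain    # assumed course module

with open("../data/citiesSmall.pkl", "rb") as f:    # path is an assumption
    data = pickle.load(f)
X, y = data["X"], data["y"]
X_test, y_test = data["X_test"], data["y_test"]     # key names are assumptions

depths = np.arange(1, 16)
train_errs = np.zeros(depths.size)
test_errs = np.zeros(depths.size)
for i, d in enumerate(depths):
    model = DecisionTree(max_depth=d, stump_class=DecisionStumpInfoGain)
    model.fit(X, y)
    train_errs[i] = np.mean(model.predict(X) != y)
    test_errs[i] = np.mean(model.predict(X_test) != y_test)

plt.plot(depths, train_errs, label="training error")
plt.plot(depths, test_errs, label="test error")
plt.xlabel("Depth of tree")
plt.ylabel("Classification error")
plt.legend()
plt.savefig("q1_1_errors.pdf")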
1.2 Validation Set
Rubric: {reasoning:3}
Suppose that we didn’t have an explicit test set available. In this case, we might instead use a validation
set. Split the training set into two equal-sized parts: use the first n/2 examples as a training set and the
second n/2 examples as a validation set (we’re assuming that the examples are already in a random order).
What depth of decision tree would we pick to minimize the validation set error? Does the answer change
if you switch the training and validation set? How could we use more of our data to estimate the depth more
reliably?
Answer: We should pick depth 8 (or 9) to minimize the validation set error. The answer does not change
if we switch the training and validation set. We can make fuller use of our data by trying many different
such half-splits (as in cross-validation) and averaging the results across splits before choosing a depth.
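Continuing the sketch above (same assumed DecisionTree class and loaded arrays), a half-split validation for choosing the depth might look like this:

n = X.shape[0]
X_tr, y_tr = X[: n // 2], y[: n // 2]     # first half for training
X_va, y_va = X[n // 2 :], y[n // 2 :]     # second half for validation

val_errs = []
for d in range(1, 16):
    model = DecisionTree(max_depth=d, stump_class=DecisionStumpInfoGain)
    model.fit(X_tr, y_tr)
    val_errs.append(np.mean(model.predict(X_va) != y_va))

best_depth = 1 + int(np.argmin(val_errs))
# Swapping the roles of the two halves (or averaging over several splits, as in
# cross-validation) uses the data more fully before committing to a depth.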
2 Naive Bayes
In this section we’ll implement naive Bayes, a very fast classification method that is often surprisingly
accurate for text data with simple representations like bag of words.
2.1 Naive Bayes by Hand
Consider the dataset below, which has 10 training examples and 3 features:
\[
X = \begin{bmatrix}
0 & 0 & 1\\
0 & 1 & 1\\
0 & 1 & 1\\
1 & 1 & 0\\
0 & 1 & 0\\
0 & 1 & 1\\
1 & 0 & 0\\
1 & 1 & 0\\
1 & 0 & 1\\
1 & 0 & 0
\end{bmatrix}, \quad
y = \begin{bmatrix}
\text{spam}\\
\text{spam}\\
\text{spam}\\
\text{spam}\\
\text{spam}\\
\text{spam}\\
\text{not spam}\\
\text{not spam}\\
\text{not spam}\\
\text{not spam}
\end{bmatrix}.
\]
The feature in the first column is “<your name>” (whether the e-mail contained your name), in the second
column is “pharmaceutical” (whether the e-mail contained this word), and the third column is “PayPal”
(whether the e-mail contained this word). Suppose you believe that a naive Bayes model would be appropriate
for this dataset, and you want to classify the following test example:
\[
\hat{x} = \begin{bmatrix} 1 & 1 & 0 \end{bmatrix}.
\]
2.1.1 Prior probabilities
Rubric: {reasoning:1} Compute the estimates of the class prior probabilities (you don’t need to show any
work):
• p(spam). Answer: 0.6
• p(not spam). Answer: 0.4
2.1.2 Conditional probabilities
Rubric: {reasoning:1}
Compute the estimates of the 6 conditional probabilities required by naive Bayes for this example (you don’t
need to show any work):
• p(<your name> = 1 | spam). Answer: 1/6
• p(pharmaceutical = 1 | spam). Answer: 5/6
• p(PayPal = 0 | spam). Answer: 1/3
• p(<your name> = 1 | not spam). Answer: 1
• p(pharmaceutical = 1 | not spam). Answer: 1/4
• p(PayPal = 0 | not spam). Answer: 3/4
2.1.3 Prediction
Rubric: {reasoning:1}
Under the naive Bayes model and your estimates of the above probabilities, what is the most likely label for
the test example? (Show your work.)
Answer: For “spam”,
\[
p(\text{<your name>} = 1, \text{pharmaceutical} = 1, \text{PayPal} = 0 \mid \text{spam})\,p(\text{spam})
= p(\text{<your name>} = 1 \mid \text{spam})\,p(\text{pharmaceutical} = 1 \mid \text{spam})\,p(\text{PayPal} = 0 \mid \text{spam})\,p(\text{spam})
= \tfrac{1}{6} \times \tfrac{5}{6} \times \tfrac{1}{3} \times 0.6 = \tfrac{1}{36} \approx 0.028.
\]
For “not spam”,
\[
p(\text{<your name>} = 1, \text{pharmaceutical} = 1, \text{PayPal} = 0 \mid \text{not spam})\,p(\text{not spam})
= 1 \times \tfrac{1}{4} \times \tfrac{3}{4} \times 0.4 = \tfrac{3}{40} = 0.075.
\]
Since 3/40 > 1/36, “not spam” is the most likely label for the test example.
2.1.4 Laplace smoothing
Rubric: {reasoning:2}
One way to think of Laplace smoothing is that you’re augmenting the training set with extra counts. Consider
the estimates of the conditional probabilities in this dataset when we use Laplace smoothing (with β = 1).
Give a set of extra training examples that we could add to the original training set that would make the
basic estimates give us the estimates with Laplace smoothing (in other words give a set of extra training
examples that, if they were included in the training set and we didn’t use Laplace smoothing, would give
the same estimates of the conditional probabilities as using the original dataset with Laplace smoothing).
Present your answer in a reasonably easy-to-read format, for example the same format as the data set at
the start of this question.
Answer: We can add the following examples to our dataset. This has the same effect as Laplace smoothing with β = 1, because each class then gets exactly one extra occurrence of each value of every feature, turning each plain estimate of the form (count)/(#examples in the class) into (count + 1)/(#examples in the class + 2):
\[
X = \begin{bmatrix}
1 & 0 & 1\\
0 & 1 & 0\\
1 & 0 & 1\\
0 & 1 & 0
\end{bmatrix}, \quad
y = \begin{bmatrix}
\text{spam}\\
\text{spam}\\
\text{not spam}\\
\text{not spam}
\end{bmatrix}.
\]
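As a quick check on one conditional probability: with Laplace smoothing (β = 1) on the original data,
\[
p(\text{pharmaceutical} = 1 \mid \text{spam}) = \frac{5 + 1}{6 + 2} = \frac{3}{4},
\]
and with the two extra spam examples added (and no smoothing) there are 6 occurrences of pharmaceutical = 1 among 8 spam examples, which gives the same estimate of 3/4.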
2.2 Bag of Words
Rubric: {reasoning:3}
If you run python main.py -q 2.2, it will load the following dataset:
1. X: A binary matrix. Each row corresponds to a newsgroup post, and each column corresponds to
whether a particular word was used in the post. A value of 1 means that the word occurred in the post.
2. wordlist: The set of words that correspond to each column.
3. y: A vector with values 0 through 3, with the value corresponding to the newsgroup that the post
came from.
4. groupnames: The names of the four newsgroups.
5. Xvalidate and yvalidate: the word lists and newsgroup labels for additional newsgroup posts.
Answer the following:
1. Which word corresponds to column 51 of X? (This is column 50 in Python.) Answer: lunar
2. Which words are present in training example 501? Answer: car, fact, gun, video
3. Which newsgroup name does training example 501 come from? Answer: talk.*
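These values can be read off programmatically; a minimal sketch, assuming X, y, wordlist and groupnames are loaded by main.py -q 2.2 exactly as described above and that wordlist supports list-style iteration:

import numpy as np

print(wordlist[50])                          # word for column 51 (index 50 in Python)
present = np.asarray(X[500]) > 0             # training example 501 is row 500
print([w for w, p in zip(wordlist, present) if p])
print(groupnames[y[500]])                    # newsgroup of training example 501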
2.3 Naive Bayes Implementation
Rubric: {code:5}
If you run python main.py -q 2.3 it will load the newsgroups dataset, fit a basic naive Bayes model and
report the validation error.
The predict() function of the naive Bayes classifier is already implemented. However, in fit() the calculation of the variable p_xy is incorrect (right now, it just sets all values to 1/2). Modify this function so that
p_xy correctly contains the conditional probabilities of these values based on the frequencies in the data
set. Include your code and the validation error that you obtain in your pdf GradeScope submission. Also,
compare your validation error to what you obtain with scikit-learn’s implementation, BernoulliNB.
Answer: The validation error obtained with the code below is 0.188, and the validation error obtained with
scikit-learn’s implementation is 0.187.
def fit(self, X, y):
    N, D = X.shape

    # Number of class labels
    C = self.num_classes

    # Compute the probability of each class, i.e. p(y == c)
    counts = np.zeros(C)
    for c in range(C):
        counts[c] = np.count_nonzero(y == c)
    p_y = counts / N

    # Compute the conditional probabilities, i.e.
    #   p(x(i, j) = 1 | y(i) == c), stored as p_xy[j, c]
    p_xy = np.zeros([D, C])
    for c in range(C):
        for d in range(D):
            count = 0
            for n in range(N):
                if X[n, d] == 1 and y[n] == c:
                    count += 1
            p_xy[d, c] = count / counts[c]

    self.p_y = p_y
    self.p_xy = p_xy
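For the scikit-learn comparison mentioned above, a minimal sketch; X, y, Xvalidate and yvalidate are assumed to be the arrays loaded by main.py for this question:

import numpy as np
from sklearn.naive_bayes import BernoulliNB

model = BernoulliNB()
model.fit(X, y)
y_pred = model.predict(Xvalidate)
print("BernoulliNB validation error: %.3f" % np.mean(y_pred != yvalidate))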
2.4 Runtime of Naive Bayes for Discrete Data
Rubric: {reasoning:3}
For a given training example i, the predict function in the provided code computes the quantity
\[
p(y_i \mid x_i) \propto p(y_i) \prod_{j=1}^{d} p(x_{ij} \mid y_i),
\]
for each class yi (and where the proportionality constant is not relevant). For many problems, a lot of the
p(xij | yi) values may be very small. This can cause the above product to underflow. The standard fix for
this is to compute the logarithm of this quantity and use that log(ab) = log(a) + log(b),
\[
\log p(y_i \mid x_i) = \log p(y_i) + \sum_{j=1}^{d} \log p(x_{ij} \mid y_i) + (\text{irrelevant proportionality constant}).
\]
This turns the multiplications into additions and thus typically would not underflow.
Assume you have the following setup:
• The training set has n objects each with d features.
• The test set has t objects with d features.
• Each feature can have up to c discrete values (you can assume c ≤ n).
• There are k class labels (you can assume k ≤ n).
You can implement the training phase of a naive Bayes classifier in this setup in O(nd), since you only need
to do a constant amount of work for each X(i, j) value. (You do not have to actually implement it in this
way for the previous question, but you should think about how this could be done.) What is the cost of
classifying t test examples with the model and this way of computing the predictions?
Answer: The cost of classifying t test examples in this way is O(tdk): for each of the t test examples we compute a score for each of the k classes, and each score is a sum of d log conditional probabilities (plus the log-prior), so the number of discrete values c does not enter the prediction cost.
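To see where the O(tdk) comes from, here is a hedged sketch of log-space prediction (with p_y and p_xy as produced by the fit code above): for each of the t test examples we compute k class scores, and each score is a sum over d features.

import numpy as np

def predict_log(Xtest, p_y, p_xy):
    # p_y[c] = p(y == c); p_xy[j, c] = p(x_j = 1 | y == c)
    t, d = Xtest.shape
    k = p_y.shape[0]
    y_pred = np.zeros(t, dtype=int)
    for i in range(t):                        # t test examples
        scores = np.log(p_y)                  # k class scores
        for c in range(k):
            for j in range(d):                # d features, so O(tdk) overall
                if Xtest[i, j] == 1:
                    scores[c] += np.log(p_xy[j, c])
                else:
                    scores[c] += np.log(1 - p_xy[j, c])
        y_pred[i] = int(np.argmax(scores))
    return y_pred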
3 K-Nearest Neighbours
Rubric: {code:3, reasoning:4}
In the citiesSmall dataset, nearby points tend to receive the same class label because they are part of the
same U.S. state. For this problem, perhaps a k-nearest neighbours classifier might be a better choice than
a decision tree. The file knn.py has implemented the training function for a k-nearest neighbour classifier
(which is to just memorize the data).
Fill in the predict function in knn.py so that the model file implements the k-nearest neighbour prediction
rule. You should use Euclidean distance, and you may find numpy’s sort and/or argsort functions useful. You can also
use utils.euclidean_dist_squared, which computes the squared Euclidean distances between all pairs of
points in two matrices.
1. Write the predict function. Include this code in your GradeScope submission.
2. Report the training and test error obtained on the citiesSmall dataset for k = 1, k = 3, and k = 10.
How do these numbers compare to what you got with the decision tree?
3. Include the plot generated by utils.plotClassifier on the citiesSmall dataset for k = 1, using both
your implementation of KNN and the KNeighborsClassifier from scikit-learn.
4. Why is the training error 0 for k = 1?
5. If you didn’t have an explicit test set, how would you choose k?
1. Answer: The code is given below
def predict(self, Xtest):
    X = self.X
    y = self.y
    k = self.k

    # Squared Euclidean distances between every training point and every test point
    distances = utils.euclidean_dist_squared(X, Xtest)
    N, T = distances.shape

    y_pred = np.zeros(T)
    for t in range(T):
        # Training labels sorted by distance to the t-th test point
        y_near = y[distances[:, t].argsort()]
        # Most common label among the k nearest neighbours
        y_pred[t] = utils.mode(y_near[:k])

    return y_pred
2. Answer: The training and test errors for KNN and decision tree are tabulated below (in the specified
order):
KNN:
k                1       3       10
Training error   0.000   0.028   0.072
Test error       0.065   0.066   0.097

Decision tree:
Depth            1       3       10
Training error   0.325   0.150   0.000
Test error       0.332   0.1745  0.0835
One can see that the training error for KNN increases from 0.000 as k grows, whereas the training error for
the decision tree decreases to 0.000 as the depth grows, reflecting increasing overfitting with depth.
3. Answer: The plots for our KNN and sklearn’s KNN for k = 1 are given below
[Figures: classification boundaries on the citiesSmall dataset for k = 1, produced by our KNN implementation and by scikit-learn’s KNeighborsClassifier; each plot shows class 0 and class 1.]
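A minimal sketch of how these plots could be produced, with X, y the citiesSmall training data; the KNN constructor signature and the exact behaviour of utils.plotClassifier are assumptions based on how they are used elsewhere in the course code:

import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from knn import KNN       # assumed course module
import utils              # assumed course module

model = KNN(k=1)
model.fit(X, y)
utils.plotClassifier(model, X, y)
plt.savefig("q3_knn_k1.pdf")

sk_model = KNeighborsClassifier(n_neighbors=1)
sk_model.fit(X, y)
utils.plotClassifier(sk_model, X, y)
plt.savefig("q3_sklearn_knn_k1.pdf")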
4. Answer: The training error for k = 1 is 0 because, when predicting on the training set, each point’s
single nearest neighbour is the point itself (at distance 0), so every training point is assigned its own
label. This gives 100% training accuracy and hence 0 training error.
4 Random Forests
4.1 Implementation
Rubric: {code:4,reasoning:3}
The file vowels.pkl contains a supervised learning dataset where we are trying to predict which of the 11
“steady-state” English vowels that a speaker is trying to pronounce.
You are provided with a RandomStump class that differs from DecisionStumpInfoGain in that it only considers
$\lfloor \sqrt{d} \rfloor$ randomly-chosen features.1 You are also provided with a RandomTree class that is exactly the
same as DecisionTree except that it uses RandomStump instead of DecisionStump and it takes a bootstrap
sample of the data before fitting. In other words, RandomTree is the entity we discussed in class, which
makes up a random forest.
If you run python main.py -q 4 it will fit a deep DecisionTree using the information gain splitting criterion. You will notice that the model overfits badly.
1. Why doesn’t the random tree model have a training error of 0?
2. Create a class RandomForest in a file called random_forest.py that takes in hyperparameters num_trees
and max_depth and fits num_trees random trees each with maximum depth max_depth. For prediction, have all trees predict and then take the mode. Make sure to include the code you have written
in your pdf GradeScope submission.
3. Using 50 trees, and a max depth of ∞, report the training and testing error. Compare this to what we
got with a single DecisionTree and with a single RandomTree. Are the results what you expected?
Discuss.
4. Compare your implementation with scikit-learn’s RandomForestClassifier for both speed and accuracy, and briefly discuss. You can use all default hyperparameters if you wish, or you can try changing
them.
1The notation bxc means the “floor” of x, or “x rounded down”. You can compute this with np.floor(x) or math.floor(x).
Answer: With 50 trees, our implementation took 9.013075 seconds, with training error 0.000 and testing error
0.178. Scikit-learn’s RandomForestClassifier took 0.068054 seconds, with training error 0.000 and testing error
0.167. The same trends hold for 100 trees. Although the accuracies are similar, scikit-learn’s implementation
is much faster than ours.
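For item 2, a minimal sketch of the RandomForest class described above, assuming the course’s RandomTree class (taking max_depth) and utils.mode are importable as shown; this is a sketch of the approach, not an official solution:

# random_forest.py
import numpy as np
from random_tree import RandomTree    # assumed course module
import utils                          # assumed to provide mode()

class RandomForest:
    def __init__(self, num_trees, max_depth):
        self.num_trees = num_trees
        self.max_depth = max_depth

    def fit(self, X, y):
        # Fit num_trees random trees; each RandomTree bootstraps the data itself
        self.trees = []
        for _ in range(self.num_trees):
            tree = RandomTree(max_depth=self.max_depth)
            tree.fit(X, y)
            self.trees.append(tree)

    def predict(self, X):
        # Each tree votes; the forest predicts the mode of the votes for each example
        votes = np.array([tree.predict(X) for tree in self.trees])
        return np.array([utils.mode(votes[:, i]) for i in range(X.shape[0])])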
5 Clustering
If you run python main.py -q 5, it will load a dataset with two features and a very obvious clustering
structure. It will then apply the k-means algorithm with a random initialization. The result of applying the
algorithm will thus depend on the randomization, but a typical run might look like this:
(Note that the colours are arbitrary – this is the label switching issue.) But the ‘correct’ clustering (that
was used to make the data) is this:
5.1 Selecting among k-means Initializations
Rubric: {reasoning:5}
If you run the demo several times, it will find different clusterings. To select among clusterings for a fixed
value of k, one strategy is to minimize the sum of squared distances between examples xi and their means
$w_{y_i}$,
\[
f(w_1, w_2, \dots, w_k, y_1, y_2, \dots, y_n) = \sum_{i=1}^{n} \|x_i - w_{y_i}\|_2^2 = \sum_{i=1}^{n} \sum_{j=1}^{d} (x_{ij} - w_{y_i j})^2,
\]
where $y_i$ is the index of the closest mean to $x_i$. This is a natural criterion because the steps of k-means
alternately optimize this objective function in terms of the $w_c$ and the $y_i$ values.
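A minimal sketch of how this objective could be computed inside kmeans.py (item 1 below asks for exactly such an error function); the attribute name self.means and the availability of numpy and utils inside kmeans.py are assumptions:

def error(self, X):
    # Assign each example to its closest mean, then sum the squared distances
    means = self.means
    dist2 = utils.euclidean_dist_squared(X, means)   # n-by-k squared distances
    y = np.argmin(dist2, axis=1)                     # index of the closest mean
    return np.sum((X - means[y]) ** 2)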
1. In the kmeans.py file, add a new function called error that takes the same input as the predict
function but that returns the value of this above objective function. Make sure to include the code
you have written in your pdf GradeScope submission.
2. What trend do you observe if you print the value of this error after each iteration of the k-means
algorithm?
3. Using the code from question 5 in main.py (modify if needed), output the clustering obtained by
running k-means 50 times (with k = 4) and taking the one with the lowest error. Submit your plot.
4. Looking at the hyperparameters of scikit-learn’s KMeans, explain the first four (n_clusters, init,
n_init, max_iter) very briefly.
5.2 Selecting k in k-means
Rubric: {reasoning:5}
We now turn to the task of choosing the number of clusters k.
1. Explain why we should not choose k by taking the value that minimizes the error function.
2. Explain why even evaluating the error function on test data still wouldn’t be a suitable approach to
choosing k.
3. Hand in a plot of the minimum error found across 50 random initializations, as a function of k, taking
k from 1 to 10.
4. The elbow method for choosing k consists of looking at the above plot and visually trying to choose
the k that makes the sharpest “elbow” (the biggest change in slope). What values of k might be
reasonable according to this method? Note: there is not a single correct answer here; it is somewhat
open to interpretation and there is a range of reasonable answers.
5.3 Density-Based Clustering
Rubric: {reasoning:2}
If you run python main.py -q 5.3, it will apply the basic density-based clustering algorithm to the dataset
from the previous part, but with some outliers added. The final output should look somewhat like this:
(The right plot is zoomed in to show the non-outlier part of the data.) Even though we know that each
object was generated from one of four clusters (and we have 4 outliers), the algorithm finds 6 clusters and
does not assign some of the original non-outlier objects to any cluster. However, the clusters will change if
we change the parameters of the algorithm. Find and report values for the two parameters, eps (which we
called the “radius” in class) and minPts, such that the density-based clustering method finds:
1. The 4 “true” clusters.
2. 3 clusters (merging the top two, which also seems like a reasonable interpretation).
3. 2 clusters.
4. 1 cluster (consisting of the non-outlier points).
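As scaffolding for this search (no parameter values are suggested here), a hedged sketch using scikit-learn’s DBSCAN, where min_samples plays the role of minPts; whether main.py uses this implementation or a course version is an assumption, and the grids below are placeholders to adjust:

from sklearn.cluster import DBSCAN

# X is assumed to be the dataset (with outliers) loaded by main.py -q 5.3.
# A label of -1 means the point was not assigned to any cluster (an outlier).
for eps in [1, 2, 4, 8, 16]:               # placeholder radius values
    for min_samples in [2, 3, 5, 10]:      # placeholder minPts values
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        n_clusters = len(set(labels) - {-1})
        print("eps=%s, min_samples=%s: %d clusters" % (eps, min_samples, n_clusters))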
6 Very-Short Answer Questions
Rubric: {reasoning:13}
Write a short one or two sentence answer to each of the questions below. Make sure your answer is clear
and concise.
1. What is an advantage of using a boxplot to visualize data rather than just computing its mean and
variance?
2. What is a reason that the data may not be IID in the email spam filtering example from lecture?
3. What is the difference between a validation set and a test set?
4. Why can’t we (typically) use the training error to select a hyper-parameter?
5. What is the effect of n on the optimization bias (assuming we use a parametric model)?
6. What is an advantage and a disadvantage of using a large k value in k-fold cross-validation?
7. Why can we ignore p(xi) when we use naive Bayes?
8. For each of the three values below in a naive Bayes model, say whether it’s a parameter or a hyperparameter:
(a) Our estimate of p(yi) for some yi.
(b) Our estimate of p(xij | yi) for some xij and yi.
(c) The value β in Laplace smoothing.
9. What is the effect of k in KNN on the two parts (training error and approximation error) of the
fundamental trade-off. Hint: think about the extreme values.
10. Suppose we want to classify whether segments of raw audio represent words or not. What is an easy
way to make our classifier invariant to small translations of the raw audio?
11. Both supervised learning and clustering models take in an input xi and produce a label yi. What is
the key difference?
12. Suppose you chose k in k-means clustering (using the squared distances to examples) from a validation
set instead of a training set. Would this work better than using the training set (which just chooses
the largest value of k)?
13. In k-means clustering the clusters are guaranteed to be convex regions. Are the areas that are given
the same label by KNN also convex?