$29.99
Homework 4
Place your answers to Problems 1, 2, 3(typeset) in a file called writeup.pdf. Please follow the new submission instructions. Set up a group for yourself if you haven’t already or your group composition has changed. When you submit, please submit the following THREE items as separate attachments before the due date and time:
• Your writeup pdf, named writeup.pdf. • Your jupyter notebooks, saved in HTML format. If there are multiple notebooks, submit each one separately. • The zip file containing your work (code, etc.). Be sure that the datasets you use are NOT part of the zip, as this will increase the size considerably.
1 Intuitions about support vector machines (10 points)
• (5 points) We typically frame an SVM problem as trying to maximize the margin. Explain intuitively why a bigger margin will result in a model that will generalize better, or perform better in practice. • (5 points) Will moving points which are not support vectors further away from the decision boundary effect the SVM’s hinge loss?
2 Fitting an SVM classifier by hand (15 points) Consider a dataset with two points D = {(0,−1),(√2,+1)}. We will map each of these points to three dimensions using the feature vector φ(x) = (1,√2x,x2). Recall that the maximum margin classifier has the form:
min
1 2||θ||2 such that
y(1)(θTφ(x(1)) + θ0) ≥ 1 y(2)(θTφ(x(2)) + θ0) ≥ 1 for the points (x(1),y(1)) and (x(2),y(2)) in D.
1
2• (3 points) Write down a vector that is parallel to the optimal vector θ. Hint: θ is perpendicular to the decision boundary between the two points in the three-dimensional space. • (3 points) What is the value of the margin that is achieved by this θ? Hint: the margin is twice the distance from each support vector to the decision boundary. • (3 points) Solve for the θ given that the margin is equal to 2 ||θ|| . • (3 points) Solve for the intercept θ0 using your value for θ and the inequalities above. Since both points are support vectors, the inequalities will be tight. • (3 points) Write down the equation for the decision boundary in terms of θ, θ0 and x.
3 Support vector machines for binary classification (25 points) In this exercise, you will be building support vector machines for binary classification problems. To get started, please download the code base hw4.zip from Canvas. The files relevant to this problem are shown below.
Name Edit? Read? Description binary svm.ipynb Yes Yes Python notebook that will run your functions for the two class SVM svm spam.ipynb Yes Yes Python notebook that will run your functions for spam classification using the two class SVM linear svm.py Yes Yes loss functions for the SVM linear classifier.py No Yes SVM training and prediction functions utils.py Yes Yes plot utility functions, kernels data/ex4data1.mat No No Example dataset 1 data/ex4data2.mat No No Example dataset 2 data/ex4data3.mat No No Example dataset 3 data/spamTrain.mat No No Spam training set data/spamTest.mat No No Spam test set data/vocab.txt No Yes vocabulary list hw4.pdf No Yes this document
3.1: Support vector machines
In this exercise, you will build support vector machines (SVMs) for solving binary classification problems. You will experiment with your classifier on three example 2D datasets. Experimenting with these datasets will help you gain intuition into how SVMs work and how to use a Gaussian kernel with SVMs. The provided notebook binary svm.ipynb, will help
3 you step through these parts of the exercise. In the final part of the exercise, you will be using your SVM implementation to build a spam classifier. You will complete the notebook svm spam.ipynb and use your SVM classifier to classify a given spam data set.
3.1A: The hinge loss function and gradient (5 points)
Now you will implement the hinge loss cost function and its gradient for support vector machines. Complete the binary svm loss function in linear svm.py to return the cost and gradient for the hinge loss function. Recall that the hinge loss function is
J(θ) =
1 2m
d X j=0
θj2 + C m
m X i=1
max(0,1−y(i)hθ(x(i)))
where hθ(x) = θTx with x0 = 1. C is the penalty factor which measures how much misclassifications are penalized. If y(i)hθ(x(i))) ≥ 1, then x(i) is correctly classified and the loss associated with that example is zero. If y(i)hθ(x(i))) < 1, then x(i) is not within the appropriate margin (positive or negative) and the loss associated with that example is greater than zero. The gradient of the hinge loss function is a vector of the same length as θ where the jth element, j = 0,1,...,d is defined as follows:
∂J(θ) ∂θj
= ( 1 mθj + C mPm i=1−y(i)x(i) j if y(i)hθ(x(i))) < 1 1 mθj if y(i)hθ(x(i))) ≥ 1
Reflect on why the gradient of the hinge loss function has the structure above and note that we are taking the derivative of the hinge loss with respect to the jth component of θ – a vector. Once you are done, a cell in binary svm.ipynb will call your binary svm loss function with a zero vector θ. You should see that the cost J is 1.0. The gradient of the loss function with respect to an all-zeros θ vector is also computed and should be [−0.12956186−0.00167647]T.
3.1B Example dataset 1: impact of varying C (2 points)
We will begin with a 2D example dataset which can be separated by a linear boundary (shown in Figure 1). In this dataset, the positions of the positive examples (green circles) and the negative examples (indicated with red circles) suggest a natural separation indicated by the gap. However, notice that there is an outlier positive example on the far left at about (0.1, 4.1). As part of this exercise, you will also see how this outlier affects the SVM decision boundary. In this part of the exercise, you will try using different values of the C parameter with SVMs. Informally, the C parameter is a positive value that controls the penalty for misclassified training examples. A large C parameter tells the SVM to try to classify all the examples
4
−1 0 1 2 3 4 5 x1 1.0
1.5
2.0
2.5
3.0
3.5
4.0
4.5
5.0
x2
neg pos
Figure 1: Example dataset 1
correctly. C plays a role similar to 1 λ, where λ is the regularization parameter that we were using previously for logistic regression. The SVM training function is in linear classifier.py – this is a gradient descent algorithm that uses your loss and gradient functions. The next cell in ex4.ipynb will train an SVM on the example data set 1 with C = 1. It first scales the data to have zero mean and unit variance, and adds the intercept term to the data matrix. When C = 1, you should find that the SVM puts the decision boundary in the gap between the two datasets and misclassifies the data point on the far left (Figure 2). Your task is to try different values of C on this dataset. Specifically, you should change the value of C in the script to C = 100 and run the SVM training again. When C = 100, you should find that the SVM now classifies every single example correctly, but has a decision boundary that does not appear to be a natural fit for the data (Figure 3).
SVM with Gaussian kernels
In this part of the exercise, you will be using SVMs to do non-linear classification. In particular, you will be using SVMs with Gaussian kernels on datasets that are not linearly separable.
5
−3 −2 −1 0 1 2 3 x1
−3
−2
−1
0
1
2
3
x2
neg pos
Figure 2: SVM decision boundary with C = 1 for Example dataset 1
3.1C Gaussian kernel (3 points)
To find non-linear decision boundaries with the SVM, we need to first implement a Gaussian kernel. You can think of the Gaussian kernel as a similarity function that measures the distance between a pair of examples, (x(i),x(j)). The Gaussian kernel is also parameterized by a bandwidth parameter, σ, which determines how fast the similarity metric decreases (to 0) as the examples are further apart. You should now complete the function gaussian kernel in utils.py to compute the Gaussian kernel between two examples. The Gaussian kernel function is defined as: k(x(i),x(j)) = exp −||x(i) −x(j)||2 2σ2 When you have completed the function, a cell in the notebook binary svm.ipynb will test your kernel function on two provided examples and you should expect to see a value of 0.324652.
6
−3 −2 −1 0 1 2 3 x1
−3
−2
−1
0
1
2
3
x2
neg pos
Figure 3: SVM decision boundary with C = 100 for Example dataset 1
Example dataset 2: learning non-linear boundaries
The next cell in binary svm.ipynb will load and plot dataset 2 (Figure 4). From the figure, you can observe that there is no linear decision boundary that separates the positive and negative examples for this dataset. However, by using the Gaussian kernel with the SVM, you will be able to learn a non-linear decision boundary that can perform reasonably well for the dataset. If you have correctly implemented the Gaussian kernel function, a cell in the binary svm.ipynb notebook will proceed to train the SVM with the Gaussian kernel on this dataset. Read the code in this cell carefully to see how the data set is transformed and the new parameters θ<m+1 are learned, where m is the size of the training data. Figure 5 shows the decision boundary found by the SVM with C = 1 and a Gaussian kernel with σ = 0.02. The decision boundary is able to separate most of the positive and negative examples correctly and follows the contours of the dataset well.
3.2 Example dataset 3: selecting hyper parameters for SVMs (5 points)
In this part of the exercise, you will gain more practical skills on how to use a SVM with a Gaussian kernel. The next part of binary svm.ipynb will load and display a third dataset (Figure 6). In the provided dataset, ex4data3.mat, you are given the variables X, y, Xval,
7
−0.2 0.0 0.2 0.4 0.6 0.8 1.0 1.2 0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
1.1
neg pos
Figure 4: Example dataset 2
yval. You will be using the SVM with the Gaussian kernel with this dataset. Your task is to use the validation set Xval, yval to determine the best C and σ parameter to use. You should write any additional code necessary to help you search over the parameters C and σ. For both C and σ, we suggest trying values in multiplicative steps (e.g., 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30). Note that you should try all possible pairs of values for C and σ (e.g., C = 0.3 and σ = 0.1). For example, if you try each of the 8 values listed above for C and for σ, you would end up training and evaluating (on the validation set) a total of 82 = 64 different models. When selecting the best C and σ parameter to use, you train on X,y with a given C and σ, and then evaluate the error of the model on the validation set. Recall that for classification, the error is defined as the fraction of the validation examples that were classified incorrectly. You can use the predict method of the SVM classifier to generate the predictions for the validation set. After you have determined the best C and σ parameters to use, you should replace the assignments to best C and best sigma in binary svm.ipynb with the best values you find. For our best parameters, the SVM returned a decision boundary shown in Figure 7.
8
Figure 5: SVM gaussian kernel decision boundary for Example dataset 2, C = 1 and σ = 0.02
3.3 Spam Classification with SVMs (10 points)
Many email services today provide spam filters that are able to classify emails into spam and non-spam email with high accuracy. In this part of the exercise, you will use SVMs to build your own spam filter. You will be training a classifier to classify whether a given email, x, is spam or non-spam. Throughout the rest of this exercise, you will be using the notebook svm spam.ipynb. The dataset included for this exercise is based on a subset of the SpamAssassin Public Corpus. For the purpose of this exercise, you will only be using the body of the email (excluding the email headers).
Preprocessing email
Before starting on a machine learning task, it is usually insightful to take a look at examples from the dataset. Figure 8 shows a sample email that contains a URL, an email address (at the end), numbers, and dollar amounts. While many emails would contain similar types of entities (e.g., numbers, other URLs, or other email addresses), the specific entities (e.g., the specific URL or specific dollar amount) will be different in almost every email. Therefore, one method often employed in processing emails is to normalize these values, so that all URLs are treated the same, all numbers are treated the same, etc. For example, we could replace each URL in the email with the unique string ”httpaddr” to indicate that a URL was present. This has the effect of letting the spam classifier make a classification decision based on whether any URL was present, rather than whether a specific URL was present. This
9
−0.8 −0.6 −0.4 −0.2 0.0 0.2 0.4 x1 −0.8
−0.6
−0.4
−0.2
0.0
0.2
0.4
0.6
0.8
x2
neg pos
Figure 6: Example dataset 3
typically improves the performance of a spam classifier, since spammers often randomize the URLs, and thus the odds of seeing any particular URL again in a new piece of spam is very small. Here are typical email preprocessing and normalization steps:
• Lower-casing: The entire email is converted into lower case, so that capitalization is ignored (e.g., IndIcaTE is treated the same as Indicate). • Stripping HTML: All HTML tags are removed from the emails. Many emails often come with HTML formatting; we remove all the HTML tags, so that only the content remains. • Normalizing URLs: All URLs are replaced with the text httpaddr. • Normalizing Email addresses: All email addresses are replaced with the text emailaddr. • Normalizing numbers: All numbers are replaced with the text number. • Normalizing dollars: All dollar signs ($) are replaced with the text dollar.
10
Figure 7: SVM gaussian kernel decision boundary for Example dataset 3 with the best hyper parameters.
• Word stemming: Words are reduced to their stemmed form. For example, discount, discounts, discounted and discounting are all replaced with discount. Sometimes, the stemmer actually strips off additional characters from the end, so include, includes, included, and including are all replaced with includ. • Removal of non-words: Non-words and punctuation have been removed. All white spaces (tabs, newlines, spaces) have all been trimmed to a single space character.
The result of these preprocessing steps is shown in Figure 9. While preprocessing has left word fragments and non-words, this form turns out to be much easier to work with for performing feature extraction.
Feature representation
After preprocessing the emails, we have a list of words (e.g., Figure 9) for each email. The next step is to choose which words we would like to use in our classifier and which we would want to leave out.
11 Anyone knows how much it costs to host a web portal ? Well, it depends on how many visitors youre expecting. This can be anywhere from less than 10 bucks a month to a couple of $100. You should checkout http://www.rackspace.com/ or perhaps Amazon EC2 if youre running something big.. To unsubscribe yourself from this mailing list, send an email to: groupname-unsubscribe@egroups.com
Figure 8: Sample email
anyon know how much it cost to host a web portal well it depend on how mani visitor your expect thi can be anywher from less than number buck a month to a coupl of dollarnumb you should checkout httpaddr or perhap amazon ecnumb if your run someth big to unsubscrib yourself from thi mail list send an email to emailaddr
Figure 9: Sample email - preprocessed
For this exercise, we have chosen only the most frequently occurring words as our set of words considered (the vocabulary list). Since words that occur rarely in the training set are only in a few emails, they might cause the model to overfit our training set. The complete vocabulary list is in the file vocab.txt. Our vocabulary list was selected by choosing all words which occur at least a 100 times in the spam corpus, resulting in a list of 1899 words. In practice, a vocabulary list with about 10,000 to 50,000 words is often used. Given the vocabulary list, we then map each word in the preprocessed emails (e.g., Figure 9) into a list of word indices that contains the index of the word in the vocabulary list. Figure 10 shows the mapping for the sample email. Specifically, in the sample email, the word anyone was first normalized to anyon and then mapped onto the index 86 in the vocabulary list.
86 916 794 1077 883 370 1699 790 1822 1831 883 431 1171 794 1002 1893 1364 592 1676 238 162 89 688 945 1663 1120 1062 1699 375 1162 479 1893 1510 799 1182 1237 810 1895 1440 1547 181 1699 1758 1896 688 1676 992 961 1477 71 530 1699 531
Figure 10: Word indices for Sample email. This email will be represented by a feature vector of length 1899 where the values at indices in this list are set to 1 and the rest are 0.
12Training SVMs for spam classification (10 points)
svm spam.ipynb will load a training dataset spamTrain.mat which has been pre-processed according to the scheme shown above. spamTrain.mat contains 4000 training examples of spam and non-spam email, while spamTest.mat contains 1000 test examples. That is, each original email was processed and converted into a vector x ∈<1899. Your task is to design a protocol for training an SVM classifier for this data set. Issues you need to consider are: • should you scale the data matrix and if so how, • how do you set the learning rate, • the number of iterations and • the penalty parameter C? • What kind of kernel is best for this problem, and • how do you select the corresponding kernel hyper parameters? Use the SVM with the loss function binary svm loss and the training algorithm in linear classifier.py. In writeup.pdf explain how you chose the parameters for training the SVM, providing graphs and tables to support your choices. Give us your best values for all of the chosen parameters and hyper parameters. Make sure that you do not use the test set to choose these parameters. Finally, evaluate the accuracy of your model on the test set and report it. You can use the provided vocabulary list (and the function get vocab dict in utils.py) to interpret the θ associated with the model – find the words most indicative of spam or ham using the parameters of your best model. Our best classifier gets a training accuracy of about 99.8% and a test accuracy of about 98.5%.
4 Support vector machines for multi-class classification (35 points) In this exercise, you will be building support vector machines for multi-class classification problems. To get started, please download the code base hw4.zip from Canvas. The files relevant to this problem are shown below.
Name Edit? Read? Description multiclass svm.ipynb Yes Yes Python notebook to run your functions for learning an SVM classifier for CIFAR-10 data linear svm.py Yes Yes loss functions for the SVM linear classifier.py Yes Yes SVM training and prediction functions data utils.py No Yes Functions for reading CIFAR-10 data utils.py No Yes plot utility functions, kernels hw4.pdf No Yes this document
13In this exercise you will: implement • a fully-vectorized loss function for the multi-class SVM classifier, • a fully-vectorized expression for its analytic gradient, • check your implementation using numerical gradient, • use a validation set to tune the learning rate and regularization strength, • optimize the loss function with stochastic gradient descent and • visualize the final learned weights on the CIFAR-10 dataset.
Download the data
Open up a terminal window and navigate to the datasets directory of hw4. Run the get datasets.sh script. On my Mac, I just type in ./get datasets.sh at the shell prompt. A new folder called cifar 10 batches py will be created and it will contain 60,000 labeled images for training and 10,000 labeled images for testing. If you have already downloaded this data set before, then go into data utils.py and change the path to the datasets folder there. We have provided a function to read this data in data utils.py – this function also partitions the training data into a training set of 49,000 images and a validation set of 1,000 images used for selecting hyperparameters for softmax regression. Each image is a 32×32 array of RGB triples. It is preprocessed by subtracting the mean image from all images, and flatted into a 1-dimensional array of size 3072. Then a 1 is appended to the front of that vector to handle the intercept term. So the training set is a numpy matrix of size 49000×3073, the validation set is a matrix of size 1000×3073 and the set-aside test set is of size 10000×3073. We also have a random sample of 500 images from the training data to serve as a development set or dev set to test our gradient and loss function implementations. Make sure you use the version of data utils.py included with this assignment, and not a previous one.
4A: Loss and gradient function for multi-class SVM – naive version (5 points)
Your code for this section will all be written inside linear svm.py. You will write the function svm loss naive which uses for loops to evaluate the multiclass SVM loss function. The multi-class SVM loss is set up so that the score associated with the correct class for each example is higher than the score for the other (incorrect classes) by some fixed margin ∆. The Multiclass SVM loss for the ith example is formalized as follows: L(i)((x(i),y(i)),∆) = X j6=y(i) max(0,θ(j)Tx(i) −θy(i)Tx(i) + ∆)
14where θ(j) is the jth column of the θ matrix, corresponding to the jth class. Suppose that we have three classes and let θ∗x(i) be the vector [13,−7,11]. Let ∆ = 10 and the true class y(i) = 0. The expression above sums over all incorrect classes (j 6= y(i)), so we get two terms in our loss expression for this example x(i).
L(i) = max(0,−7−13 + 10) + max(0,11−13 + 10) = 0 + 8 = 8 The first term evaluates to zero since −7 − 13 + 10 is a negative number, which is then thresholded to zero with the max(0,−) function. Intuitively, we get zero for the first term because the score associated with the correct class (13) is greater than the score for class 1 (−7) by at least the margin 10. In fact, the difference is 20, which is much greater than 10, but the SVM only cares that the difference is at least 10; any additional difference above the margin is clamped at zero with the max operation. The second term evaluates to 11−13+10 = 8. That is, even though the correct class had a higher score than the incorrect class (13 11), it was not greater by the desired margin of 10. The difference was only 2, which is why the loss comes out to 8 (i.e. how much higher the difference would have to be to meet the margin). In summary, the SVM loss function wants the score of the correct class y(i) to be larger than the incorrect class scores by at least by ∆. If this is not the case, it accumulates loss. svm naive loss sums up loss over all the examples i = 1,...,m. The gradient returned from the function is right now all zero. Derive and implement the gradient for the multi-class SVM cost function and implement it in the function svm loss naive in linear svm.py. To check that you have correctly implemented the gradient, we will numerically estimate the gradient of the loss function and compare the numeric estimate to the gradient that you computed. We have provided code that does this for you in multiclass svm.ipynb. It is possible that once in a while a dimension in the gradient check will not match exactly. What could such a discrepancy be caused by? Is it a reason for concern? Hint: the SVM loss function is not, strictly speaking, differentiable.
4B: Loss and gradient function for multi-class SVM – vectorized version (10 points)
Now complete the function svm loss vectorized in linear svm.py to implement the loss function J(θ) and the gradient of the loss function without using any for loops. Re-express the computations in terms of matrix operations on X, y and θ.
Implementing mini-batch gradient descent
In large-scale applications, the training data can have millions of examples. Hence, it seems wasteful to compute the loss function over the entire training set in order to perform only a single parameter update. A very common approach to addressing this challenge is to compute the gradient over batches of the training data. For example, a typical batch contains 256
15 examples from a training set of over 1.2 million. This batch is then used to perform a parameter update: θ(k) → θ(k) −α∇θ(k)J(θ) where α is the step size or learning rate for gradient descent. We have implemented mini-batch gradient descent in the method train in linear classifier.py. You can set the verbose argument of train to be True and observe how the loss function varies with iteration number. A cell in the multiclass svm.ipynb notebook sets up a multiclass svm instance and trains it with a specific learning rate and regularization strength for 1500 iterations. This will give you an idea of how long it takes to train an SVM on this data set.
4C: Prediction function for multi-class SVM (5 points)
Write the predict function in linear classifier.py for the multiclass SVM. Then, a cell in the multiclass svm.ipynb notebook will evaluate the performance of your SVM learned in the previous step on both the training and validation set. You should expect to see about 37-39% accuracy here.
4D: Tuning hyper parameters for training a multi-class SVM (10 points)
Use the validation set to tune hyperparameters (regularization strength and learning rate). You should experiment with 4 learning rates and 4 regularization strengths; if you are careful you should be able to get a classification accuracy of about 0.38 (or just under 0.4) on the validation set. Suggested parameter values are provided in the Python notebook – feel free to experiment outside that range. Write code that chooses the best hyperparameters by tuning on the validation set. For each combination of hyperparameters, train a linear SVM on the training set, compute its accuracy on the training and validation sets, and store these numbers in the results dictionary. In addition, store the best validation accuracy in best val and the LinearSVM object that achieves this accuracy in best svm. Hint: You should use a small value for num iters as you develop your validation code so that the SVMs don’t take much time to train; once you are confident that your validation code works, you should rerun the validation code with a larger value for num iters. We have provided visualization code in the succeeding cells for you to see the variation in training and validation set accuracy with changes in regularization strength and learning rate. Then, evaluate the test set accuracy on the best svm learned. The final cell visualizes the θ associated with each of the ten class. You can save the figure in a PDF file and attach it to your writeup.pdf, or leave the best result in your notebook when you upload it on Canvas.
16 4E: Comparing the performance of multi-class SVM and softmax regression (5 points)
Compare the performance of the multi-class SVM you have built in this assignment with the softmax regression that you built for the previous one. Which approach takes longer to train, which approach achieves higher performance? Compare the visualizations of the θ parameters learned by both methods – do you see any differences? Comment on hyper parameter selection for both methods. Place your discussion with supporting graphs and plots in writeup.pdf.