1 Programming (50 points)
1.1 Pegasos (35 points)
You will implement a version of SVM called Pegasos. We'll use the same coding framework that we used in homework 1. Pegasos is a simple but effective stochastic gradient descent algorithm to solve a Support Vector Machine for binary classification problems. The approach is called Primal Estimated sub-GrAdient SOlver for SVM (Pegasos) [1]. The number of iterations required by Pegasos to obtain a solution of accuracy $\epsilon$ is $O(1/\epsilon)$, while previous stochastic gradient descent methods require $O(1/\epsilon^2)$.
1.1.1 SVM
A Support Vector Machine constructs a hyperplane in high-dimensional space which separates training points of different classes while keeping a large margin with respect to the training points closest to the hyperplane. SVMs are typically formulated as a constrained quadratic programming problem. We can also reformulate this as an equivalent unconstrained problem of empirical loss plus a regularization term. Formally, given a training set $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^M$ and $y_i \in \{+1, -1\}$, we want to find the minimizer of the problem:
$$\min_w \; \frac{\lambda}{2}\|w\|^2 + \frac{1}{N}\sum_{i=1}^{N} \ell(w; (x_i, y_i)) \qquad (1)$$
where $\lambda$ is the regularization parameter,
$$\ell(w; (x, y)) = \max\{0,\, 1 - y\langle w, x\rangle\} \qquad (2)$$
and $\langle \cdot, \cdot \rangle$ is the inner product operator. As you might have noticed, we omit the parameter for the bias term throughout this homework (i.e., the hyperplane always passes through the origin).
[1] Shalev-Shwartz, S., Singer, Y., Srebro, N., & Cotter, A. (2011). Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1), 3-30.
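To make Eq. 1 and Eq. 2 concrete, here is a minimal sketch of how the hinge loss and the regularized objective could be computed with NumPy. It is purely illustrative; the function and variable names (hinge_loss, svm_objective, lam) are not part of the required interface.

import numpy as np

def hinge_loss(w, x, y):
    # Eq. 2: l(w; (x, y)) = max{0, 1 - y * <w, x>}
    return max(0.0, 1.0 - y * np.dot(w, x))

def svm_objective(w, X, Y, lam):
    # Eq. 1: (lambda / 2) * ||w||^2 + (1 / N) * sum_i l(w; (x_i, y_i))
    regularizer = 0.5 * lam * np.dot(w, w)
    average_loss = np.mean([hinge_loss(w, x, y) for x, y in zip(X, Y)])
    return regularizer + average_loss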
1.1.2 Pegasos
Pegasos performs stochastic gradient descent on the primal objective in Eq. 1. The learning rate decreases with each iteration to guarantee convergence. We adopt a simple strategy for decreasing the learning rate: we make the learning rate a function of the time step. On each time step Pegasos operates as follows. Initially, we set $w_0 = 0$. On time step t of the algorithm (starting from t = 1) we first choose a random training example $(x_{i_t}, y_{i_t})$ by picking an index $i_t \in \{1, \dots, N\}$ uniformly at random (although for your implementation, you should not select the example randomly, but rather iterate through the examples in order; for further details, see Section 1.1.6). We then replace the objective in Eq. 1 with an approximation based on the training example $(x_{i_t}, y_{i_t})$, yielding:
$$f(w; i_t) = \frac{\lambda}{2}\|w\|^2 + \ell(w; (x_{i_t}, y_{i_t})) \qquad (3)$$
We consider the sub-gradient of the above approximate objective, given by:
$$\frac{\partial f(w; i_t)}{\partial w_t} = \lambda w_t - \mathbb{1}[y_{i_t}\langle w_t, x_{i_t}\rangle < 1]\, y_{i_t} x_{i_t} \qquad (4)$$
where $\mathbb{1}[\cdot]$ is the indicator function which takes a value of 1 if its argument is true, and 0 otherwise. We then update $w_{t+1} \leftarrow w_t - \eta_t \frac{\partial f(w; i_t)}{\partial w_t}$ using a step size of $\eta_t = 1/(\lambda t)$. Note that this update can be written as:
$$w_{t+1} \leftarrow \left(1 - \frac{1}{t}\right) w_t + \frac{1}{\lambda t}\, \mathbb{1}[y_{i_t}\langle w_t, x_{i_t}\rangle < 1]\, y_{i_t} x_{i_t} \qquad (5)$$
After a predetermined number of iterations T, we output the last iterate $w_T$. To be clear, a single iteration t should be a single example. So, you should be updating the weights after seeing a single example, not in batch mode.
You will note that the above update rule changes w even for cases where the value in $x_i$ is 0. This is a non-sparse update: every feature is updated every time. While there are sparse versions of Pegasos, for this homework you should implement the non-sparse solution. This means that every time you update, every parameter must be updated. This will make training slower on larger feature sets (such as NLP) but it should still be reasonable. We suggest you focus testing on the smaller and thus faster datasets.
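For reference, a single time step of the non-sparse update in Eq. 5 could look roughly like the following sketch. The names (pegasos_update, lam) are illustrative only; this is not the required structure of your code.

import numpy as np

def pegasos_update(w, x, y, t, lam):
    # One Pegasos time step (Eq. 5) on a single example (x, y), with t >= 1.
    margin = y * np.dot(w, x)        # y_{i_t} <w_t, x_{i_t}>, computed with the current weights
    w_next = (1.0 - 1.0 / t) * w     # shrink every coordinate: this is the non-sparse part
    if margin < 1:                   # indicator 1[y <w_t, x_{i_t}> < 1]
        w_next = w_next + (y / (lam * t)) * x   # add eta_t * y * x, with eta_t = 1 / (lambda * t)
    return w_next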
1.1.3 Offset Feature
None of the math above mentions an offset feature (bias feature) $w_0$ that corresponds to an $x_{i,0}$ that is always 1. It turns out that we don't need this if our data is centered. By centered we mean that E[y] = 0. For simplicity, assume that the data used in this assignment is centered (even though this may not be true). Do not include another feature that is always 1 ($x_0$) or a weight ($w_0$) for it.
1.1.4 Convergence
In practice, SGD optimization requires the program to determine when it has converged. Ideally, the gradient is exactly 0 at an optimum, but due to issues related to your step size, random noise, and machine precision, your gradient will likely never be exactly zero. Common practice is to check that the $L_p$ norm of the gradient is less than some $\delta$, for some p. For the sake of simplicity and consistent results, we will not do this in this assignment. Instead, your program should take a parameter --online-training-iterations, which is exactly how many iterations you should run (not an upper bound). An iteration is a single pass over every training example. Note that the "t" in Section 1.1 indexes a time step, which is different from the "iterations" defined here. The default of --online-training-iterations should be 5. Since you added this argument in the previous assignment, no further changes should be needed.
1.1.5 Regularization Parameter
Pegasos uses a regularization parameter $\lambda$ to adjust the relative strength of regularization. Your default value for $\lambda$ should be $10^{-4}$. You must add a command line argument to allow this value to be adjusted via the command line. Add this command line option by adding the following code to the get_args function in classify.py.
parser.add_argument("--pegasos-lambda", type=float, help="The regularization parameter for Pegasos.", default=1e-4)
Be sure to add the option name exactly as it appears above. You can then use the value read from the command line in your main method by referencing it as args.pegasos_lambda. Note that the dashes have been replaced by underscores.
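For example, a sketch of reading this value in your main method (get_args here refers to the existing argument-parsing function in classify.py):

args = get_args()            # parses the command line, including --pegasos-lambda
lam = args.pegasos_lambda    # the dashes in the option name become underscores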
1.1.6 Sampling Instances
Pegasos relies on sampling examples for each time step. For simplicity, we will NOT randomly sample instances. Instead, on round t you should update using the tth example. An iteration involves a single pass in order through all the provided instances. You will make several passes through the data based on --online-training-iterations.
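Putting the pieces together, the overall schedule might look like this sketch, assuming the pegasos_update function from the earlier sketch. Here t keeps increasing across passes, which is one natural reading of the time-step definition above; how you structure your own code is up to you, as long as t advances by one per example and the data is visited in order.

import numpy as np

def train_pegasos(examples, labels, lam, num_iterations):
    # examples: list of feature vectors; labels: values in {+1, -1}
    w = np.zeros(len(examples[0]))
    t = 1                                        # time step, incremented once per example
    for _ in range(num_iterations):              # --online-training-iterations passes
        for x, y in zip(examples, labels):       # in the order given; no shuffling, no sampling
            w = pegasos_update(w, x, y, t, lam)  # per-example, non-sparse update (Eq. 5)
            t += 1
    return w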
1.2 Implementation Notes
1. Many descriptions of SGD call for shuffling your data before learning. This is a good idea to break any dependence in the order of your data. In order to achieve consistent results for everyone, we are requiring that you do not shuffle your data. When you are training, you will go through your data in the order it appeared in the data file.
2. In an SVM, $y \in \{+1, -1\}$, but the class labels in our data files are 0/1 valued. To resolve this inconsistency, convert class label 0 to $-1$ before training.
3. During prediction, a new instance x is predicted by the following rule: $\hat{y}_{\text{new}} = 1$ if $\langle w, x \rangle \geq 0$, and $\hat{y}_{\text{new}} = 0$ otherwise.
4. Initialize the parameters w to 0.
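A short sketch of notes 2 and 3 above (illustrative names only, not a required interface):

import numpy as np

def to_plus_minus_one(label):
    # note 2: class label 0 in the data files becomes -1 for training; label 1 stays +1
    return 1 if label == 1 else -1

def predict(w, x):
    # note 3: predict 1 if <w, x> >= 0, else 0 (predictions must be 0/1 valued)
    return 1 if np.dot(w, x) >= 0 else 0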
1.2.1 Deliverables
You need to implement the Pegasos algorithm. Your Pegasos predictor will be selected by passing the string pegasos as the argument for the algorithm parameter.
1.2.2 How Your Code Will Be Called
To train a model we will call:
python classify.py --mode train --algorithm pegasos \
    --model-file speech.pegasos.model \
    --data speech.train
There are some additional parameters which your program must support during training:
--pegasos-lambda lambda // sets the regularization constant, default = 1e-4
--online-training-iterations t // sets the number of SGD iterations, default = 5
All of these parameters are optional. If they are not present, they should be set to their default values.
To make predictions using a model we will call:
python classify.py --mode test --algorithm pegasos \
    --model-file speech.pegasos.model \
    --data speech.test \
    --predictions-file speech.test.predictions
Remember that your output should be 0/1 valued, not real valued.
1.3 Jupyter Notebook - SVMs (15 points)
Complete the provided Python Jupyter notebook template. In the notebook, you'll explore different settings for SVMs, including the effect of the C (slack) penalty term and different types of kernels. You'll also see how the decision boundary differs from that of logistic regression.
1.3.1 Completing the Notebook
The notebook contains detailed directions for how to complete it. We'll use the sklearn implementations of Logistic Regression and SVM, so you don't need to reference your implementation. See http://scikit-learn.org/stable/ for more details. We have marked the cells that need to be changed to receive credit with TODO comments. We have also added Markdown cells with the title "Notebook Question N: ...". For each, please answer the question in the LaTeX template we provide for the Analytical questions; there will be boxes provided at the end with the title "Notebook Question N", where N is a specific question number. You will use the graphs generated by the notebook to answer these questions, but you don't have to turn in the actual graphs. Some of the plots are a bit slow to generate, as they are training SVM models. However, they should appear after waiting a bit.
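For orientation, the sklearn calls at the heart of the notebook look roughly like the sketch below. The toy data is only to make the snippet self-contained; the notebook supplies its own data, plotting code, and parameter settings.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Toy 2-feature dataset purely to illustrate the API.
X, y = make_classification(n_samples=200, n_features=2, n_redundant=0, random_state=0)

log_reg = LogisticRegression().fit(X, y)       # logistic regression decision boundary
svm_rbf = SVC(C=1.0, kernel="rbf").fit(X, y)   # SVM: the notebook varies C and the kernel
print(log_reg.score(X, y), svm_rbf.score(X, y))  # training accuracy of each model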
2 Analytical (50 points)
1) Decision Trees (9 points) Consider the classification task where you want to predict y given $x_1, x_2, x_3, x_4$.
x1  x2  x3  x4  y
1   0   0   0   1
1   0   1   0   1
0   1   0   1   1
0   0   0   1   0
1   1   1   0   0
1   0   1   1   0
1   0   0   1   0
0   1   0   0   0
(1) Construct a decision tree based on the above training examples following the algorithm we specified in class (using information gain). Set the maximum depth of the tree to 2, and use the base case of all training examples at a node having the same label. You may use each feature a maximum of one time within each branch of the tree. In the decision tree plot below, specify the correct feature for nodes A, B and C. For D, E, F, and G, specify the correct label. Place the correct feature or label in the corresponding box in the answer template.
Figure 1: Decision tree
(2) Apply the decision tree you have learned to classify the following new data points. List your classification for each of the points.
Example  x1  x2  x3  x4  y
1        0   1   1   1   ?
2        1   1   0   1   ?
3        1   1   0   0   ?
(3) For the given dataset, can a logistic regression model obtain a training accuracy of 100%? If so, please write out the parameters of the logistic model. If not, please explain why not.
2) Loss function (9 points) We have discussed several loss functions for classification problems: logistic loss, 0-1 loss, and hinge loss. Consider whether each of the following
statements is true for one or more of the three loss functions listed (consider no regularization added to the loss function). If true, check the box corresponding to the algorithm for which the statement is true. Alternatively, check the box corresponding to "None" if none of the three functions applies. Please answer each of the following statements for each of the three loss functions.
(1) We can use gradient descent methods to minimize the loss.
(2) This loss is sensitive to outliers, meaning it will give a larger penalty for more incorrect predictions.
(3) The loss function only produces a non-zero penalty if the prediction is incorrect.
3) Classification algorithms (10 points) Compare the following four algorithms for classification problems: logistic regression, perceptron, (linear and kernel) SVM, and decision tree. Consider whether each of the following statements is true or false. If true, check the box corresponding to "True". Otherwise, check the box corresponding to "False", and briefly explain why.
(1) The Perceptron convergence theorem shows that a Perceptron trained on any stream of data will make a finite number of errors.
(2) All four algorithms can accept as input x values with continuous features.
(3) Consider the Gaussian kernel: $K(x, x') = \exp(-\|x - x'\|^2 / (2\sigma^2))$, where $\sigma$ is a positive scalar. Increasing $\sigma$ would make over-fitting more likely.
(4) There is no advantage when selecting between the primal and dual formulation of an SVM if you are using a linear kernel.
(5) Training AdaBoost past the point at which it obtains perfect classification on the training data will most likely lead to overfitting.
4) AdaBoost (10 points) Assume a one-dimensional dataset with positive example x = 0 and two negative examples x = ±1. Consider three weak classifiers:
$h_1(x) = 1 \cdot \mathbb{1}[x > 1/2] - 1 \cdot \mathbb{1}[x \leq 1/2]$,
$h_2(x) = 1 \cdot \mathbb{1}[x > -1/2] - 1 \cdot \mathbb{1}[x \leq -1/2]$,
$h_3(x) = 1$,
where $\mathbb{1}[\cdot]$ is the indicator function which takes a value of 1 if its argument is true, and 0 otherwise.
(1) Run boosting for up to 10 iterations. What is the accuracy of the final classifier on this training dataset?
(2) Write out the selected hypothesis and value of α for the first two iterations. Use a log function with base e (i.e. “ln” in your calculations).
(3) Suppose we wanted to learn an SVM on this dataset. What Kernel function should we use in order to obtain perfect training accuracy?
5) Hinge Loss (12 points) Linear SVMs using a squared hinge loss can be formulated as an unconstrained optimization problem:
$$\hat{w} = \arg\min_w \sum_{i=1}^{n} H(y_i (w^T x_i)) + \lambda \|w\|_2^2, \qquad (6)$$
where $\lambda$ is the regularization parameter and $H(a) = \max(1 - a, 0)^2$ is the squared hinge loss function. The hinge loss function can be viewed as a convex surrogate of the 0/1 loss function $I(a \leq 0)$.
(1) Compared with the standard hinge loss function, what do you think are the advantages and disadvantages of the squared hinge loss function?
(2) The function $L(a) = \max(-a, 0)^2$ can also approximate the 0/1 loss function. What is the disadvantage of using this function instead?
(3) We can choose a different loss function $H'(a) = \max(0.5 - a, 0)^2$. Specifically, the new objective becomes:
$$\hat{w}' = \arg\min_w \sum_{i=1}^{n} H'(y_i (w^T x_i)) + \lambda' \|w\|_2^2. \qquad (7)$$
Consider how switching from $H$ to $H'$ will affect the solution of the objective function. Write the relationship between $\lambda$ and $\lambda'$ in the answer box.
3 What to Submit
In each assignment you will submit three things.
1. Submit your code (.py files) to cs475.org as a zip file. Your code must be uploaded as code.zip with your code in the root directory. By ‘in the root directory,’ we mean that the zip should contain *.py at the root (./*.py) and not in any sort of substructure (for example hw1/*.py). One simple way to achieve this is to zip using the command line, where you include files directly (e.g., *.py) rather than specifying a folder (e.g., hw1):
zip code.zip *.py
A common mistake is to use a program that automatically places your code in a subfolder. It is your job to make sure you have zipped your code correctly. We will run your code using the exact command lines described earlier, so make sure it works ahead of time, and make sure that it doesn’t crash when you run it on the test data. A common mistake is to change the command line flags. If you do this, your code will not run. Remember to submit all of the source code, including what we have provided to you. We will include requirements.txt but nothing else.
2. Submit your writeup to gradescope.com. Your writeup must be compiled from latex and uploaded as a PDF. The writeup should contain all of the answers to the analytical questions asked in the assignment. Make sure to include
your name in the writeup PDF and to use the provided latex template for your answers following the distributed template. You will submit this to the assignment called “Homework 1: Supervised Classifiers 1: Written”.
3. Submit your Jupyter notebook to gradescope.com. Create a PDF version of your notebook (File → Download As → PDF via LaTeX (.pdf)). Be sure you have run the entire notebook so we can see all of your output. You will submit this to the assignment called “Homework 1: Supervised Classifiers 1: Notebook”. When you submit on gradescope, you will indicate which output box corresponds to which questions.