
Assignment 2 
1 Perceptron Exercise [20 points]

Recall that a perceptron learns a linear classifier with weight vector $w$. It predicts

$\hat{y} = \mathrm{sign}(w \cdot x)$

(We assume here that $\hat{y} \in \{+1, -1\}$. Also, note that we are not using a bias weight $b$.) When the perceptron makes a mistake on the $n$th training example, it updates the weights using the formula

$w \leftarrow w + y_n x_n$

Imagine that we have each $x_n \in \mathbb{R}^2$, and we encounter the following data points:

x[1]   x[2]   y
  1      1    1
  2     -1   -1
 -3     -1   -1
 -3      1    1
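For concreteness, a single predict-and-update step might look like the minimal NumPy sketch below; the example point, label, and the convention of treating $w \cdot x = 0$ as a positive prediction are illustrative choices, not part of the assignment's specification.

import numpy as np

# One perceptron step on a single (made-up) training example.
w = np.zeros(2)                    # current weights
x = np.array([2.0, -1.0])          # an illustrative input point
y = -1                             # its label, in {+1, -1}

y_hat = 1 if np.dot(w, x) >= 0 else -1   # predict sign(w . x); ties at 0 treated as +1 here
if y_hat != y:                           # on a mistake, apply w <- w + y_n x_n
    w += y * x

print(w)   # here: [-2.  1.], since the zero-weight prediction was a mistake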
1. Starting with w = [0 0], use the perceptron algorithm to learn on the data points in the order from top to bottom. Show the perceptron’s linear decision boundary after observing each data point in the graphs below. Be sure to show which side is classified as positive. [10 points]
2. Does our learned perceptron maximize the margin between the training data and the decision boundary? If not, draw the maximum-margin decision boundary on the graph below. [5 points]
3. Assume that we continue to collect data and train the perceptron. If all data we see (including the points we just trained on) are linearly separable with margin $\gamma = 0.5$ and have maximum norm $\|x\|_2 \le 5$, what is the maximum number of mistakes we can make on future data? [5 points]
2 Implementation: Perceptron [70 points]

Implement the perceptron learning algorithm for binary classification, as described in class, from scratch, in Python. All of the files mentioned here are available in the tarball A2.tgz, available on Canvas. Implementation details:
• First, read in training data from a single file (we suggest that your program take the filename as a command-line argument) and store it in memory. You can assume for this assignment that you'll never have more than 1000 training examples.
• At the end of each epoch, report the number of mistakes made on that epoch to standard out.
• Test data will always be in a separate file, optionally provided on the command line. At the end of training, report the number of mistakes on the test data (if there is a test dataset).
• The format for training and testing data will always be one-instance-per-line, formatted as:

  y tab x[1] tab x[2] tab ... tab x[d] newline

  where tab is a single tab character and newline is a single newline character.
• If there's other diagnostic output you'd like to generate, send it to standard error. This will make it easier in case we need to run your code.

As a sanity check, train the perceptron on the dataset in A2.1.tsv, which has 50 positive and 50 negative examples and is separable by the hyperplane corresponding to weights:

$w^* = [-1.2 \;\; 2.7 \;\; -7.6 \;\; -2.8 \;\; -2.8]$

You should be able to converge to a weight vector that achieves zero training-set error in about 20 epochs; your weight vector might not be identical to $w^*$, of course. (A minimal, illustrative sketch of the input format and training loop appears after the questions below.)

The tarball for this assignment contains eight additional training datasets (A2.2.train.tsv, A2.3.train.tsv, ..., A2.9.train.tsv). Each one has a corresponding test dataset (files A2.*.test.tsv). For each dataset [8 points per dataset]:

1. Generate a plot in which the x-axis is the epoch, and the y-axis shows the training error rate in blue and the test error rate in red, at the end of that epoch, where you train on all of the training data. We suggest that you train for 1000 or more epochs. You might want to also include a plot that doesn't show all epochs (only earlier ones) so that the "interesting" trends are visible. You might also want to generate a more fine-grained plot that reports training and test error rates after every iteration of the inner loop (i.e., after each prediction the perceptron makes on a training instance); don't include such a plot in your writeup unless it shows something interesting we couldn't see in the first plots.
2. Do you believe that the training dataset is linearly separable? If so, give a lower bound on the margin.
3. Do you believe that your learning algorithm overfit the training data? If so, how do you know?
4. Do you suspect that any features are noise? Which ones, and why? (Please index features from 1 to 20 in the order they appear in the data files.)
5. Carve out a development dataset from the training data and use it to tune the perceptron's hyperparameter(s). Report training, development, and test data accuracy, as well as the hyperparameter(s) you tuned and the value you selected for the hyperparameter(s).

Present your answers to the non-plotting questions in a table formatted like Table 1.
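To make the input format and training loop above concrete, here is a minimal, illustrative sketch in NumPy. The file handling, the 1000-epoch default, the early stop on a mistake-free epoch, and all function and variable names are assumptions of this sketch, not requirements of the assignment; your own implementation may be organized differently.

import sys
import numpy as np

def load_tsv(path):
    """Read one-instance-per-line data: a label, a tab, then d tab-separated feature values."""
    labels, features = [], []
    with open(path) as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            labels.append(int(float(fields[0])))
            features.append([float(v) for v in fields[1:]])
    return np.array(labels), np.array(features, dtype=float)

def train_perceptron(y, X, epochs=1000):
    w = np.zeros(X.shape[1])
    for epoch in range(epochs):
        mistakes = 0
        for yn, xn in zip(y, X):
            y_hat = 1 if np.dot(w, xn) >= 0 else -1
            if y_hat != yn:
                w += yn * xn               # perceptron update on a mistake
                mistakes += 1
        print(f"epoch {epoch + 1}: {mistakes} mistakes")   # per-epoch mistakes to standard out
        if mistakes == 0:                  # optional: stop once the training data are separated
            break
    return w

if __name__ == "__main__":
    y_train, X_train = load_tsv(sys.argv[1])        # training file as a command-line argument
    w = train_perceptron(y_train, X_train)
    if len(sys.argv) > 2:                           # optional test file
        y_test, X_test = load_tsv(sys.argv[2])
        preds = np.where(X_test @ w >= 0, 1, -1)
        print(f"test mistakes: {int(np.sum(preds != y_test))}")

Saved as, say, perceptron.py, this could be run as python perceptron.py A2.2.train.tsv A2.2.test.tsv, with the test file optional, matching the command-line convention suggested above.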
                question number
   dataset     2      3      4      5
      2
      3
      4
      5
      6
      7
      8
      9

Table 1: Write your answers to part 2 into a table like this one.

Surprise! [6 points] You may not have noticed that test sets 6–8 are identical (but they are). The three corresponding training datasets increase in size from 6 to 7 and from 7 to 8, but: (i) they don't overlap and (ii) they do come from the same source. What do you notice when you compare your findings for these three datasets? Can you use this information to get better performance on test set 6 (equivalently, test set 7, equivalently test set 8), still using your perceptron implementation? If so, go for it and give answers to the same questions as above (1–5).
3 Beat the Perceptron [10 points]

Choose one of the datasets (2–9) where the perceptron's test-set performance was not strong, and try to improve using any of the following:

• Decision stump
• K-nearest neighbors
• Voted perceptron

If there are hyperparameters, you should tune them on your development data, not the test data. Tell us which dataset (2–9) you focused on and why. Fully (and concisely) explain the exploration and choices you made as you tuned hyperparameters. Report on how well you were able to perform on the test set, making a fair comparison to the perceptron. It's fine to test more than one of the methods listed above, but you must implement everything yourself and you must report on all of the algorithms you tried (no tuning on test data, even to select the algorithm!). You are only required to test one of them.
4 PCA on digits [40 points]

Let's do PCA and reconstruct the digits in the PCA basis. Download the file mnist.pkl.gz. Load the MNIST dataset in Python as follows.
If you use Python 2:
import gzip, pickle
with gzip.open('mnist.pkl.gz') as f:
    train_set, valid_set, test_set = pickle.load(f)

If you use Python 3:

import gzip, pickle
with gzip.open('mnist.pkl.gz') as f:
    train_set, valid_set, test_set = pickle.load(f, encoding='bytes')
The array train_set[0] contains the images, represented as vectors of 784 dimensions. Each example can be reshaped to a matrix of size 28×28. Because PCA acts on vectors, you will have to understand the conceptual (and algorithmic) conversion of images to vectors (and back, so you can plot your reconstructions as images). As you will be making many visualizations of images, plot these images in a reasonable arrangement (e.g., make the images smaller and arrange them in some easily viewable subplot; points may be deducted for disorganized plots). Also, if you are using the Python imshow() command, make sure you understand how to visualize grayscale images. Also, this function, if provided with a matrix of floats, needs the values to be between 0 and 1, so you may need to rescale an image when plotting it so that it lies between 0 and 1 (else the plots could look odd, as imshow() drops values out of this range).
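As a purely illustrative sketch of the vector-to-image conversion and grid plotting described above (the grid shape, the arbitrary choice of the first ten training images, and the rescaling step are assumptions of this sketch, not requirements):

import gzip, pickle
import numpy as np
import matplotlib.pyplot as plt

with gzip.open('mnist.pkl.gz') as f:
    train_set, valid_set, test_set = pickle.load(f, encoding='bytes')

X = train_set[0]                                   # each row is one 784-dimensional image vector
fig, axes = plt.subplots(2, 5, figsize=(8, 3.5))   # small images in an easily viewable grid
for ax, idx in zip(axes.ravel(), range(10)):       # first ten images, chosen only for illustration
    img = X[idx].reshape(28, 28)                   # vector -> 28 x 28 image
    img = (img - img.min()) / (img.max() - img.min() + 1e-12)   # rescale into [0, 1] for imshow
    ax.imshow(img, cmap='gray', vmin=0, vmax=1)
    ax.axis('off')
plt.tight_layout()
plt.show()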
4.1 Covariance matrix construction and the advantages of matrix operations [13 points]

Let $d$ be the number of dimensions and $n$ be the number of samples. Suppose $X$ is your data matrix of size $n \times d$. Let $\mu$ be the mean (column) vector of your data. Let us now compute the covariance matrix of the dataset. Throughout this problem you will use the (entire) training set.

1. (1 point) Choose 10 digits (make sure they are from different classes). Visualize these digits. In particular, include figures of these 10 digits.
2. (1 point) What are the dimensions of the covariance matrix? What is the dimension of the mean vector $\mu$?
3. (1 point) Visualize the mean image (and provide a figure).
4. (3 points) Let $x_i$ be a vector representing the $i$-th data point. Let us write out two methods for computing the covariance matrix:
   • vector-based method: Write out the covariance matrix, $\Sigma$, in terms of the $x_i$'s and $\mu$ (and not the matrix $X$). The method should have a sum over $i$ in it.
   • matrix-based method: Now, in matrix algebra notation, write out the covariance matrix as a function of the matrix $X$ and the (column) vector $\mu$ (and there should be no sum over $i$).
5. (5 points) (Appreciating linear algebra) Compute the covariance matrix using two different methods. You must measure the run-time of both of the following methods:
• the vector-based method: Compute the covariance matrix using your vector-based algorithm from the previous question. Note that this will involve a for loop over all 50K data points (and you may want to use the Python linear algebra function 'outer' to compute the outer product, as the transpose operation does not work on 1-D arrays). This method may take a little time if you are running it on your laptop.
• the matrix-based method: Compute the covariance matrix using matrix multiplications, as you wrote out earlier.

Measure the runtime using a timing procedure in Python (not your own clock). List what time function you used. The time must reflect the time of just the compute operations, and it should not reflect the time to load the data matrices (you should have these already loaded in memory before your code starts the timer). Check that these two matrices give you the same answer. In particular, report the average absolute difference $\frac{1}{d^2}\sum_{i,j} |\Sigma^{\text{vec. method}}_{i,j} - \Sigma^{\text{mat. method}}_{i,j}|$ (which should be numerically near 0). How long did each method take? What is the percentage increase in runtime? (An illustrative timing sketch appears at the end of this subsection.)

6. (2 points) (Looking under the hood) As you used Python and a standard linear algebra library (e.g., NumPy), both of these two methods were implemented on your computer with the same underlying number of primitive mathematical operations: under the hood, they both were implemented with the same number of additions of (scalar) floating point numbers and the same number of multiplications of (scalar) floating point numbers¹. This leads to the puzzling question: why was your matrix-based method (using matrix multiplications) so much faster than your for-loop-based, vector code? A concise answer, hitting the important reason, is sufficient. You are free to do a Google search to find the answer. (Hint: had the question asked you to code the vector-based method in the C or C++ languages, you would not have observed much of a difference.)

Implications: this comparison has important implications for practical machine learning. Matrix-algebra-structured algorithms are the bread and butter of modern machine learning for good reason. It is worth having linear algebra skills at your fingertips. It may also give you an appreciation of the value of GPU processors, which speed up matrix multiplication and other basic matrix algebra methods (such as FFTs) by factors between 10 and 100 times; the bottleneck of GPUs is often just copying data onto them, in blocks.

¹ In theory, and rather surprisingly, matrix multiplication can actually be implemented with a smaller number of float additions and float multiplications than that used by the vector-based approach. One such algorithm is the famous Strassen's algorithm, which led to the (nearly theoretically fastest) Coppersmith–Winograd algorithm. However, due to constant factors and realistic modern architecture constraints, these theoretically faster methods are rarely used; instead, the naive brute-force approach to matrix multiplication is what is under the hood in linear algebra packages.
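The following is one possible, illustrative way to time the two methods; the random stand-in data, the normalization by n, the use of time.perf_counter(), and the variable names are assumptions of this sketch rather than requirements (you would use the actual 50K x 784 training matrix and whichever timer and covariance convention your writeup specifies).

import time
import numpy as np

rng = np.random.default_rng(0)
n, d = 5000, 100                       # small random stand-in for the real data matrix
X = rng.random((n, d))
mu = X.mean(axis=0)

# Vector-based method: explicit sum of outer products over the data points.
start = time.perf_counter()
sigma_vec = np.zeros((d, d))
for x_i in X:
    diff = x_i - mu
    sigma_vec += np.outer(diff, diff)
sigma_vec /= n
t_vec = time.perf_counter() - start

# Matrix-based method: one centered matrix product, no Python-level loop.
start = time.perf_counter()
Xc = X - mu
sigma_mat = (Xc.T @ Xc) / n
t_mat = time.perf_counter() - start

avg_abs_diff = np.abs(sigma_vec - sigma_mat).mean()    # (1/d^2) * sum of absolute entry differences
print(f"loop: {t_vec:.3f}s   matrix: {t_mat:.3f}s   avg |diff|: {avg_abs_diff:.2e}")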
4.2 Eigen-Analysis [13 points]

Let $\Sigma = U D U^{\top}$ be the SVD of $\Sigma$.

1. (3 points) What are the eigenvalues $\lambda_1$, $\lambda_2$, $\lambda_{10}$, $\lambda_{30}$, and $\lambda_{50}$? Also, what is the sum of eigenvalues $\sum_{i=1}^{d} \lambda_i$?
2. (4 points) It is not difficult to see that the fractional reconstruction error of using the top $k$ out of $d$ directions is $1 - \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{d} \lambda_i}$. Plot this fractional reconstruction error for $k$ from 1 to 100 (so the x-axis is $k$ and the y-axis is the fractional reconstruction error); one way to compute these quantities is sketched after this list.
3. (2 points) Roughly, what eigenvector captures 50% of the variance? 80% of the variance?
4. (3 points) Now let us get a sense of what the top PCA directions are capturing (recall these are the directions which capture the most variance). Display the first 10 eigenvectors as images.
5. (1 point) Provide a brief interpretation of what you think they capture.
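As a hedged sketch of these computations (assuming a covariance matrix sigma has already been computed as in Section 4.1; the use of np.linalg.eigh, the descending sort, and the variable names are choices made for this illustration):

import numpy as np
import matplotlib.pyplot as plt

# `sigma` is assumed to be the d x d covariance matrix from Section 4.1.
eigvals, eigvecs = np.linalg.eigh(sigma)     # eigh suits symmetric matrices; eigenvalues come out ascending
order = np.argsort(eigvals)[::-1]            # reorder so lambda_1 is the largest
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(eigvals[[0, 1, 9, 29, 49]])            # lambda_1, lambda_2, lambda_10, lambda_30, lambda_50
print(eigvals.sum())                         # sum of all eigenvalues

# Fractional reconstruction error for k = 1..100.
ks = np.arange(1, 101)
frac_err = 1 - np.cumsum(eigvals)[:100] / eigvals.sum()
plt.plot(ks, frac_err)
plt.xlabel('k')
plt.ylabel('fractional reconstruction error')
plt.show()

# First 10 eigenvectors displayed as 28 x 28 images (imshow autoscales each one here).
fig, axes = plt.subplots(2, 5, figsize=(8, 3.5))
for ax, i in zip(axes.ravel(), range(10)):
    ax.imshow(eigvecs[:, i].reshape(28, 28), cmap='gray')
    ax.axis('off')
plt.show()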
4.3 Pseudocode for Reconstruction [5 points]

This problem will help you to implement your dimensionality reduction algorithm and your reconstruction algorithm with matrix algebra (and avoid writing a for-loop over all the data points). Throughout this problem, assume you will project each image down to $k$ dimensions and that you want to obtain the best reconstruction on this dataset (in terms of the squared error, as discussed in class and in CIML).

(Hint: In this problem, it may also be helpful to define an $n$-dimensional (column) vector of all 1's and to denote this vector by $\vec{1}$. Note that $\vec{1}\mu^{\top}$ is an $n \times d$ matrix.)

1. (1 point) The SVD will give you $d$ eigenvectors, each corresponding to some eigenvalue $\lambda_i$. Which of these vectors (and how many of them) will you use?
2. (2 points) Specify a matrix $\tilde{U}$ of size $d \times k$ that is constructed out of your chosen eigenvectors. Write a matrix algebra expression that reduces the dimension of all the points in your dataset. This expression should give you a matrix of size $n \times k$, containing all of your dimensionality-reduced data. Your expression should use matrix algebra and should be stated in terms of $X$, $\mu$, and $\tilde{U}$ (this expression should have no sums in it).
3. (2 points) Now you should have a data matrix of size $n \times k$, which consists of all your data after the dimensionality has been reduced. Write out an expression that will give you a reconstruction of all your data. In particular, you should recover a matrix of size $n \times d$. Your method should be stated using matrix algebra (and should have no sums in it). Be sure that your expression also adds the mean back into your data, else the reconstruction will not be correct.
4.4 Visualization and Reconstruction [9 points]

1. (3 points) Now code your matrix algebra method that reconstructs $X$ after projecting it down to $k$ dimensions. Do this for all 50K data points. For $k = 30$, time this procedure. How long does it take to project and reconstruct all 50K data points? (An illustrative sketch of such a projection and reconstruction appears after this list of questions.)
2. (5 points) Now let us visualize these reconstructed digits for different values of $k$. Visualize these reconstructions for all of the following cases (using the same 10 digits you visualized earlier).
(a) Do this for $k = 3$.
(b) Do this for $k = 5$.
(c) Do this for $k = 10$.
(d) Do this for $k = 25$.
(e) Do this for $k = 50$.
(f) Do this for $k = 200$.
3. (1 point) Provide a brief interpretation, in terms of your perceptions of the quality of these reconstructions.
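A hedged sketch of the matrix-algebra projection and reconstruction follows; it assumes X is the n x d data matrix, mu its mean vector, and eigvecs the eigenvector matrix sorted by decreasing eigenvalue (as in the earlier sketches), and all names here are illustrative rather than prescribed.

import time
import numpy as np

def reconstruct(X, mu, eigvecs, k):
    """Project X onto the top-k eigenvectors and map it back to d dimensions."""
    U_k = eigvecs[:, :k]              # d x k matrix of the top-k eigenvectors
    Z = (X - mu) @ U_k                # n x k reduced representation
    return Z @ U_k.T + mu             # n x d reconstruction, with the mean added back

start = time.perf_counter()
X_hat = reconstruct(X, mu, eigvecs, k=30)
print(f"projecting and reconstructing all points at k=30 took {time.perf_counter() - start:.3f}s")

The same function could then be called with k = 3, 5, 10, 25, 50, and 200, and the resulting rows reshaped to 28 x 28 for the visualizations requested above.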
5 Conditional Probability Review: Revisiting the Bayes Optimal Classifier [8 points]

In class and in CIML, it is argued that, for binary prediction, the classifier defined by

$f^{(\mathrm{BO})}(x) = \operatorname{argmax}_y D(x, y)$

achieves the minimal expected classification error among all possible classifiers (CIML provides a proof). This classifier is referred to as the "Bayes Optimal Classifier"; the classifier must know the underlying distribution $D$.

1. (6 points) Show that an equivalent way to write this classifier is $f^{(\mathrm{BO})}(x) = \operatorname{argmax}_y D(y \mid x)$.
2. (2 points) Give an intuitive reason (in words rather than with mathematics) why this classifier, written in terms of the conditional probability, is unbeatable.
6 Bonus: Dimensionality Reduction in Learning [up to 10 extra points]

For the dataset you considered in part 3, reduce its dimensionality using PCA and use the resulting features instead of the original ones, with both the perceptron and the other algorithm(s) you implemented. For maximal extra points, you must give a compelling argument that PCA offered some benefit (and explain the benefit), or that it could not help. Be sure to explain how you chose $k$. Plots that help make your argument more clear are encouraged.
7 Bonus: The Bayes Optimal Hypothesis for "Regression" [up to 10 extra points]

Suppose now that $y$ is real valued. The square loss of a hypothesis $f$, where $f$ maps an input to a real value, is defined as:

$\ell(f) = \mathbb{E}_{(x,y)\sim D}\left[(y - f(x))^2\right].$

Prove that the Bayes optimal hypothesis for the square loss (i.e., the hypothesis that achieves the minimal possible square loss among all possible hypotheses) is:

$f^{(\mathrm{BO})}(x) = \mathbb{E}[y \mid x]$

(where the conditional expectation is with respect to $D$). You may prove this assuming that the number of possible $x$ and $y$ values is finite, if that makes things conceptually easier for you.
