
Assignment 3 
1 Binary Classification with Linear Regression on MNIST (50 points)

Grab the “mnist 2 vs 9.gz” dataset on Canvas. In binary classification, we restrict Y to take on only two values. Suppose Y ∈ {0, 1} rather than Y ∈ {−1, 1}. Let us consider the classification problem of recognizing if a digit is a “2” or a “9”; y = 1 is the label corresponding to class “2” and y = 0 corresponds to the class “9”.
Throughout this problem, all vectors (in written form) should refer to column vectors by default; x_n represents a vector of the pixel intensities of the image. If a row vector is needed, then use a transpose. Whenever we write X, Y, or Ŷ, let us assume that they have the following sizes: X ∈ R^{N×d}, Y ∈ R^{N×1}, Ŷ ∈ R^{N×1}.

Important: Be sure to include a “bias” term in this problem, else you will have significantly worse results. You can do this by adding an additional column (of all ones) to the train, test, and dev matrices.

Remark: If, like the instructor, you prefer an easy hack that you think will work, you could simply overwrite the last column to be all ones, e.g. “Xtrain[:,783]=1”, “Xdev[:,783]=1”, “Xtest[:,783]=1”. This just overwrites some pixel in the corner of the image with a one; we do not expect that this single pixel would be that helpful anyway. It should be clear that this is not the best practice.
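For the better practice of appending a bias column, a minimal NumPy sketch is below (the variable names and usage are assumptions, not part of the assignment):

```python
import numpy as np

def add_bias_column(X):
    """Append a column of ones to an (N, d) data matrix, giving shape (N, d + 1)."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

# Hypothetical usage, assuming Xtrain, Xdev, Xtest are already loaded as NumPy arrays:
# Xtrain = add_bias_column(Xtrain); Xdev = add_bias_column(Xdev); Xtest = add_bias_column(Xtest)
```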
1.1 Linear Regression, using the Closed Form Estimator (15 points)

Now let us use least squares linear regression for classification. Define the objective function (the cost function):

$$L_\lambda(w) := \frac{1}{N} \sum_{n=1}^{N} \frac{1}{2}\left(y_n - w \cdot x_n\right)^2 + \frac{\lambda}{2}\|w\|^2$$
This can be equivalently written as:

$$L_\lambda(w) = \frac{1}{N} \cdot \frac{1}{2}\|Y - Xw\|^2 + \frac{\lambda}{2}\|w\|^2$$

(this form helps us compute the loss quickly). The optimization problem we seek to solve here is:
$$\arg\min_{w} \; L_\lambda(w)$$
and the solution is:
$$w^*_\lambda = \left(\frac{1}{N} X^\top X + \lambda I_d\right)^{-1}\left(\frac{1}{N} X^\top Y\right),$$

where I_d denotes the d×d identity matrix. (See the class notes and CIML. One can also directly derive this through differentiation.) For the purposes of classification using the square loss, we can use the following rule: label a digit as a “2”, i.e. y = 1, if w·x ≥ 1/2 (do you see why 1/2 is a reasonable threshold?). A sketch of this closed-form computation is given after the questions below.

1. (4 points) Let’s try out the vanilla least squares estimator. This corresponds to using λ = 0. What happened? If something went wrong, give a brief description of what happened and give a compelling reason as to what is causing it on this particular dataset. (Hint: as you have visualized these digits, you can actually provide a very specific reason.)

2. (10 points) Now choose an appropriate value of λ, based on trying out a few different values. Report:

(a) (1 point) your choice of λ.
(b) (4 points) your average squared error, i.e. (1/N)·(1/2)‖Y − Xw‖², on the training set, the dev set, and the test set (you should have X and Y correspond to the training set, the dev set, and the test set, respectively).

(c) (5 points) Report your misclassification error (as a percentage) on the training set, the dev set, and the test set. Remark: While our decision rule said to use a threshold of 1/2, you are free to tune this threshold on the dev set (which is often done in practice); you might note that this leads to a notable drop in error. While not necessary, you are welcome to use a different threshold if you prefer to report a lower error.

3. (1 point) Why might linear regression be a poor idea for classification?
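A minimal NumPy sketch of the closed-form estimator and the thresholded decision rule above (illustrative only; the function names and the use of np.linalg.solve are assumptions, not a prescribed implementation):

```python
import numpy as np

def ridge_closed_form(X, Y, lam):
    """Solve w* = ((1/N) X^T X + lam * I_d)^{-1} ((1/N) X^T Y)."""
    N, d = X.shape
    A = (X.T @ X) / N + lam * np.eye(d)
    b = (X.T @ Y) / N
    return np.linalg.solve(A, b)

def misclassification_rate(X, Y, w, threshold=0.5):
    """Fraction of examples whose thresholded prediction disagrees with the 0/1 label."""
    preds = (X @ w >= threshold).astype(float)
    return float(np.mean(preds != Y))
```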
1.2 Linear regression using gradient descent (20 points)

It is convenient to define ŷ_n = w·x_n. We can interpret ŷ_n as our prediction on the n-th input due to using w; let us keep in mind that ŷ_n implicitly depends on w.

1. (4 points) Show that:

$$\frac{dL_\lambda(w)}{dw} = -\frac{1}{N} \sum_{n=1}^{N} (y_n - \hat{y}_n)\, x_n + \lambda w.$$

If you are not comfortable taking the derivative directly with respect to the vector w, you are free to instead take the derivative with respect to each of the d components, w = (w_1, w_2, ..., w_d), of the vector w. You should be able to see that the gradient dL_λ(w)/dw is simply a vector of these d component-wise derivatives. Remark: You may gain intuition for gradient descent by interpreting it as a certain “error driven” update based on the form of this gradient.

2. (4 points) In order to make our code faster, simplify this gradient expression by removing the “sum over n”: express it with matrix algebra in terms of X ∈ R^{N×d}, Y ∈ R^{N×1}, Ŷ ∈ R^{N×1} (and other relevant quantities).

3. (12 points) Now run gradient descent (a sketch is given after this list):

(a) (2 points) What stepsize do you find works well? You might have to search around a little (it helps to search by trying things out like 1 and appropriately searching up or down in multiples of 10).

(b) (5 points) Make a plot showing your training average squared error, your development average squared error, and your test average squared error on the y-axis and the iteration on the x-axis. All three curves should be on one same plot. What value of λ did you use? Remark: When we run gradient descent (or stochastic gradient descent), there is a sense in which the algorithm itself performs regularization. In this particular setting, you might note
that things do not seem to diverge even if λ = 0 and may work just fine. This is a form of “early stopping”; often, when we prevent our algorithm from running “too long” we implicitly control the model complexity.

(c) (5 points) Make this plot again (with all three curves), except use the misclassification error, as a percentage, instead of average squared error. Here, make sure to start your x-axis at a slightly later iteration (past the first iteration), so that your error starts below 5%, which makes the behavior easier to view (it is difficult to view the long-run behavior if the y-axis is over too large a range). What is the lowest test misclassification % error that you can achieve? It is expected that you obtain a good test error (meaning you train long enough and you regularize appropriately, if needed).
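A minimal sketch of full-batch gradient descent using the matrix form of the gradient above (illustrative; the step size, iteration count, and zero initialization are assumptions):

```python
import numpy as np

def gradient_descent(X, Y, lam, stepsize, num_iters):
    """Full-batch GD on the regularized square loss; returns the learned weight vector."""
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(num_iters):
        Y_hat = X @ w
        grad = -(X.T @ (Y - Y_hat)) / N + lam * w   # matrix form of the gradient
        w = w - stepsize * grad
    return w
```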
1.3 Linear regression using stochastic gradient descent (15 points)

1. (15 points) Now run stochastic gradient descent, using one point at a time (a sketch follows below):

(a) (3 points) Roughly, what is the stepsize at which SGD starts to diverge? Why would you expect it to diverge at too large a learning rate?

(b) (6 points) After every 500 updates (starting before your first update at 0), make a plot showing your training average squared error, your development average squared error, and your test average squared error on the y-axis and the iteration on the x-axis. All three curves should be on one same plot. What value of λ did you use? Specify your learning rate scheme if you chose to decay your learning rate.

(c) (6 points) Make this plot again (with all three curves), except use the misclassification error, as a percentage, instead of average squared error. Here, make sure to start your x-axis at a slightly later iteration, so that your error starts below 5%, which makes the behavior easier to view (it is difficult to view the long-run behavior if the y-axis is over too large a range). Again, it is expected that you obtain a good test error (meaning you train long enough and you regularize appropriately, if needed). Report the lowest test error.
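A minimal single-point SGD sketch (illustrative; the uniform sampling with replacement, constant step size, and seed are assumptions):

```python
import numpy as np

def sgd(X, Y, lam, stepsize, num_updates, seed=0):
    """SGD on the regularized square loss, one randomly sampled point per update."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(num_updates):
        n = rng.integers(N)                       # pick one training point at random
        y_hat = X[n] @ w
        grad = -(Y[n] - y_hat) * X[n] + lam * w
        w = w - stepsize * grad
    return w
```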
1.4 EXTRA CREDIT: Mini-batch, stochastic gradient descent (10 points)

Again, due to the manner in which matrix multiplication methods (as opposed to using “for loops”) allow for faster runtimes (often through GPU processors), it is often much faster to use “mini-batch” methods by sampling m points at a time. By increasing the batch size, we reduce the variance in stochastic gradient descent. In practice (and in theory), this tends to be very helpful as we increase m, and then there tends to be (relatively sharp) diminishing returns. A mini-batch sketch is given at the end of this problem.

1. Now run stochastic gradient descent, using a mini-batch size of m = 100 points at a time. Here, each parameter update means you use m = 100 randomly sampled training points.

(a) (1 point) Roughly, what is the stepsize at which SGD starts to diverge? You might find it interesting that this stepsize is different than the m = 1 case.

(b) (4 points) After every 500 updates (starting before your first update at 0), make a plot showing your training average squared error, your development average squared error, and your
average test squared error on the y-axis and the iteration on the x-axis. All three curves should be on one same plot. What value of λ did you use (if you used it)? Specify your learning rate scheme if you chose to decay your learning rate. Remark: Note that every update now touches 100 points. However, an update should not be 100 times slower (even though, technically, your computer is doing 100 times as much computation). This is, again, due to how matrix multiplication is implemented.

(c) (4 points) Make this plot again (with all three curves), except use the misclassification error, as a percentage, instead of average squared error. Here, make sure to start your x-axis at a slightly later iteration, so that your error starts below 5%, which makes the behavior easier to view (it is difficult to view the long-run behavior if the y-axis is over too large a range). Again, it is expected that you obtain a good test error (meaning you train long enough and you regularize appropriately, if needed). Report the lowest test error.

2. (1 point) Comment on how your plots differ from using SGD with m = 1.
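A minimal mini-batch variant of the SGD sketch above (illustrative; the batch size default and sampling with replacement are assumptions):

```python
import numpy as np

def minibatch_sgd(X, Y, lam, stepsize, num_updates, m=100, seed=0):
    """SGD on the regularized square loss, m randomly sampled points per update."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(num_updates):
        idx = rng.integers(N, size=m)                 # sample a mini-batch of indices
        Xb, Yb = X[idx], Y[idx]
        grad = -(Xb.T @ (Yb - Xb @ w)) / m + lam * w  # matrix-form gradient on the batch
        w = w - stepsize * grad
    return w
```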
1.5 EXTRA CREDIT: Using polynomial features! (10 points)

Now let us “blow up” the dimensionality by including all the quadratic interactions along with the linear terms. Let us understand this by example. Suppose x is three dimensional, i.e. x = (x[1], x[2], x[3]). Define a new feature vector as follows:

$$\phi(x) = \left(1,\, x[1],\, x[2],\, x[3],\, x[1]^2,\, x[2]^2,\, x[3]^2,\, x[1]x[2],\, x[1]x[3],\, x[2]x[3]\right).$$

The first term is the bias term, the next three coordinates are the “linear” terms, and the remaining terms are the quadratic terms.

1. (optional) What is the dimensionality of φ(x) for a d-dimensional input x?

2. (2 points) Crudely (just get the right order of magnitude), if we did this on our images (of dimensionality d = 784), what would be the dimensionality of the original terms plus the quadratic interaction terms? Why might this be a poor idea?

3. (1 point) Instead, let us consider building our feature vector using the d = 40 reduced dimension. Why is this a better idea?

Construction: Obtain the dataset from Canvas with 40-dimensional PCA features. In this dataset, the instructor overwrote the last PCA coordinate with a one. You will see why this is a nice hack in a moment. Now make a new dataset to see what our classification error is with these quadratic plus linear interaction terms. You can easily build a dataset of 40² = 1600 features, which includes a bias term, the linear terms, and all the quadratic terms. By simply using the original 40 PCA features (with the instructor’s hack baked in), you can do two for loops (over the 40 indices) to build all the features (including the bias, the linear features, and the quadratic terms). Do you see how to do this and why this gets you the bias, the linear terms, and the quadratic terms? A sketch of this construction is given below. I don’t mind a discussion of this in the discussion board, and I don’t mind if someone posts this feature construction code to the discussion board.

4. (7 points) Now run the algorithm of your choice (along with any regularization you deem appropriate). Report your average squared training error, your dev error, and your test error. Also report the corresponding misclassification errors.
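A minimal sketch of the two-for-loop construction (illustrative; it relies on the last PCA coordinate already being 1, so the pairwise products contain the bias, the linear terms, and the quadratic terms):

```python
import numpy as np

def quadratic_features(X):
    """Map each 40-dim row x (whose last entry is 1) to the 1600 products x[i] * x[j]."""
    N, d = X.shape
    Phi = np.empty((N, d * d))
    for i in range(d):
        for j in range(d):
            Phi[:, i * d + j] = X[:, i] * X[:, j]
    return Phi
```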
Remark: You can see where this is going.... We could certainly try cubic features, etc. If you start on this path, regularization becomes important as does having fast algorithms.
2 Binary Classification with Logistic Regression (10 Points)

Recall the probabilistic model:

$$p_w(y = 1 \mid x) = \frac{1}{1 + \exp(-w \cdot x)}, \qquad p_w(y = 0 \mid x) = 1 - p_w(y = 1 \mid x) = \frac{1}{1 + \exp(w \cdot x)}.$$
Our objective function here is:
$$L_\lambda(w) = -\frac{1}{N} \sum_{n=1}^{N} \log p_w(y = y_n \mid x_n) + \frac{\lambda}{2}\|w\|^2.$$
Now let us minimize our cost. It is helpful to define:
$$\hat{y}_n = p_w(y = 1 \mid x_n).$$

You can view this as a probabilistic prediction. This definition will help in allowing you to easily modify your previous code.

1. (3 points) Show that:
$$\frac{dL_\lambda(w)}{dw} = -\frac{1}{N} \sum_{n=1}^{N} (y_n - \hat{y}_n)\, x_n + \lambda w.$$

Remark: You might find this expression to be rather curious! It looks identical to the expression for our gradient in the average squared error case. You are free to think about why this was a fortunate coincidence. The choice of y ∈ {0, 1} was indeed intentional.

2. (1 point) Again, in order to make our code fast, simplify this gradient expression with matrix algebra by expressing it in terms of X ∈ R^{N×d}, Y ∈ R^{N×1}, Ŷ ∈ R^{N×1} (and other relevant quantities).

3. (6 points) Let us understand a few properties of logistic regression.

(a) Suppose our training data are linearly separable and λ = 0. What does our weight vector converge to? Why?

(b) Suppose now that d ≥ n, λ = 0, and our n data points are all linearly independent. What does our weight vector converge to? Why?

(c) In both of these cases, why does regularization or “early stopping” make sense? Consider the implications for your true error in your answer.
2.1 EXTRA CREDIT: Try it out! (15 points)

Implement logistic regression on our “2” vs “9” dataset. Note: here you need to modify the decision rule: we should label a digit as a “2”, i.e. y = 1, if w·x ≥ 0 (do you see why?). A sketch of the pieces that change relative to the square-loss code is given at the end of this problem.

1. Now run GD or SGD (with your choice of batch size).

(a) (1 point) Specify all your parameter choices (this must include m, λ, and the step size scheme if you decay it).

(b) (6 points) Show your log loss on the training set, the development set, and the test set (where the y-axis is the log loss and the x-axis is the iteration). You should be able to convince yourself that the log loss (sometimes referred to as the cross entropy) can be computed as:

$$-\frac{1}{N} \sum_{n} \left( y_n \log \hat{y}_n + (1 - y_n)\log(1 - \hat{y}_n) \right),$$

which can be directly computed in python with operations on the vectors Y and Ŷ. All three curves should be on the same plot. What value of λ did you use? Specify your learning rate scheme if you chose to decay your learning rate.

(c) (3 points) Make this plot again (with all three curves), except use the misclassification error, as a percentage, instead of the average squared error. Here, make sure to start your x-axis at a slightly later iteration, so that your error starts below 5%, which makes the behavior easier to view (it is difficult to view the long-run behavior if the y-axis is over too large a range). Again, it is expected that you obtain a good test error (meaning you train long enough and you regularize appropriately, if needed). Report your lowest test error.

2. (5 points) Now try out using the quadratic interaction features of the PCA projections from before. Specify the algorithm you used along with your parameter choices (you do not need to provide plots). What misclassification % error did you achieve on the training set, the dev set, and the test set?
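A minimal sketch of the parts that differ from the square-loss code: the sigmoid prediction, the (identical-looking) gradient, and the log-loss computation (illustrative; the clipping away from 0 and 1 is an assumption added for numerical stability):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient(X, Y, w, lam):
    """Same matrix form as the square-loss gradient, but with sigmoid predictions."""
    Y_hat = sigmoid(X @ w)
    return -(X.T @ (Y - Y_hat)) / X.shape[0] + lam * w

def log_loss(Y, Y_hat, eps=1e-12):
    """Average cross entropy; predictions are clipped to avoid log(0)."""
    Y_hat = np.clip(Y_hat, eps, 1 - eps)
    return float(-np.mean(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat)))
```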
3 Multi-Class Classification using Least Squares (20 Points)

The MNIST dataset, http://yann.lecun.com/exdb/mnist/, has been historically interesting. Also, much of the earlier focus on certain methods and algorithms in machine learning was partly due to obtaining good results on this dataset. If you work on the extra credit problem later on, you will gain a little more perspective on how many of these issues are not so relevant once we start having a much more “flexible” representation. Let us now build a classifier for digit recognition on all 10 digits.
3.1 “One vs all classification” with Linear Regression (15 Points)

In the previous two-class problem, we used linear regression with y ∈ {0, 1}. Now we have 10 classes. Here we will use a “one-hot” encoding of the label. The label y_n will be a 10-dimensional vector, where the k-th entry is 1 if the label is for the k-th class and all other entries are 0.
1. (0 points) Create a label matrix of size Y ∈ R^{N×10} for both your training set and your test set.

Here, we can consider a vector-valued prediction:

$$\hat{y}_n = W^\top x_n,$$

where W ∈ R^{d×10}. As discussed in class, we can define the objective function here as:

$$L_\lambda(W) := \frac{1}{N} \sum_{n=1}^{N} \frac{1}{2}\left\|y_n - W^\top x_n\right\|^2 + \frac{\lambda}{2}\|W\|^2$$
where you can view the penalty as the sum of the squares of the entries of W. Note that this formulation is literally the same as doing 10 binary classification problems, one for each class separately: for each k = 0, 1, 2, ..., 9 you do a linear regression where you label a digit as Y = 1 if and only if the label for this digit is k. It is straightforward to verify that the solution is:
$$W^*_\lambda = \left(\frac{1}{N} X^\top X + \lambda I_d\right)^{-1}\left(\frac{1}{N} X^\top Y\right).$$
Note that here we are stacking the vectors y_n and ŷ_n into the matrices Y ∈ R^{N×10} and Ŷ ∈ R^{N×10}. For classification, you will then take the largest predicted score among your 10 predictors.

Def. of the misclassification error: We say a mistake is made on an example (x, y) if our prediction in {1, ..., k} does not equal the label y. The % misclassification error (on our training, dev, test, etc.) is the % of such mistakes made by our prediction method on the respective dataset.

Remark: This is sometimes referred to as “one against all” prediction. Note that we are just doing 10 separate linear regressions and are stacking our answers together. Also, the gradient of this loss function can be expressed as:
$$\frac{dL_\lambda(W)}{dW} = -\frac{1}{N} \sum_{n=1}^{N} x_n (y_n - \hat{y}_n)^\top + \lambda W. \qquad (1)$$
Note that this expression is of size d×k.

Dataset: You will use the MNIST dataset you used from the last assignment. It contains all 10 digits with the labels. The instructor will post a function on Canvas for computing the misclassification % error that you are free to use.

1. (3 points) Based on the above gradient expression, write out the matrix algebra expression for this gradient in terms of X ∈ R^{N×d}, Y ∈ R^{N×10}, Ŷ ∈ R^{N×10} (and other relevant quantities), where there is no “sum over n” in your expression.

2. (12 points) Decide which method you would like to use: the closed form expression, GD, or SGD. Specify your method along with all your parameters. On the training set, dev set, and test set, what are your average square losses and what is your misclassification % error? (A sketch of the one-vs-all bookkeeping is given below.)
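A minimal sketch of the one-vs-all bookkeeping: building the one-hot label matrix and classifying by the largest predicted score (illustrative; labels are assumed to be an integer array with values 0–9):

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Turn an (N,) array of integer labels into an (N, 10) one-hot matrix."""
    Y = np.zeros((labels.shape[0], num_classes))
    Y[np.arange(labels.shape[0]), labels] = 1.0
    return Y

def multiclass_error(X, labels, W):
    """Misclassification rate when predicting the class with the largest score in X @ W."""
    preds = np.argmax(X @ W, axis=1)
    return float(np.mean(preds != labels))
```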
3.2 EXTRA CREDIT: Take a matrix derivative on your own (5 points)

Prove equation 1. To do this, look up some facts about matrix derivatives on the internet (there are all sorts of “matrix cookbooks”, “cheat sheets”, etc. out there). Provide the one or two rules for how one takes a matrix derivative to obtain the proof. The proof should be just a few steps. Also, you should be able to convince yourself as to how this follows from the vector proof you did earlier.
3.3 EXTRA CREDIT: Let’s not forget about the softmax! (10 Points)

We now turn to the softmax classifier. Here, y takes values in the set {1, ..., k}. The model is as follows: we have k weight vectors, w^(1), w^(2), ..., w^(k). For ℓ ∈ {1, ..., k},

$$p_W(y = \ell \mid x) = \frac{\exp\!\left(w^{(\ell)} \cdot x\right)}{\sum_{i=1}^{k} \exp\!\left(w^{(i)} \cdot x\right)}.$$

Again, note that this is a valid probability distribution (the probabilities are positive and they sum to 1). Also, note that we have “over-parameterized” the model, since:

$$p_W(y = k \mid x) = 1 - \sum_{i=1}^{k-1} p_W(y = i \mid x).$$

We could define the model without using w^(k). However, the instructor likes this choice as the derivative expressions become a little simpler (and it makes it easier to re-use code). As before, it is helpful to define the “prediction vector”:

$$\hat{y}_n = p_W(y \mid x_n),$$

where we view p_W(y | x_n) as a k-dimensional (column) vector whose i-th component is p_W(y = i | x_n). A sketch of this probability computation is given below. As before, the (negative) log likelihood function on an N-size training set is:

$$L_\lambda(W) = -\frac{1}{N} \sum_{n=1}^{N} \log p_W(y = y_n \mid x_n) + \frac{\lambda}{2}\|W\|^2,$$

where the sum is over the N points in our training set (here y_n ∈ {1, ..., k}, so we are not using the one-hot encoding in this expression). For classification, one can choose the class with the largest predicted probability.

1. (8 points) Write out the derivative of the log-likelihood of the softmax function as a sum over the N datapoints in our training set and in terms of the vectors x_n, the one-hot encoding y_n, and the (vector) prediction ŷ_n. (Hint: at this point, you likely will be able to guess the answer. You should still be able to write out a concise derivation.) Make sure the dimensions give you a gradient of size d×k.

2. (2 points) Now write out the matrix algebra expression for this derivative in terms of X ∈ R^{N×d}, Y ∈ R^{N×10}, Ŷ ∈ R^{N×10} (and other relevant quantities).
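A minimal sketch of computing the softmax probability vectors for a batch (illustrative; the max-subtraction is a numerical-stability assumption, not part of the assignment):

```python
import numpy as np

def softmax_probs(X, W):
    """Row-wise softmax of the scores X @ W; returns an (N, k) matrix of probabilities."""
    scores = X @ W
    scores -= scores.max(axis=1, keepdims=True)   # stabilize before exponentiating
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)
```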
4 Probability and Maximum Likelihood Estimation (30 points)

4.1 Probability Review (12 points)

1. You are just told by your doctor that you tested positive for a serious disease. The test has 99% accuracy, which means that the probability of testing positive given that you have the disease is 0.99, and also that the probability of testing negative given that you do not have the disease is 0.99. The good news is that this is a rare disease, striking only 1 in 10,000 people.

(a) (1 point) Why is it good news that the disease is rare?

(b) (4 points) What is the probability that you actually have the disease? Show your work.

2. A group of students were classified based on whether they are senior or junior (random variable S) and whether they are taking CSE446 or not (random variable C). The following data was obtained.

                              Junior (S = 0)   Senior (S = 1)
    taking CSE446 (C = 1)           23               34
    not taking CSE446 (C = 0)       41               53

Suppose a student was randomly chosen from the group. Calculate the following probabilities. Show your work.

(a) (2 points) p̂(C = 1, S = 1)

(b) (2 points) p̂(C = 1 | S = 1)

(c) (2 points) p̂(C = 0 | S = 0)

3. (1 point) Why are there “hats” on the p (for probability) symbols above?
4.2 Maximum Likelihood Estimation (18 points)

This question uses a discrete probability distribution known as the Poisson distribution. A discrete random variable X follows a Poisson distribution with parameter λ if

$$p(X = k) = \frac{\lambda^k}{k!} e^{-\lambda} \qquad \forall k \in \{0, 1, 2, \ldots\}.$$

You work for the city of Seattle picking up ballots from ballot dropoff boxes around town. You visit the box near UW every hour on the hour and pick up the ballots that have been left there. Here are the number of ballots you picked up this morning, starting at 1am.

    time                     1am  2am  3am  4am  5am  6am  7am  8am
    new ballots picked up     6    4    2    7    5    1    2    5

Let G = ⟨G_1, ..., G_N⟩ be a random vector where G_n is the number of ballots picked up on iteration n.

1. (6 points) Give the log-likelihood function of G given λ.

2. (6 points) Compute the MLE for λ in the general case.

3. (6 points) Compute the MLE for λ using the observed G.
5 EXTRA CREDIT: Let’s get to state of the art on MNIST! (50 points)

If you seek extra credit for this problem, you must also do the extra credit problems in problems one and two, which include the mini-batch SGD method, the quadratic features problem, and the logistic regression problem. If you do this, you should have all the code you need to get going! Also, you must answer all of the regular questions as well.

We will now shoot to get to “state of the art” on MNIST, without “distortions” or “preprocessing”. The table at http://yann.lecun.com/exdb/mnist/ shows which strategies use this pre-processing. In fact, we will shoot to get “state of the art” performance with vanilla least squares. At the recent “NIPS” conference (one of the premier machine learning conferences), this talk (a youtube video) created quite some buzz; it is related to how we will tackle this problem. There were many differing opinions expressed in subsequent discussions on the state of machine learning. The instructor’s hope is that, with more hands-on experience, you will be better informed about the issues in play. The talk is relevant, since we are basically implementing the method discussed in this paper. The paper itself does not provide the most lucid justification; the method is really just a “quick and dirty” procedure to make features. In practice, there are often better feature generation methods; this one is remarkably simple.

In this problem, we will engage in the bad practice where we do not have a dev set. To a large extent, looking “a little” at the test set is done in practice (and this shouldn’t hurt us too much if we understand how confidence intervals work). However, this has been done for quite some time on this dataset, which is why the instructor is suspicious of the test errors below 1.2% among those methods that do not use “distortions” or “pre-processing” or “convolutional” methods (we should expect the latter methods to give performance bumps). The instructor’s view is that about 1.4% or less is “state of the art”, without “distortions” or “pre-processing” or “convolutional” methods (as discussed on the MNIST website). If we wanted even higher accuracy, we should really move to convolutional methods, which we may briefly discuss later in the class.

Finally, the approach below might seem a little nonsensical. However, an important lesson is that large feature representations, appropriately blown up, often perform remarkably well once you have a lot of labeled data.
Making the features

Grab the “mnist all 50pca dims.gz” dataset on Canvas. It contains all the data points reduced down to 50 dimensions. There is no dev set. And there are 60,000 training points. The inputs have been normalized so that the feature vectors x are, on average, unit length, i.e. E[‖x‖²] = 1.

Now let us try to make “better” features; we are not going to be particularly clever in the way we make these features, though they do provide remarkable improvements. Let x be an image (as a vector in R^d). Now we will map each x to a k-dimensional feature vector as follows: we will first construct k random vectors, v_1, v_2, ..., v_k (these will be sampled from a Gaussian distribution). In
other words, you first sample a matrix V ∈ R^{d×k}, where the columns of this matrix are v_1 to v_k; this can be done in python with the command np.random.randn(d,k). Then our feature vector will be the following vector:

$$\phi(x) = \left(\sin(2\, v_1^\top x),\, \sin(2\, v_2^\top x),\, \ldots,\, \sin(2\, v_k^\top x)\right).$$

Note that φ(x) is a k-dimensional vector; sin(·) is the usual trigonometric function; and the factor of 2 is a hyperparameter chosen by the instructor.¹ You are welcome to try and alter the 2 to another value if you find it works better. Note that you only generate V once; you always use the same V whenever you compute φ. A sketch of this feature computation (applied to a mini-batch) is given below.

We will use (drumroll please...) k = 60,000 features. This seems like an unwieldy number. However, it will not actually be so bad, since you never actually explicitly construct and store this dataset. You will construct it “on the fly”.
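A minimal sketch of the on-the-fly, mini-batch feature computation (illustrative; the seed and the smaller starting value of k are assumptions):

```python
import numpy as np

np.random.seed(0)          # assumption: fix a seed so the same V is reused across restarts
d, k = 50, 5_000           # start small; later move to k = 60_000
V = np.random.randn(d, k)  # sample V once and reuse it for every feature computation

def features(X_batch):
    """Map an (m, d) mini-batch to its (m, k) random features sin(2 * X_batch @ V)."""
    return np.sin(2.0 * (X_batch @ V))
```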
Tips

With only your laptop in hand (or the compute resources provided, which are hopefully not particularly impressive), this problem is just hard enough that it will force you to understand many of the issues at play in large scale machine learning. In fact, if you try to explicitly construct your feature matrix of size N × k, which is of size 60,000 × 60,000, you will hopefully run out of memory. Regardless, the problem is very much solvable, in a timely manner, with even meager compute resources. The suggestions below are more broadly applicable to how we address many of the issues in large scale machine learning.

• (mini-batching) Mini-batching helps. It is too costly to try to do full gradient updates. Use m = 50.

• (memory) As the dimension is large in this problem, we seek to avoid explicitly computing and storing the full feature matrix, which is of size N × k = N × N. Instead, you can compute the feature vector φ(x) on the fly, i.e. you recompute the vector φ(x) whenever you access an image x. In particular, you must do this on your mini-batch with matrix operations for your code to be fast enough. If X̃ is your m × d mini-batch data matrix, do you see how the matrix sin(2X̃V) relates to the features you desire? Here sin is applied component-wise.

• (regularization) It is up to you to determine how (and if) you set it. We do expect you to get good performance.

• (learning rates) With the square loss case, I like to set my learning rates large. And then I decay them only if I need to.

• (interrupting your code) Sometimes I find it helpful to be able to interrupt my code (with “Ctrl-C” or whatever you use) and have the ability to restart it without losing the current state of my parameters. Make sure you understand how to do this, and feel free to discuss this on the discussion board. This can be helpful. For example, for some problems, I may want to adjust my learning rate “by hand”, and this allows me to do this.

¹ The aforementioned normalization of the data by the instructor makes this factor of 2 naturally correspond to a certain scale of the data. You can understand this more by looking at the paper in the link. It is analogous to the choice of a “bandwidth” in certain radial basis function kernel methods.
5.1 Let’s implement it, starting small. (18 points)

It is best to start small, which makes it easier for you to debug your code. Start with k = 5,000 features.

1. (2 points) Roughly, what is the stepsize at which SGD starts to diverge? What value of λ did you use (if you used it)? Specify your learning rate scheme if you chose to decay your learning rate.

2. (3 points) After every 500 updates, make a plot showing your training average squared error and your test average squared error, with your average squared error on the y-axis and the iteration # on the x-axis. Both curves should be on one same plot. Also make sure to start these plots sufficiently many updates after 0 and to label the x-axis appropriately based on where you start plotting (if you start plotting at update 0, your plots will be difficult to read and interpret due to the average square loss initially dropping so quickly).

3. (3 points) For the misclassification error, make the same plot (again, with two curves; do not start your plot at update 0, and make sure it is readable).

4. (2 points) Plot the euclidean norm of the weight vector, where the x-axis is the iteration number and the y-axis is the norm of the weight vector at that update (you need only compute/store the norm every 500 iterations, as before). It is often helpful to plot the norms of your weight vectors, and you might find it striking how this curve behaves.

5. (3 points) What are the lowest training and test average squared losses achieved during your runs? Make sure you have run for long enough.

6. (5 points) What are the lowest training and test misclassification errors achieved during your runs? What is the smallest number of total mistakes made (out of the 60K points) on your training set and on your test set (over all updates)? Note you can just derive this from your lowest misclassification % errors on your train and test sets, respectively. Comment on overfitting.
5.2 Go big! (32 points)

Now let us use k = 60,000 features. Here, when you estimate your average squared and misclassification errors on your training set (for plotting purposes), you could use some fixed 10,000 training points (say the first 10K points in the training set) if you have issues with speed/memory (you do not want to ever create a 60K × 60K matrix). If you do this, you should still ensure you are training on all 60K training points. Credit for this problem will be based on the quality of your plots and your test error; we want you to figure out how to get good performance!

1. (2 points) Roughly, what is the stepsize at which SGD starts to diverge? What value of λ did you use (if you used it)? Specify your learning rate scheme if you chose to decay your learning rate.

2. (4 points) After every 500 updates, make a plot showing your training average squared error and your test average squared error, with your average squared error on the y-axis and the iteration # on the x-axis. Both curves should be on one same plot. Also make sure to start these plots sufficiently many updates after 0 and to label the x-axis appropriately based on where you start
plotting (if you start plotting at update 0, your plots will be difficult to read and interpret due to the average square loss initially dropping so quickly).

3. (4 points) For the misclassification error, make the same plot (again, with two curves; do not start your plot at update 0, and make sure it is readable).

4. (4 points) Plot the euclidean norm of the weight vector, where the x-axis is the iteration number and the y-axis is the norm of the weight vector at that update (you need only compute/store the norm every 500 iterations, as before). It is often helpful to plot the norms of your weight vectors, and you might find it striking how this curve behaves.

5. (2 points) What are the lowest training and test average squared losses achieved during your runs? Make sure you have run for long enough.

6. (8 points) What is the lowest training misclassification % error achieved over all of your runs? What is the smallest number of total mistakes made (out of the 60K points) on your training set and on your test set (over all updates)? Note you can just derive this from your lowest misclassification % errors on your train and test sets, respectively (remember that you need to divide by a factor of 100 when dealing with %’s!). If you estimated your training error with a 10K subset, then make sure to multiply by 6 when estimating the total number of errors on your training set.

7. (8 points) Provide a short discussion (about a paragraph) on overfitting. Do you see your training average squared error rise? Did you make an extremely small number of total mistakes on your training set, and was this very different from your test set? Comment on your findings.
6 EXTRA CREDIT: Proving a rate of convergence for GD for the least squares problem (20 points)

This is one of the most fundamental convergence results in mathematical optimization. With a good understanding of the SVD, the proofs are short and within your reach. (Please only provide an answer if you know you have a correct and rigorous proof; do not provide guesses or incomplete proofs.) Let us consider gradient descent on the least squares problem:

$$L(w) = \frac{1}{N} \cdot \frac{1}{2}\|Y - Xw\|^2.$$
Gradient descent is the update rule:

$$w^{(k+1)} = w^{(k)} - \eta\, \nabla L\!\left(w^{(k)}\right).$$

Let λ_1, λ_2, ..., λ_d be the eigenvalues of (1/N) X^⊤X in descending order (so λ_1 is the largest eigenvalue).

1. (8 points) In terms of the aforementioned eigenvalues, what is the threshold stepsize such that, for any η above this threshold, gradient descent diverges, and, for any η below this threshold, gradient descent converges? You must provide a technically correct proof with short explanations of your steps.
2. (8 points) Set η so that:

$$\left\|w^{(k+1)} - w^*\right\| \leq \exp(-\kappa)\left\|w^{(k)} - w^*\right\|,$$

where κ is some (positive) scalar. In particular, set η so that κ is as large as possible. What is the value of η you used and what is κ? Again, you must provide a proof. (Hint: you can state these in terms of λ_1 and λ_d.) The above equation shows a property called contraction.

3. (4 points) Now suppose that you want your parameter to be ε-close to the optimal one, i.e. you seek ‖w^(k) − w*‖ ≤ ε. How large does k need to be to guarantee this?
