Assignment 4
1 Convexity: LinearandLogisticRegression(8points) Let us understand a few properties of linear regression (under the square loss) and logistic regression (under the log loss). Suppose we use sign(w · x) for predicting a label. Here, assume yn ∈{−1,1}. 1. (3points)Supposeourtrainingdataarelinearlyseparable. Supposewerungradientdescentfor the case of the linear regression: is our squared error converging to 0? Suppose we run gradient descent for the case of the logistic regression: is our log loss converging to 0? Why? 2. (2 points) Suppose now that d ≥ n, λ = 0, and our n data points are all linearly independent. Suppose we run gradient descent for the case of the linear regression: is our squared error converging to 0? Suppose we run gradient descent for the case of the logistic regression: is our log loss converging to 0? Why? 3. (3 points) Suppose we are running gradient descent (for the logistic regression problem under the log loss). Suppose at some iteration our misclassification error hits exactly 0 on our training set. If we continue to run gradient descent, do we expect that we will continue to update our parameters? Why or why not?
2 of 8
2 NeuralNetworksandNon-convexoptimization(42points) Assume that we have a fully interconnected neural network with L hidden layers and k output nodes(e.g. k = 10 formnist),sob y(x) isa k dimensionalvector. Supposethattheoutputatthelast layer is just the activation of the k output nodes, so there is no transfer function applied at the last layer. Assume that our input x is a d-dimensional vector and that we are working with the square loss.
2.1 ComputationalComplexity(andalittlemorelinearalgebra...) (14points) Suppose that multiplying a matrix of size n×k by a matrix of size k×d — resulting in a matrix of size n×d — takes time O(nkd) 1. Letusnowexaminesomecomputationalissuesandseehowtospeedupourcode(whenusing libraries like PyTorch and TensorFlow) by using matrix multiplications appropriately. Assumethatthetimetoevaluatethetransferfunction h(·) is Ch. Also,although h(·) isdefined as a mapping on a scalar variable, let us overload notation and allow us to apply h(·) to a vector (or a matrix), where we apply h(·) componentwise. So if v is a vector of length dv then h(v) will be a vector of length dv, and the computational time for this is dvCh. This overloading makes it relatively easy (and fast) to write code for the forward pass in a neural net. Assume that w(1),w(2),...w(L+1) are matrices of appropriate size (e.g. w(1) is a matrix of size d(1) ×d(0) where d(0) equals the input dimension d). 1. (3points)Giventheinputvector x,writeoutpseudocodefortheforwardpassintermsofmatrix multiplications, where you apply the transfer function to the vector of activations. The activations and outputs at any layer should be a vector. Specify the dimensions of all the activations and outputs. 2. (4 points) Now suppose we actually want to run the forward pass on a mini-batch of inputs, of size m. Suppose that X is an m×d matrix (so each row of X corresponds to an input vector now. This is the default use in the community). Now write out the forward pass, where you utilize matrix operations to whatever extent possible (and you apply h(·) componentwise): here youshouldhavematricesofactivationsandmatricesofoutputs,eachofwhichhas m rows. The output of the network will be the matrixb y(X), a matrix of size m×k. Specify the size of all the activations and outputs. Understanding how to do this is critical for writing efficient code in PyTorch and TensorFlow. (You should also understand how to compute the squared error efficiently without “for loops”). 3. (3 points) What is the computational time to compute the square loss on the mini-batch X in terms of d, the mini-batch size m, the number of nodes per level d(0),d(1),...,d(L+1) (note d(0) = d and d(L+1) = k), and the transfer function evaluation cost Ch? 1Do you see why O(nkd) is the naive runtime? It is worthwhile understanding why, as it is a straightforward argument. Also,thisistheruntimethatisactuallyobservedinpractice. Thetheoreticallyfasteralgorithms,mentioned in a previous footnote are not actually used in practice.
3 of 8
4. (3points)Assumethatthetimetoevaluatethegradientofthetransferfunctionis(withinaconstant of) Ch. What is the computational run time (using BackProp) to compute the gradient the square loss, again on the mini-batch X, with respect to all the parameters w(1),w(2),...w(L+1)? Why? 5. (1 points) Suppose we just wanted to compute the gradient of the parameters at the top level of the neural network, e.g. we just wanted to compute the ∂`(y,b y(X)) ∂w(L+1) . This partial derivative has far fewer parameters than computing the full gradient. Do you see a procedure whose runtime complexity is less than what you wrote above? If you see a procedure, write it out along with the runtime complexity. If you don’t see one, explain why you think it may be unreasonable to obtain a faster runtime. Remark: The above shows how it is easy to use linear algebra operations to write out the forward pass. Feel free to think about how you might write out the backward pass efficiently in terms of basic matrix operations! (This is a helpful exercise). If you think about it, you will see that there is no “direct” way to write out the backward pass in terms of matrix multiplications; we have to use Hadamard products, a componentwise multiplication operation. This does not necessarily mean the backward pass is any less efficient to compute (than what we discussed in class), provided one takes care in how one writes it in code. It does mean that writing your own algorithm to compute the backward pass takes a little care to make it fast (to avoid “for loops”), while the forward pass is relatively straightforward to code.
2.2 SaddlepointsandSymmetries(12points) Here, the underlying function of interest is the square loss on the training set. 1. (2 point) Suppose our MLP has 0 hidden layers (e.g. we are working with linear regression). Is setting all the weights to 0 a saddle point? Why or why not? 2. (3 points) Assume now our MLP has one or more hidden layers. Consider the tanh(·) transfer function. Suppose we set all our weights to 0. Is this point a saddle point? If so, give a precise proof that this is a saddle point. Else, give a counterexample. 3. (5 points) Again, assume our MLP has one or more hidden layers. Now let us consider the sigmoid transfer function. Consider initializing all the weights to 0. For this case, do we start at a saddle point? If we run gradient descent from this initialization, what can we say about how the weights in any given layer relate to each other? 4. (2 points) Consider the tanh(·) transfer function. Suppose we have a one hidden layer neural network. Suppose we flip the sign of all the weights. What happens tob y(x)? What happens in an L-hidden layer network? 2.3 RepresentationandNon-linearDecisionBoundaries(16points) Assume our target y is a scalar. Here, when dealing with an MLP, assume we just have one output node (where this node is linear in its activation). 1. Let us consider a linear and a quadratic transfer function:
4 of 8
(a) (4points)Supposewehavean L-layerneuralnetworkandthetransferfunctionistheidentity function. Provide an alternative model to using this network in which: 1) all local optima are global optima (and there are no saddle points) and 2) every function that can be represented in this neural network can be represented in your model. Provide concise reasoning for your answer. (b) (5 points) Suppose we have a one hidden layer neural network and the transfer function is h(a) = a2. Provide an alternative model to using this network which: 1) all local optima are global optima (and there are no saddle points) and 2) every function that can represented in this neural network can be represented in your model. Provide concise reasoning for your answer. (Hint: think about an appropriate feature mapping). 2. Let us consider the parity problem (sometimes referred to as the XOR problem) with two variables. We have seen this problem as it is our usual example of a non-separable dataset. It also has historical significance, due to Minksy and Papert [1] pointing out that the perceptron algorithm is not able to learn this function, which subsequently lead to a decreased interest in neural learning models. For two dimensional inputs, the XOR function is specified as follows: x = (x[1],x[2]),whereeachcoordinateof x isin{−1,1}and y ∈{−1,1}. Foragiveninput x, theXORfunctionassigns y tobe 1 ifandonlyifonlybothcoordinatesof x areequal,i.e. y = 1 if and only if x = (−1,−1) or x = (1,1). (a) (2 points) Give a feature mapping φ(x) using only linear and quadratic terms so that a linear classifier, i.e. sign(w·φ(x)), can represent this function with 0 misclassification error.(b) (5 points) Provide a one hidden layer neural network, with a tanh(·) transfer function and whichhasnomorethantwohiddennodes,whichcanrepresentthisfunction(with 0 misclassification error). You may assume your prediction is sign(b y(x)) (you are not allowed break ties in your favor ifb y(x) = 0). Specifically, you must provide your weights, and you are free to use a bias term in your activations (see Bishop or the lecture notes for the definition of a bias term in a neural network). Prove that your network’s predictions matches the XOR function. Remark: The XOR function with d dimensional inputs is more tricky to represent and far more problematic to learn from samples.
3 MLPs: Let’strythemoutonMNIST(40points) You will use the full dataset (the same as we used in assignment 2), with all 10 classes. You are free to use PyTorch, TensorFlow, or directly write your own backprop code. Please note which method you used. You will have 10 output nodes, one for each class. The loss function you should use is the square loss. You must use (mini-batch) SGD for this problem. We do expect you find a way to get reasonable performance (compared to what is achievable for the given architecture). Training neural nets can be a bit of an art, until you get used to it. Use regularization if you find it helpful. All your plots must have the x-axis correspond to the “effective” epoch number, so 1 unit on the x-axiscorrespondstowhen 50K trainingpointshavebeen“touched”(whenyoualgorithmhas
5 of 8
used 50K points). You should evaluate (and plot) your train, dev, and test set errors every “half epoch”, i.e. after 25K training points have been “touched”; e.g. if your mini-batch size is 10, then you should be evaluating your train, dev, and test set every 2500 iterations. If your batch size is 200, then these evaluations should be every 125 iterations. Remark: Notethataunitof5onyourx-axisiscomparabletoaunitof5ononeofyourfellow students x-axis (even if you both used different mini-batch sizes) in the following sense: at this point on both of your curves, you both touched the same number points (and performed the same number of basic computations) though you may have used a different number of iterations (if you had differing mini-batch sizes). Note that “wall-clock time” to get to 1-epoch will be determined byyourmini-batchsizeduetothat“mini-batchcomputations”canbedonemorequickly(provided you utilize matrix multiplications appropriately). This does not necessarily mean you want to use larger mini-batch sizes. Things you can try (please state which one you use): • You are free to try out/use “momentum” in the optimizer. The usual setting of the momentum parameter is 0.9. Almost everyone finds it is a pretty handy to keep momentum on (and just use 0.9 as the parameter). • You are free to try using the “softmax” at the top layer. In PyTorch, this is a one line change going to the “cross-entropy” loss.
3.1 Trytolearnaonehiddenlayernetwork. 1. (20points)Useaonehiddenlayernetworkwith d(1) = 100 hiddennodes. Thetransferfunction should be a sigmoid. Also, for each of the 10 output nodes, let us also use a sigmoid activation (so each output is bounded between 0 and 1). This means that the 10-dimensional output is b y(x) = h(w(2)z(1)), where h(·) is the sigmoid function (applied componentwise as w(2) is a 10×d(1) sized matrix). (a) (2 points) Now run stochastic (mini-batch) gradient descent. Specify all your parameter choices: your mini-batch size, your regularization parameter, and your learning rates (if you alterthelearningrates,makesureyoupreciselystatewhenitisdecreased). Also,specifyhow you initialize your parameters. (b) (3 points) For what learning rate do you start observing a non-trivial decrease in your error. If you learning rate is larger than this, do you diverge and do you get “NaN’s”? Why or why not? (c) (5points)Makeaplotshowingyourtrainingaveragesquarederror,yourdevelopmentaverage squarederror,andyouraveragetestsquarederroronthey-axisandtheiterationonthex-axis. All three curves should be on the same plot. (d) (5 points) Make this plot again (with all three curves), except use the misclassification error, as a percentage, instead of average squared error. Here, make sure to start your x-axis at a slightly later iteration, so that your error starts below 20%, which makes the behavior more easytoview(itisdifficulttoviewthelongrunbehaviorifthey-axisisovertoolargearange). Again, it is expected that you obtain a good test error (meaning you train long enough and you regularize appropriately, if needed). Report the lowest test error.
6 of 8
(e) (5 points) Note that the (input) weights to any hidden node correspond to a weighting over the image. This means you can try to visualize the (input) weights corresponding to any hidden node! Provide 8 plots corresponding to 8 different nodes (see Canvas for a function which shifts an image to be between 0 and 1 which could helpful in plotting). Provide a brief interpretation. 2. (20 points) Now replace all the transfer functions in the previous network with the “ReLu” transfer function. Again, provide answers to the previous 5 questions for this network.
3.2 EXTRACREDIT:BefreewithyourMLP!(10points) Now be creative and try to get good performance with an MLP. Merge your training set and the dev set (sigh...), as you may be interested in comparing your numbers to the MNIST table on http://yann.lecun.com/exdb/mnist/. You should try a network with more than one hidden layer. Again, provide answers to the previous 5 questions for this network (and specify the network architecture in the answer to the first question). Here, note that there is no dev set. Also, it is difficulttovisualizetheparametersinanylayerbeyondthefirsthiddenlayer,soonlymakeimages out of the first hidden layer. IfyoudidQ5fromthepreviousHW(andgotalowerror),thenIchallengeyoutobeatit! (This is not easy to do. It is pretty hard for the instructor to do better than Q5 with an MLP.)
4 EXTRACREDIT:ConvolutionalNeuralNetsonMNIST (25points) Try out a convolutional neural network. This is aconsiderablymore expensiveprocedure (though, with a little search, you might find your error drops very quickly with an architecture that is not toocostlyforyoutocompute.) Youwillfindthatyour’architecture’choicesaregovernedbyyour computational resources. This is not an unreasonable practical lesson: many of the choices we make in practice are largely due to our computational and/or memory constraints (when we have GPUs, we often run on large enough problems so that we are at the limits of our computational resources). Again, provide answers to the previous 5 questions. For this problem, in the first question, specify your architecture, including the size of your “filters”, your “pooling region” and your “stride” (I would use average pooling). As before, merge your dev set in the training set. Also, note that your visualized weights are going to be the learned filters (so they will be a smaller size than the image), which will depend on the size of the filters that you use.
7 of 8
5 EXTRACREDIT:Non-convexoptimizationandconvergence ratestostationarypoints(25points) (To obtain credit on this problem, you must do the first part of this question. ) Let us say a function F : Rd →Ris L -smooth if k∇F(w)−∇F(w0)k≤ Lkw−w0k, where the norm is the Euclidean norm. In other words, the derivatives of F do not change too quickly. Gradient descent, with a constant learning rate, is the algorithm: w(k+1) = w(k) −η·∇F(w(k)) In this question, we do not assume that F is convex. If you find it helpful, you can assume that F is twice differentiable. 1. (15 points) Now let us bound the function value decrease at every step. In particular, show that the following holds (for all η): F(w(k+1)) ≤ F(w(k))−ηk∇F(w(k))k2 + 1 2 η2Lk∇F(w(k))k2 Itisfinetoprovealooserversionofthisbound,wherethefactorof1/2isreplacedby1. Brownie pointsifyougetthefactorof 1/2. (Hint: Taylor’stheoremisthenaturalstartingpoint. Youmay also want to consider what smoothness implies about the second derivative. If you think about the intermediate value theorem, you can actually get the factor of 1/2). 2. (3points)Letusnowshowthatifthegradientislarge,thenitispossibletosubstantiallydecrease the function value. Precisely, show that with an appropriate setting of η, we have that: F(w(k+1)) ≤ F(w(k))− 1 2Lk∇F(w(k))k2 3. (7 points) Now let F(w∗) be the minimal function value (i.e. the value at the global minima). Arguethatgradientdescentwillfinda w(k) thatis“almost”astationarypointinabounded(and polynomial) number of steps. Precisely, show that there exist some k where:
k ≤
2L(F(w(0))−F(w∗)) ?
such that
k∇F(w(k))k2 ≤ ?. (Hint: you could consider a proof by contradiction. Also, note that k∇F(w(k))k may not be decreasing at every step.)
References [1] M. Minsky and S. Papert. Perceptrons: An Introduction to Computational Geometry. MIT Press, Cambridge, MA, 1969.
8 of 8