Comp540 Machine Learning: Homework 2

The code base hw3.zip for the assignment is an attachment to Assignment 3 on Canvas. You will add your code at the indicated spots in the files there. Place your answers to Problems 1, 2, 3, 4, 5 (typeset) in a file called writeup.pdf. Please follow the new submission instructions: set up a group for yourself if you have not already done so, or if your group composition has changed. When you submit, please submit the following THREE items as separate attachments before the due date and time:

• Your writeup pdf, named writeup.pdf.
• Your Jupyter notebooks, saved in HTML format. If there are multiple notebooks, submit each one separately.
• The zip file containing your work (code, etc.). Be sure that the datasets you use are NOT part of the zip, as this will increase its size considerably.

1 MAP and MLE parameter estimation (10 points)

Consider a data set D = {x^(i) | 1 ≤ i ≤ m} where each x^(i) is drawn from a Bernoulli distribution with parameter θ. The elements of the data set are the results of the flips of a coin, where x^(i) = 1 represents heads and x^(i) = 0 represents tails. We will estimate the parameter θ, the probability of the coin coming up heads, using the data set D.

• (5 points) Use the method of maximum likelihood estimation to derive an estimate for θ from the coin flip results in D.

• (5 points) Assume a beta prior distribution on θ with hyperparameters a and b. The beta distribution is chosen because it has the same form as the likelihood function for D derived under the Bernoulli model (such a prior is called a conjugate prior):

    \mathrm{Beta}(\theta \mid a, b) \propto \theta^{a-1} (1 - \theta)^{b-1}

  Derive the MAP estimate for θ using D and this prior distribution. Show that under a uniform prior (a Beta distribution with a = b = 1), the MAP and MLE estimates of θ are equal.

2 Logistic regression and Gaussian Naive Bayes (15 points)

Consider a binary classification problem with dataset D = {(x^(i), y^(i)) | 1 ≤ i ≤ m; x^(i) ∈ …

For the softmax regression problems below, our probabilistic model hθ(x) is defined as

    h_\theta(x) = \begin{bmatrix} P(y = 1 \mid x; \theta) \\ P(y = 2 \mid x; \theta) \\ \vdots \\ P(y = K \mid x; \theta) \end{bmatrix}

where

    P(y = k \mid x; \theta) = \frac{\exp(\theta^{(k)T} x)}{\sum_{j=1}^{K} \exp(\theta^{(j)T} x)}

The parameter θ is a (d+1) × K matrix, where each column is the parameter vector for class k = 1, ..., K:

    \theta = \begin{bmatrix} | & | & \dots & | \\ \theta^{(1)} & \theta^{(2)} & \dots & \theta^{(K)} \\ | & | & \dots & | \end{bmatrix}

Hint: numerical stability issues can come up in the computation of P(y = k | x; θ). Consider K = 3 and θᵀx = [123, 456, 789]. To compute P(y = k | x; θ) from these scores, we need to calculate exp(123), exp(456) and exp(789) and sum them; these are very large numbers. However, we can get the same probabilities by subtracting the maximum (789) from every element of θᵀx. Then we have the vector [−666, −333, 0], and we can calculate exp(−666), exp(−333) and exp(0), sum them (call the sum S), and then calculate exp(−666)/S, exp(−333)/S and exp(0)/S. (A short code sketch of this trick appears below, after the definition of the cost function.)

The cost function J(θ) for softmax regression is derived from the negative log likelihood of the data D, assuming that P(y | x; θ) = hθ(x) as defined above:

    J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} I\{y^{(i)} = k\} \log \frac{\exp(\theta^{(k)T} x^{(i)})}{\sum_{j=1}^{K} \exp(\theta^{(j)T} x^{(i)})} + \frac{\lambda}{2m} \sum_{j=0}^{d} \sum_{k=1}^{K} \left(\theta_j^{(k)}\right)^2

where I{c} is the indicator function, which evaluates to 1 when c is a true statement and to 0 otherwise. The second term is a regularization term, where λ is the regularization strength. While it is customary to exclude the bias term from L2 regularization, we include it here because it does not make a huge difference in the final result. You can check this for yourself on the CIFAR-10 dataset.
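The max-subtraction trick from the hint can be made concrete with a few lines of NumPy. This is only an illustrative sketch (the function name stable_softmax is not part of the provided code base); the same idea applies inside softmax_loss_naive when you compute the class probabilities.

```python
import numpy as np

def stable_softmax(scores):
    """Softmax probabilities computed without overflow.

    Subtracting the maximum score leaves the probabilities unchanged, since
    exp(s - c) / sum(exp(s - c)) == exp(s) / sum(exp(s)) for any constant c.
    """
    shifted = scores - np.max(scores)   # e.g. [123, 456, 789] -> [-666, -333, 0]
    exps = np.exp(shifted)
    return exps / np.sum(exps)          # exp(-666)/S, exp(-333)/S, exp(0)/S

# The example from the hint: K = 3 and theta^T x = [123, 456, 789].
print(stable_softmax(np.array([123.0, 456.0, 789.0])))
```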
Problem 7.1: Implementing the loss function for softmax regression (naive version)

You should implement this loss function, using for loops for the summations, in the function softmax_loss_naive in softmax.py. Once you have the loss function implemented, a cell in the notebook softmax_cifar.ipynb will run your loss function for a randomly initialized θ matrix on the 49,000 training images and labels, with λ set to 0. You should expect to see a value of about −log_e(0.1) (why?).

Problem 7.2: Implementing the gradient of the loss function for softmax regression (naive version) (5 points)

The derivative of the loss function J(θ) with respect to θ^(k) is

    \nabla_{\theta^{(k)}} J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ x^{(i)} \left( I\{y^{(i)} = k\} - P(y^{(i)} = k \mid x^{(i)}; \theta) \right) \right] + \frac{\lambda}{m} \theta^{(k)}

Implement this analytical derivative computation in softmax_loss_naive in softmax.py.

Checking your gradient implementation with numerical gradients

In the previous problem you implemented the analytical gradient derived using calculus. We check your implementation against a numerical gradient computed by the method of finite differences. The functions in gradient_check.py compute the numerical gradient of a function f as

    \frac{\partial f(x)}{\partial x} \approx \frac{f(x + h) - f(x - h)}{2h}

for a very small h. A cell in the softmax_cifar.ipynb notebook will check your gradient against the numerically approximated gradient; you should expect to see differences between the two gradients of the order of 10^−7 or less.

Figure 1: A sample of the CIFAR-10 dataset.

Problem 7.3: Implementing the loss function for softmax regression (vectorized version) (10 points)

Now complete the function softmax_loss_vectorized in softmax.py to implement the loss function J(θ) without using any for loops. Re-express the computation in terms of matrix operations on X, y and θ.

Problem 7.4: Implementing the gradient of the loss for softmax regression (vectorized version) (5 points)

Now vectorize the gradient computation in softmax_loss_vectorized in softmax.py. Once you complete this, a cell in softmax_cifar.ipynb will run and time your naive and vectorized implementations; you should expect to see at least an order of magnitude difference in run time between the two.

Problem 7.5: Implementing mini-batch gradient descent (5 points)

In large-scale applications, the training data can have millions of examples, so it is wasteful to compute the loss function over the entire training set in order to perform only a single parameter update. A very common approach is to compute the gradient over batches of the training data; for example, a typical batch contains 256 examples drawn from a training set of over 1.2 million. The batch is then used to perform a parameter update:

    \theta^{(k)} \rightarrow \theta^{(k)} - \alpha \nabla_{\theta^{(k)}} J(\theta)

where α is the step size or learning rate for gradient descent. Implement mini-batch gradient descent in the method train in softmax.py, using the description provided in the documentation of the method. You can set the verbose argument of train to True and observe how the loss function varies with the iteration number.
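As a rough illustration of the update loop described in Problem 7.5, here is a schematic sketch of mini-batch gradient descent. It assumes a hypothetical function loss_and_grad(theta, X_batch, y_batch) that returns the loss and its gradient; the actual train method in softmax.py has its own interface, described in its documentation.

```python
import numpy as np

def minibatch_gd(X, y, loss_and_grad, theta, alpha=1e-6, batch_size=256, num_iters=1000):
    """Schematic mini-batch gradient descent.

    X: (m, d+1) design matrix; y: (m,) integer labels; theta: (d+1, K) parameters.
    loss_and_grad(theta, X_batch, y_batch) is assumed to return (loss, grad),
    where grad has the same shape as theta.
    """
    m = X.shape[0]
    loss_history = []
    for it in range(num_iters):
        # Sample a random batch of examples for this update.
        idx = np.random.choice(m, batch_size, replace=True)
        loss, grad = loss_and_grad(theta, X[idx], y[idx])
        loss_history.append(loss)
        theta = theta - alpha * grad    # theta^(k) -> theta^(k) - alpha * grad, column by column
    return theta, loss_history
```

Sampling with replacement keeps each iteration cheap; cycling through shuffled epochs is an equally common variant.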
Problem 7.6: Using a validation set to select the regularization strength λ and the learning rate for gradient descent (5 points)

There are many hyperparameters to be selected for mini-batch gradient descent: the batch size, the number of iterations, and the learning rate. For the loss function, we also need to select λ, the regularization strength. In this exercise, we have pre-selected a batch size of 400 and an iteration count of 4000. Now use the validation set provided to sweep the learning rate and the λ parameter space, using the suggested values in softmax_cifar.ipynb, to find the best combination of these two hyperparameters. Fill in the TODO code in the marked cell in softmax_cifar.ipynb.

Problem 7.7: Training a softmax classifier with the best hyperparameters (5 points)

Once you find the best values of λ and the learning rate, insert code in the notebook softmax_cifar.ipynb to train a softmax classifier on the training data with these hyperparameters, and save the classifier in the variable best_softmax. The classifier will be evaluated on the set-aside test set, and you should expect to see an overall accuracy of over 35%.

Visualizing the learned parameter matrix

We can remove the bias term from the θ matrix, reshape each column of the matrix (a parameter vector of size 3072) back into an array of size 32 × 32 × 3, and visualize the result as an image. softmax_cifar.ipynb constructs this plot and displays it inline. You should see a plot like the one shown in Figure 2.

Figure 2: Visualization of θ for the CIFAR-10 dataset. The color scale runs from red (low values) to blue (high values).

Compute the confusion matrix on the test set for your predictor (you can use the sklearn function) and interpret the visualized coefficients in light of the errors made by the classifier.

Extra credit: Problem 7.8: Experimenting with other hyperparameters and the optimization method (10 points)

We chose a batch size of 400 and 4000 iterations for our previous experiments. Explore larger and smaller batch sizes, choosing an appropriate number of iterations with the validation set (by specifying a tolerance on the differences in the values of the loss function, or its gradient, between successive iterations). Produce plots that show the variation of test set accuracy as a function of batch size and number of iterations. You will have to determine the right settings of regularization strength λ and learning rate for each batch size/number of iterations combination. What is the best batch size/number of iterations/learning rate/regularization strength combination for this problem? What is the best test set accuracy that can be achieved by this combination?

Problem 7.9: Comparing OVA binary logistic regression with softmax regression (5 points)

Compare the performance results from your OVA and softmax regression classifiers. Provide a table with the classification performance of each classifier on each CIFAR-10 category (a sketch of one way to compute such a table appears below). Explain performance differences, if any, or explain why their performance is similar, if it is. Which approach would you recommend for the CIFAR-10 classification problem? Place your answers to these questions in writeup.pdf.
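For the confusion matrix and the per-category table asked for above, something along the following lines may be a useful starting point. The arrays y_test, y_pred_ova and y_pred_softmax are stand-ins for the CIFAR-10 test labels and your two classifiers' predictions; here they are filled with random values only so that the snippet runs on its own.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

classes = ['plane', 'car', 'bird', 'cat', 'deer',
           'dog', 'frog', 'horse', 'ship', 'truck']

# Stand-in data so the snippet is self-contained; in the notebook these would be
# the CIFAR-10 test labels and the predictions of the OVA and softmax classifiers.
rng = np.random.default_rng(0)
y_test = rng.integers(0, 10, size=1000)
y_pred_ova = rng.integers(0, 10, size=1000)
y_pred_softmax = rng.integers(0, 10, size=1000)

def per_class_accuracy(y_true, y_pred):
    # Rows of the confusion matrix are true classes, columns are predictions,
    # so the diagonal divided by the row sums gives per-category accuracy.
    cm = confusion_matrix(y_true, y_pred, labels=list(range(10)))
    return np.diag(cm) / cm.sum(axis=1)

for name, a_ova, a_soft in zip(classes,
                               per_class_accuracy(y_test, y_pred_ova),
                               per_class_accuracy(y_test, y_pred_softmax)):
    print(f'{name:8s}  OVA: {a_ova:.3f}  softmax: {a_soft:.3f}')
```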
