Homework 1

The homework is generally split into programming exercises and written exercises.

PROGRAMMING EXERCISES

1. Digit Recognizer
(a) Join the Digit Recognizer competition on Kaggle. Download the training and test data. The competition page describes how these files are formatted.
(b) Write a function to display an MNIST digit. Display one of each digit.
(c) Examine the prior probability of the classes in the training data. Is it uniform across the digits? Display a normalized histogram of digit counts. Is it even?
(d) Pick one example of each digit from your training data. Then, for each sample digit, compute and show the best match (nearest neighbor) between your chosen sample and the rest of the training data. Use L2 distance between the two images’ pixel values as the metric. This probably won’t be perfect, so add an asterisk next to the erroneous examples.
(e) Consider the case of binary comparison between the digits 0 and 1. Ignoring all the other digits, compute the pairwise distances for all genuine matches and all impostor matches, again using the L2 norm. Plot histograms of the genuine and impostor distances on the same set of axes.
(f) Generate an ROC curve from the above sets of distances. What is the equal error rate? What is the error rate of a classifier that simply guesses randomly?
(g) Implement a K-NN classifier.
(h) Using the training data for all digits, perform 3-fold cross-validation on your K-NN classifier and report your average accuracy.
(i) Generate a confusion matrix (of size 10 × 10) from your results. Which digits are particularly tricky to classify?
(j) Train your classifier with all of the training data, and test your classifier with the test data. Submit your results to Kaggle.

2. The Titanic Disaster
(a) Join the Titanic: Machine Learning From Disaster competition on Kaggle. Download the training and test data.
(b) Using logistic regression, try to predict whether a passenger survived the disaster.
You can choose the features (or combinations of features) you would like to use or ignore, provided you justify your reasoning.
(c) Train your classifier using all of the training data, and test it using the testing data. Submit your results to Kaggle.

CS5785 Fall 2015: Homework 1

WRITTEN EXERCISES

1. Variance of a sum. Show that the variance of a sum is $\mathrm{var}[X + Y] = \mathrm{var}[X] + \mathrm{var}[Y] + 2\,\mathrm{cov}[X, Y]$, where $\mathrm{cov}[X, Y]$ is the covariance between random variables $X$ and $Y$.

2. Bayes rule for medical diagnosis (Source: Koller). After your yearly checkup, the doctor has bad news and good news. The bad news is that you tested positive for a serious disease, and that the test is 99% accurate (i.e., the probability of testing positive given that you have the disease is 0.99, as is the probability of testing negative given that you do not have the disease). The good news is that this is a rare disease, striking only one in 10,000 people. What are the chances that you actually have the disease? (Show your calculations as well as giving the final result.)

3. Gradient and Hessian of log-likelihood for logistic regression.
(a) Let $\sigma(a) = \frac{1}{1 + e^{-a}}$ be the sigmoid function. Show that $\frac{d\sigma(a)}{da} = \sigma(a)(1 - \sigma(a))$.
(b) Using the previous result and the chain rule of calculus, derive the expression for the gradient of the log likelihood given in HTF Eqn. 4.21.
(c) As noted in HTF Eqn. 4.25, the Hessian matrix for the log likelihood can be written (up to a sign) as $X^T W X$. Prove that this matrix is positive definite.
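
For written exercise 3(a), a quick numerical check can complement the proof. The sketch below is not a derivation; it only compares a central finite-difference estimate of the sigmoid's derivative with the claimed closed form on a small grid of points. The function names, the step size `h`, and the grid are illustrative assumptions, not part of the assignment.

```python
import math


def sigmoid(a: float) -> float:
    """Logistic sigmoid 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))


def numerical_derivative(f, a: float, h: float = 1e-5) -> float:
    """Central finite-difference approximation of f'(a) (h is an assumed step size)."""
    return (f(a + h) - f(a - h)) / (2.0 * h)


def max_identity_gap(points) -> float:
    """Largest absolute gap between the finite-difference derivative
    and the closed form sigmoid(a) * (1 - sigmoid(a)) over the given points."""
    return max(
        abs(numerical_derivative(sigmoid, a) - sigmoid(a) * (1.0 - sigmoid(a)))
        for a in points
    )


if __name__ == "__main__":
    # Hypothetical evaluation grid: a = -5.0, -4.5, ..., 5.0
    grid = [x / 2.0 for x in range(-10, 11)]
    print("max gap:", max_identity_gap(grid))
```

A small gap on the grid is consistent with the identity but does not prove it; the analytic argument is still what the exercise asks for.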