Homework 2: Bayesian Methods and Multiclass Classification
Introduction
This homework is about Bayesian methods and multiclass classification. In lecture we have primarily focused on binary classifiers trained to discriminate between two classes. In multiclass classification, we discriminate between three or more classes. We encourage you to first read the Bishop textbook coverage of these topics, particularly Section 4.2 (Probabilistic Generative Models) and Section 4.3 (Probabilistic Discriminative Models). As usual, we imagine that we have an input matrix X ∈ R^{n×m} (or perhaps the inputs have been mapped to some basis Φ, without loss of generality), but our outputs are now “one-hot coded”. That means that, if there are c output classes, then rather than representing the output label y as an integer 1, 2, ..., c, we represent y as a binary vector of length c. This vector is zero in every component except for the one corresponding to the correct label, which is set to one. So, if there are 7 classes and a particular datum has label 3, then the target vector would be C_3 = [0, 0, 1, 0, 0, 0, 0]. If there are c classes, the set of possible outputs is {C_1, ..., C_c} = {C_k}_{k=1}^c. Throughout the assignment we will assume that the output y ∈ {C_k}_{k=1}^c.
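For concreteness, here is a minimal sketch (in numpy, with illustrative variable names that are not part of the assignment's interface) of converting integer labels into one-hot target vectors as described above:

```python
import numpy as np

def one_hot(labels, num_classes):
    """Convert 0-indexed integer labels into one-hot row vectors."""
    Y = np.zeros((len(labels), num_classes))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y

# 7 classes; label 3 in the 1-indexed convention above is index 2 here,
# giving the target vector C_3 = [0, 0, 1, 0, 0, 0, 0].
print(one_hot(np.array([2]), 7))
```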
The problem set has three problems:
• In the first problem, you will explore the properties of Bayesian estimation methods for the Bernoulli model, as well as the special case of Bayesian linear regression with a simple prior.
• In the second problem, you will dive into matrix algebra and the methods behind generative multiclass classification. You will extend the discrete classifiers that we saw in lecture to a Gaussian model.
• Finally, in the third problem, you will implement logistic regression as well as a generative classifier from close to scratch.
Problem 1 (Bayesian Methods, 10 pts) This question helps to build your understanding of the maximum-likelihood estimator (MLE) vs. the maximum a posteriori (MAP) estimator and the posterior predictive estimator, first in the Beta-Bernoulli model and then in the linear regression setting.
First, consider the Beta-Bernoulli model (see Lecture 5).
1. Write down the expressions for the MLE, MAP, and posterior predictive distributions. Then, for a prior θ ∼ Beta(4, 2) on the parameter of the Bernoulli and the data D = {0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0}, plot the three different estimates after each additional sample (see the sketch following this problem for one way to compute them).
2. Plot the posterior distribution on θ (the prior for 0 examples) after 0, 4, 8, 12, and 16 examples, using whatever tools you like.
3. Interpret the differences you see between the three different estimators.

Second, consider the Bayesian linear regression model, with data D = {(x_i, y_i)}_{i=1}^n, x_i ∈ R^m, y_i ∈ R, and generative model y_i ∼ N(w^T x_i, β^{-1}) for (known) precision β (which is just the reciprocal of the variance). Given this, the likelihood of the data is p(y|X, w) = N(y|Xw, β^{-1}I). Consider the special case of an isotropic (spherical) prior on the weights, p(w) = N(w|0, α^{-1}I). This prior makes sense when you have little prior information and do not know much about the relationships among features, so you can simplify by assuming independence.
4. Using the method from lecture of taking logs, expanding, pushing terms that don't depend on w into a constant, and finally collecting terms and completing the square, confirm that the posterior on the weights after data D is w ∼ N(w | m_n, S_n), where

S_n = (αI + βX^T X)^{-1},    m_n = β S_n X^T y.
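As a sanity check for part 1 (not a required part of your submission), the three estimates can be computed sequentially in a few lines; the sketch below assumes the standard closed forms for the Beta-Bernoulli model:

```python
import numpy as np

a, b = 4.0, 2.0                                    # Beta(4, 2) prior
D = [0,0,1,1,0,0,0,0,1,0,1,1,1,0,1,0]              # data from the problem statement

mle, map_est, post_pred = [], [], []
heads = 0
for n, x in enumerate(D, start=1):
    heads += x
    mle.append(heads / n)                              # MLE: sample mean
    map_est.append((heads + a - 1) / (n + a + b - 2))  # posterior mode
    post_pred.append((heads + a) / (n + a + b))        # P(x_new = 1 | D)

print(np.round(mle, 3))
print(np.round(map_est, 3))
print(np.round(post_pred, 3))
```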
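Likewise, once you have derived S_n and m_n in part 4, a numerical spot-check is easy to write. The sketch below uses synthetic data with assumed values of α and β (nothing here is part of the required deliverables):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 3
alpha, beta = 1.0, 25.0                    # prior and noise precisions (assumed known)
w_true = rng.normal(size=m)
X = rng.normal(size=(n, m))
y = X @ w_true + rng.normal(scale=beta ** -0.5, size=n)

# Posterior over the weights: w | D ~ N(m_n, S_n)
S_n = np.linalg.inv(alpha * np.eye(m) + beta * X.T @ X)
m_n = beta * S_n @ X.T @ y

print(m_n)   # should be close to w_true when n is large and the noise is small
```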
Problem 2 (Return of matrix calculus, 10pts) Consider now a generative c-class model. We adopt the class prior p(y = C_k; π) = π_k for all k ∈ {1, ..., c} (where π_k is a parameter of the prior). Let p(x|y = C_k) denote the class-conditional density of the features x (in this case for class C_k). Consider the data set D = {(x_i, y_i)}_{i=1}^n where, as above, y_i ∈ {C_k}_{k=1}^c is encoded as a one-hot target vector.

1. Write out the negated log-likelihood of the data set, −ln p(D; π).

2. Since the prior forms a distribution, it has the constraint that Σ_k π_k − 1 = 0. Using the hint on Lagrange multipliers below, give the expression for the maximum-likelihood estimator for the prior class-membership probabilities, i.e. π̂_k. Make sure to write out the intermediary equation you need to solve to obtain this estimator. Double-check your answer: the final result should be very intuitive!
For the remaining questions, let the class-conditional probabilities be Gaussian distributions with the same covariance matrix,

p(x|y = C_k) = N(x|µ_k, Σ), for k ∈ {1, ..., c},

and different means µ_k for each class.

3. Derive the gradient of the negative log-likelihood with respect to the vector µ_k. Write the expression in matrix form as a function of the variables defined throughout this exercise. Simplify as much as possible for full credit.
4. Derive the maximum-likelihood estimator for vector µk. Once again, your final answer should seem intuitive.
5. Derive the gradient for the negative log-likelihood with respect to the covariance matrix Σ (i.e., looking to find an MLE for the covariance). Since you are differentiating with respect to a matrix, the resulting expression should be a matrix!
6. Derive the maximum likelihood estimator of the covariance matrix.
Hint: Lagrange Multipliers. Lagrange Multipliers are a method for optimizing a function f with respect to an equality constraint, i.e.
min_x f(x)  s.t.  g(x) = 0.
This can be turned into an unconstrained problem by introducing a Lagrange multiplier λ and constructing the Lagrangian function, L(x,λ) = f(x) + λg(x).
It can be shown that a necessary condition for the optimum is that it be a critical point of this new function. We can find this point by solving two equations:
∂L(x,λ)/∂x = 0  and  ∂L(x,λ)/∂λ = 0.
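For example (a toy problem unrelated to the homework), minimizing f(x, y) = x² + y² subject to g(x, y) = x + y − 1 = 0 proceeds as follows:

```latex
\begin{align*}
\mathcal{L}(x, y, \lambda) &= x^2 + y^2 + \lambda (x + y - 1) \\
\frac{\partial \mathcal{L}}{\partial x} = 2x + \lambda = 0, \quad
\frac{\partial \mathcal{L}}{\partial y} &= 2y + \lambda = 0, \quad
\frac{\partial \mathcal{L}}{\partial \lambda} = x + y - 1 = 0 \\
\Rightarrow \quad x = y = -\tfrac{\lambda}{2}, \quad x + y = 1
&\;\Rightarrow\; \lambda = -1, \quad x = y = \tfrac{1}{2}.
\end{align*}
```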
Cookbook formulas. Here are some formulas you might want to consider using to compute difficult gradients. You can use them in the homework without proof. If you are looking to hone your matrix calculus skills, try to find different ways to prove these formulas yourself (this will not be part of the evaluation of this homework). In general, you can use any formula from the Matrix Cookbook, as long as you cite it. We opt for the following common notation: X^{-T} := (X^T)^{-1}.

∂(a^T X^{-1} b)/∂X = −X^{-T} a b^T X^{-T}

∂ ln|det(X)|/∂X = X^{-T}
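If you want to double-check a cookbook identity numerically before relying on it in a derivation, a finite-difference comparison is a quick sanity check. The sketch below (purely optional) verifies the log-determinant formula on a random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 4)) + 4 * np.eye(4)     # well-conditioned test matrix
eps = 1e-6

analytic = np.linalg.inv(X).T                   # claimed gradient: (X^{-1})^T
numeric = np.zeros_like(X)
for i in range(4):
    for j in range(4):
        E = np.zeros_like(X)
        E[i, j] = eps
        numeric[i, j] = (np.log(abs(np.linalg.det(X + E)))
                         - np.log(abs(np.linalg.det(X - E)))) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))       # should be tiny (roughly 1e-9 or less)
```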
3. Classifying Fruit [15pts]
You’re tasked with classifying three different kinds of fruit based on their heights and widths. Figure 1 is a plot of the data. Iain Murray collected these data, and you can read more about them on his website at http://homepages.inf.ed.ac.uk/imurray2/teaching/oranges_and_lemons/. We have made a slightly simplified version (collapsing the subcategories together) available as fruit.csv, which you will find in the GitHub repository. The file has three columns: type (1=apple, 2=orange, 3=lemon), width, and height. The first few lines look like this:
fruit,width,height
1,8.4,7.3
1,8,6.8
1,7.4,7.2
1,7.1,7.8
...
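One minimal way to load these data with numpy is shown below; the variable names are illustrative and not part of the required interface:

```python
import numpy as np

# fruit.csv columns: fruit type (1=apple, 2=orange, 3=lemon), width, height
data = np.loadtxt("fruit.csv", delimiter=",", skiprows=1)
y = data[:, 0].astype(int)            # class labels in {1, 2, 3}
X = data[:, 1:]                       # features: width (cm), height (cm)
print(X.shape, np.bincount(y)[1:])    # feature matrix shape and per-class counts
```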
Figure 1: Heights and widths of apples, oranges, and lemons. These fruit were purchased and measured by Iain Murray: http://homepages.inf.ed.ac.uk/imurray2/teaching/oranges_and_lemons/.
Problem 3 (Classifying Fruit, 15pts) You should implement the following:

• The three-class generalization of logistic regression, also known as softmax regression, for these data. You will do this by implementing gradient descent on the negative log-likelihood, and you will need to find good values for the learning rate η and the regularization strength λ. See the third practice problem in the Section 3 notes for information about multi-class logistic regression, softmax, and the negative log-likelihood. (A minimal sketch of one gradient step follows this problem statement.)

• A generative classifier with Gaussian class-conditional densities, as in Problem 2. In particular, make two implementations of this: one with a shared covariance matrix across all of the classes, and one with a separate covariance matrix learned for each class. Note that the staff implementation can switch between these two with the addition of just a few lines of code. In the separate-covariance case, the MLE for the covariance matrix of each class is simply the covariance of the data points assigned to that class, without combining them as in the shared case.
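As a hedged illustration of the first bullet (illustrative names only; this is not the required LogisticRegression interface), one gradient-descent update on the L2-regularized negative log-likelihood with one-hot targets Y and design matrix X might look like this:

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)        # subtract row max for numerical stability
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

def nll(W, X, Y, lam):
    """L2-regularized negative log-likelihood for weights W (shape m x c)."""
    P = softmax(X @ W)
    return -np.sum(Y * np.log(P + 1e-12)) + lam * np.sum(W ** 2)

def gradient_step(W, X, Y, eta, lam):
    P = softmax(X @ W)                          # n x c predicted class probabilities
    grad = X.T @ (P - Y) + 2 * lam * W          # gradient of the regularized NLL
    return W - eta * grad

# Usage sketch: initialize W (m x c), then repeat gradient_step until nll stabilizes,
# recording the loss at each iteration for the convergence plots requested below.
```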
You may use anything in numpy or scipy, except for scipy.optimize. That being said, if you happen to find a function in numpy or scipy that seems like it is doing too much for you, run it by a staff member on Piazza. In general, linear algebra and random variable functions are fine. The controller file is problem3.py, in which you will specify hyperparameters. The actual implementations you will write will be in LogisticRegression.py and GaussianGenerativeModel.py.
You will be given class interfaces for GaussianGenerativeModel and LogisticRegression in the distribution code, and the code will indicate certain lines that you should not change in your final submission. Naturally, don't change these; the classes allow the final submissions to be consistent. There are also a few hyperparameters that are currently set to irrelevant values, and you may need to modify these to get your methods to work. The classes you implement follow the same pattern as scikit-learn, so they should be familiar to you. The distribution code currently outputs nonsense predictions just to show what the high-level interface should be, so you should completely remove the given predict() implementations and replace them with your own.

• The visualize() method for each classifier will save a plot showing the decision boundaries. Include these plots in this assignment.
• Which classifiers model the distributions well?
• What explains the differences?

In addition to comparing the decision boundaries of the three models visually:

• For logistic regression, plot the negative log-likelihood loss for several configurations of hyperparameters, with iterations on the x-axis and loss on the y-axis. Note which configuration yields the best final loss. Why are your final choices of learning rate (η) and regularization strength (λ) reasonable? How does altering these hyperparameters affect convergence? Focus both on the ability to converge and on the rate at which it converges (a qualitative description is sufficient).
• For both Gaussian generative models, report the negative log-likelihood. In the separate-covariance case, be sure to use the covariance matrix that matches the true class of each data point. (A sketch of the shared-covariance fit follows below.)
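For the generative classifiers, the fit step reduces to the MLEs you derive in Problem 2. A sketch of the shared-covariance case is below (illustrative function names, not the required GaussianGenerativeModel interface); the separate-covariance variant simply estimates one Σ_k per class from that class's points:

```python
import numpy as np

def fit_shared_covariance(X, y, classes=(1, 2, 3)):
    """MLE fit: class priors, class means, and one covariance pooled over all classes."""
    n, m = X.shape
    pis, mus = {}, {}
    Sigma = np.zeros((m, m))
    for k in classes:
        Xk = X[y == k]
        pis[k] = len(Xk) / n
        mus[k] = Xk.mean(axis=0)
        Sigma += (Xk - mus[k]).T @ (Xk - mus[k])
    Sigma /= n                                    # pooled (shared) covariance
    return pis, mus, Sigma

def class_scores(x, pis, mus, Sigma):
    """log pi_k + log N(x | mu_k, Sigma), dropping the constant -(m/2) log(2*pi);
    the argmax over k gives the predicted class."""
    inv = np.linalg.inv(Sigma)
    _, logdet = np.linalg.slogdet(Sigma)
    scores = {}
    for k in pis:
        d = x - mus[k]
        scores[k] = np.log(pis[k]) - 0.5 * (logdet + d @ inv @ d)
    return scores
```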
Finally, consider a fruit with a width of 4 cm and a height of 11 cm. To what class does each of the classifiers assign this fruit? What do these results tell you about the classifiers' ability to generalize to new data? Repeat for a fruit with a width of 8.5 cm and a height of 7 cm.
Calibration [1pt]
Approximately how long did this homework take you to complete?