Homework 2: Bayesian Methods and Multiclass Classification

Introduction
This homework is about Bayesian methods and multiclass classification. In lecture we have primarily focused on binary classifiers trained to discriminate between two classes. In multiclass classification, we discriminate between three or more classes. We encourage you to first read the Bishop textbook coverage of these topics, particularly Section 4.2 (Probabilistic Generative Models) and Section 4.3 (Probabilistic Discriminative Models). As usual, we imagine that we have the input matrix X ∈ R^{n×m} (or perhaps the inputs have been mapped to some basis Φ, without loss of generality), but our outputs are now "one-hot coded". What that means is that, if there are c output classes, then rather than representing the output label y as an integer 1, 2, ..., c, we represent y as a binary vector of length c. This vector is zero in every component except for the one corresponding to the correct label, and that entry is a one. So, if there are 7 classes and a particular datum has label 3, then the target vector is C_3 = [0, 0, 1, 0, 0, 0, 0]. If there are c classes, the set of possible outputs is {C_1, ..., C_c} = {C_k}_{k=1}^c. Throughout the assignment we will assume that the output y ∈ {C_k}_{k=1}^c.
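As a quick illustration (optional; the array contents and names here are our own examples, not part of the assignment data), one-hot coding can be produced in numpy as follows:

import numpy as np

c = 7                                  # number of classes
labels = np.array([3, 1, 7])           # integer labels in {1, ..., c}
Y = np.eye(c)[labels - 1]              # row i is the one-hot vector for labels[i]
print(Y[0])                            # [0. 0. 1. 0. 0. 0. 0.], i.e. C_3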
The problem set has three problems:
• In the first problem, you will explore the properties of Bayesian estimation methods for the Bernoulli model, as well as the special case of Bayesian linear regression with a simple prior.
• In the second problem, you will dive into matrix algebra and the methods behind generative multiclass classification. You will extend the discrete classifiers that we saw in lecture to a Gaussian model.
• Finally, in the third problem, you will implement logistic regression as well as a generative classifier from close to scratch.
Problem 1 (Bayesian Methods, 10 pts) This question helps to build your understanding of the maximum-likelihood estimator (MLE) vs. the maximum a posteriori (MAP) estimator and the posterior predictive estimator, first in the Beta-Bernoulli model and then in the linear regression setting.
First, consider the Beta-Bernoulli model (see Lecture 5).
1. Write down the expressions for the MLE, MAP, and posterior predictive distributions. Then, for a prior θ ∼ Beta(4, 2) on the parameter of the Bernoulli and data D = {0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0}, plot the three different estimates after each additional sample. (An optional code sketch covering parts 1, 2, and 4 appears at the end of this problem.)
2. Plot the posterior distribution (the prior, for 0 examples) on θ after 0, 4, 8, 12, and 16 examples. (Use whatever tools you like.)
3. Interpret the differences you see between the three different estimators.
Second, consider the Bayesian linear regression model, with data D = {(x_i, y_i)}_{i=1}^n, x_i ∈ R^m, y_i ∈ R, and generative model y_i ∼ N(w^T x_i, β^{-1}) for (known) precision β (which is just the reciprocal of the variance). Given this, the likelihood of the data is p(y | X, w) = N(y | Xw, β^{-1} I). Consider the special case of an isotropic (spherical) prior on the weights, p(w) = N(w | 0, α^{-1} I). This prior makes sense when you have little prior information and do not know much about the relationship among the features, so you can simplify by assuming independence.
4. Using the method in lecture of taking logs, expanding and pushing terms that don't depend on w into a constant, and finally collecting terms and completing the square, confirm that the posterior on the weights after data D is w ∼ N(w | m_n, S_n), where
S_n = (αI + βX^T X)^{-1},
m_n = βS_n X^T y.
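The following optional sketch (not part of the required written answers) shows one way to compute the quantities above numerically; the closed forms used for the Beta-Bernoulli MLE, MAP, and posterior predictive are the standard ones, and all variable and function names here are illustrative.

import numpy as np
from scipy.stats import beta as beta_dist

a, b = 4, 2                                    # Beta(4, 2) prior
D = np.array([0,0,1,1,0,0,0,0,1,0,1,1,1,0,1,0])

# Parts 1-2: the three running estimates and the posterior after n observations.
for n in range(1, len(D) + 1):
    k = D[:n].sum()                            # number of 1s seen so far
    mle  = k / n                               # maximum-likelihood estimate
    mapv = (k + a - 1) / (n + a + b - 2)       # MAP estimate (posterior mode)
    pred = (k + a) / (n + a + b)               # posterior predictive p(x = 1 | D)
    print(n, mle, mapv, pred)                  # the values you would plot against n

theta = np.linspace(0, 1, 200)
for n in (0, 4, 8, 12, 16):
    k = D[:n].sum()
    density = beta_dist.pdf(theta, a + k, b + n - k)   # posterior is Beta(a + k, b + n - k)
    # plot density against theta with whatever tool you like

# Part 4: the closed-form posterior on the weights for Bayesian linear regression,
# assuming X stacks the inputs x_i as rows (shape n x m) and y has shape (n,).
def posterior_weights(X, y, alpha, beta):
    S_n = np.linalg.inv(alpha * np.eye(X.shape[1]) + beta * X.T @ X)
    m_n = beta * S_n @ (X.T @ y)
    return m_n, S_n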
Problem 2 (Return of matrix calculus, 10pts) Consider now a generative c-class model. We adopt class prior p(y = C_k; π) = π_k for all k ∈ {1, ..., c} (where π_k is a parameter of the prior). Let p(x | y = C_k) denote the class-conditional density of features x (in this case for class C_k). Consider the data set D = {(x_i, y_i)}_{i=1}^n where, as above, y_i ∈ {C_k}_{k=1}^c is encoded as a one-hot target vector.
1. Write out the negated log-likelihood of the data set, −ln p(D; π).
2. Since the prior forms a distribution, it has the constraint that Σ_k π_k − 1 = 0. Using the hint on Lagrange multipliers below, give the expression for the maximum-likelihood estimator for the prior class-membership probabilities, i.e. π̂_k. Make sure to write out the intermediary equation you need to solve to obtain this estimator. Double-check your answer: the final result should be very intuitive!
For the remaining questions, let the class-conditional probabilities be Gaussian distributions with the same covariance matrix, p(x | y = C_k) = N(x | µ_k, Σ) for k ∈ {1, ..., c}, and different means µ_k for each class.
3. Derive the gradient of the negative log-likelihood with respect to the vector µ_k. Write the expression in matrix form as a function of the variables defined throughout this exercise. Simplify as much as possible for full credit.
4. Derive the maximum-likelihood estimator for vector µk. Once again, your final answer should seem intuitive.
5. Derive the gradient for the negative log-likelihood with respect to the covariance matrix Σ (i.e., looking to find an MLE for the covariance). Since you are differentiating with respect to a matrix, the resulting expression should be a matrix!
6. Derive the maximum likelihood estimator of the covariance matrix.
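As an optional numerical aid (not part of the required derivations), the sketch below evaluates the negative log-likelihood of the shared-covariance model so that you can compare finite differences of it against the gradients you derive in parts 3 and 5. The function and argument names are our own, illustrative choices.

import numpy as np
from scipy.stats import multivariate_normal

def neg_log_likelihood(X, Y_onehot, pi, mus, Sigma):
    # X: (n, d) inputs; Y_onehot: (n, c) one-hot targets;
    # pi: (c,) class priors; mus: (c, d) class means; Sigma: (d, d) shared covariance.
    nll = 0.0
    for k in range(Y_onehot.shape[1]):
        idx = Y_onehot[:, k] == 1
        if idx.any():
            log_px = multivariate_normal.logpdf(X[idx], mean=mus[k], cov=Sigma)
            nll -= np.sum(np.log(pi[k]) + log_px)
    return nll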
Hint: Lagrange Multipliers. Lagrange Multipliers are a method for optimizing a function f with respect to an equality constraint, i.e.
min_x f(x)  s.t.  g(x) = 0.
This can be turned into an unconstrained problem by introducing a Lagrange multiplier λ and constructing the Lagrangian function, L(x,λ) = f(x) + λg(x).
It can be shown that it is a necessary condition that the optimum is a critical point of this new function. We can find this point by solving two equations:
∂L(x, λ)/∂x = 0  and  ∂L(x, λ)/∂λ = 0.
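For example (purely illustrative, not part of the graded problem): to minimize f(x_1, x_2) = x_1^2 + x_2^2 subject to g(x_1, x_2) = x_1 + x_2 − 1 = 0, form L(x, λ) = x_1^2 + x_2^2 + λ(x_1 + x_2 − 1). Setting ∂L/∂x_1 = 2x_1 + λ = 0, ∂L/∂x_2 = 2x_2 + λ = 0, and ∂L/∂λ = x_1 + x_2 − 1 = 0 gives x_1 = x_2 = 1/2 (with λ = −1).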
Cookbook formulas. Here are some formulas you might want to consider using to compute difficult gradients. You can use them in the homework without proof. If you are looking to hone your matrix calculus skills, try to find different ways to prove these formulas yourself (will not be part of the evaluation of this homework). In general, you can use any formula from the Matrix Cookbook, as long as you cite it. We opt for the following common notation: X^{-T} := (X^T)^{-1}.
∂(a^T X^{-1} b)/∂X = −X^{-T} a b^T X^{-T}
∂ ln|det(X)|/∂X = X^{-T}
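If you would like to convince yourself of, e.g., the log-determinant identity numerically (entirely optional; the test matrix and tolerance below are arbitrary choices of ours), a quick finite-difference check looks like this:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
X = A @ A.T + 3 * np.eye(3)            # a positive-definite matrix, so det(X) > 0
eps = 1e-6

grad_fd = np.zeros_like(X)
for i in range(3):
    for j in range(3):
        E = np.zeros_like(X)
        E[i, j] = eps
        grad_fd[i, j] = (np.log(np.linalg.det(X + E))
                         - np.log(np.linalg.det(X - E))) / (2 * eps)

print(np.allclose(grad_fd, np.linalg.inv(X).T, atol=1e-5))   # expected: True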
3. Classifying Fruit [15pts]
You’re tasked with classifying three different kinds of fruit, based on their heights and widths. Figure 1 is a plot of the data. Iain Murray collected these data and you can read more about this on his website at http://homepages.inf.ed.ac.uk/imurray2/teaching/oranges_and_lemons/. We have made a slightly simplified (collapsing the subcategories together) version of this available as fruit.csv, which you will find in the Github repository. The file has three columns: type (1=apple, 2=orange, 3=lemon), width, and height. The first few lines look like this:
fruit,width,height
1,8.4,7.3
1,8,6.8
1,7.4,7.2
1,7.1,7.8
...
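One possible way to load the file with numpy (the column layout follows the header above; this is a hedged sketch, not part of the distribution code):

import numpy as np

data = np.genfromtxt("fruit.csv", delimiter=",", skip_header=1)
y = data[:, 0].astype(int)             # 1 = apple, 2 = orange, 3 = lemon
X = data[:, 1:]                        # width (cm), height (cm)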
[Figure 1 appears here: a scatter plot with width (cm) and height (cm) axes, showing the Apples, Oranges, and Lemons classes.]
Figure 1: Heights and widths of apples, oranges, and lemons. These fruit were purchased and measured by Iain Murray: http://homepages.inf.ed.ac.uk/imurray2/teaching/oranges_and_lemons/.
Problem 3 (Classifying Fruit, 15pts) You should implement the following:
• The three-class generalization of logistic regression, also known as softmax regression, for these data. You will do this by implementing gradient descent on the negative log-likelihood. You will need to find good values for the learning rate η and regularization strength λ. See the third practice problem in the Section 3 notes for information about multi-class logistic regression, softmax, and the negative log-likelihood. (A minimal gradient-descent sketch follows this list.)
• A generative classifier with Gaussian class-conditional densities, as in Problem 2. In particular, make two implementations of this: one with a shared covariance matrix across all of the classes, and one with a separate covariance matrix learned for each class. Note that the staff implementation can switch between these two by the addition of just a few lines of code. In the separate covariance matrix case, the MLE for the covariance matrix of each class is simply the covariance of the data points assigned to that class, without combining them as in the shared case.
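A minimal sketch of the gradient-descent update for regularized softmax regression (assuming one-hot targets and no bias term; the function names, default η, λ, and iteration count are our own placeholders, not the distribution code's interface):

import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)          # subtract the row max for numerical stability
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

def fit_softmax(X, Y_onehot, eta=1e-3, lam=1e-2, iters=10000):
    W = np.zeros((X.shape[1], Y_onehot.shape[1]))
    for _ in range(iters):
        P = softmax(X @ W)                        # (n, c) predicted class probabilities
        grad = X.T @ (P - Y_onehot) + lam * W     # gradient of the NLL plus the L2 penalty
        W -= eta * grad
    return W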
You may use anything in numpy or scipy, except for scipy.optimize. That being said, if you happen to find a function in numpy or scipy that seems like it is doing too much for you, run it by a staff member on Piazza. In general, linear algebra and random variable functions are fine. The controller file is problem3.py, in which you will specify hyperparameters. The actual implementations you will write will be in LogisticRegression.py and GaussianGenerativeModel.py.
You will be given class interfaces for GaussianGenerativeModel and LogisticRegression in the distribution code, and the code will indicate certain lines that you should not change in your final submission. Naturally, don't change these. These classes will allow the final submissions to have consistency. There are also a few hyperparameters that are currently set to irrelevant values; you may need to modify these to get your methods to work. The classes you implement follow the same pattern as scikit-learn, so they should be familiar to you. The distribution code currently outputs nonsense predictions just to show what the high-level interface should be, so you should completely remove the given predict() implementations and replace them with your own.
• The visualize() method for each classifier will save a plot showing the decision boundaries. Include these plots in this assignment.
• Which classifiers model the distributions well?
• What explains the differences?
In addition to comparing the decision boundaries of the three models visually:
• For logistic regression, plot the negative log-likelihood loss for several configurations of hyperparameters, with iterations on the x-axis and loss on the y-axis. Note which configuration yields the best final loss. Why are your final choices of learning rate (η) and regularization strength (λ) reasonable? How does altering these hyperparameters affect convergence? Focus both on the ability to converge and the rate at which it converges (a qualitative description is sufficient).
• For both Gaussian generative models, report the negative log-likelihood. In the separate covariance matrix case, be sure to use the covariance matrix that matches the true class of each data point. (A minimal sketch of the shared- and separate-covariance MLE fits follows this list.)
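A minimal sketch of the shared- vs. separate-covariance maximum-likelihood fits (assuming integer class labels y in {0, ..., c−1}; this is only the core math with illustrative names, not the GaussianGenerativeModel interface):

import numpy as np

def fit_gaussian_generative(X, y, c, shared_covariance=True):
    n = X.shape[0]
    pi = np.array([(y == k).mean() for k in range(c)])            # class priors
    mus = np.array([X[y == k].mean(axis=0) for k in range(c)])    # class means
    if shared_covariance:
        scatter = sum((X[y == k] - mus[k]).T @ (X[y == k] - mus[k]) for k in range(c))
        Sigmas = [scatter / n] * c                                # one pooled covariance
    else:
        Sigmas = [np.cov(X[y == k].T, bias=True) for k in range(c)]   # per-class covariances
    return pi, mus, Sigmas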
Finally, consider a fruit with width 4 cm and height 11 cm. To which class does each of the classifiers assign this fruit? What do these results tell you about the classifiers' ability to generalize to new data? Repeat for a fruit of width 8.5 cm and height 7 cm.
Calibration [1pt]
Approximately how long did this homework take you to complete?
