Homework 2: Bayesian Methods and Multiclass Classification
Introduction
This homework is about Bayesian methods and multiclass classification. In lecture we have primarily focused on binary classifiers trained to discriminate between two classes. In multiclass classification, we discriminate between three or more classes. We encourage you to first read the Bishop textbook coverage of these topics, particularly Section 4.2 (Probabilistic Generative Models) and Section 4.3 (Probabilistic Discriminative Models).

As usual, we imagine that we have the input matrix $X \in \mathbb{R}^{n \times m}$ (or perhaps the inputs have been mapped to some basis $\Phi$, without loss of generality), but our outputs are now "one-hot coded". What that means is that, if there are $c$ output classes, then rather than representing the output label $y$ as an integer $1, 2, \ldots, c$, we represent $y$ as a binary vector of length $c$. These vectors are zero in each component except for the one corresponding to the correct label, and that entry is one. So, if there are 7 classes and a particular datum has label 3, then the target vector would be $C_3 = [0, 0, 1, 0, 0, 0, 0]$. If there are $c$ classes, the set of possible outputs is $\{C_1, \ldots, C_c\} = \{C_k\}_{k=1}^{c}$. Throughout the assignment we will assume that the output $y \in \{C_k\}_{k=1}^{c}$.
The problem set has four problems:
• In the first problem, you will explore the properties of Bayesian estimation methods for the Bernoulli model, as well as the special case of Bayesian linear regression with a simple prior.
• In the second problem, you will explore the properties of the softmax function, which is central to the method of multiclass logistic regression.
• In the third problem, you will dive into matrix algebra and the methods behind generative multiclass classification. You will extend the discrete classifiers that we see in lecture to a Gaussian model.
• Finally, in the fourth problem, you will implement logistic regression as well as a generative classifier from close to scratch.
Problem 1 (Bayesian Methods, 10 pts) This question helps to build your understanding of the maximum-likelihood estimator (MLE) vs. the maximum a posteriori (MAP) estimator and the posterior predictive distribution, first in the Beta-Bernoulli model and then in the linear regression setting.
First consider the Beta-Bernoulli model (see Lecture 5).
1. Write down the expressions for the MLE, the MAP estimate, and the posterior predictive distribution. Then, for a prior $\theta \sim \text{Beta}(4, 2)$ on the parameter of the Bernoulli and data $D = (0,0,1,1,0,0,0,0,1,0,1,1,1,0,1,0)$, plot the three different estimates after each additional sample (a sketch of one way to set up this computation follows the problem).
2. Plot the posterior distribution (prior for 0 examples) on θ after 0, 4, 8, 12 and 16 examples. (Using whatever tools you like.)
3. Interpret the differences you see between the three different estimators.

Second, consider the Bayesian linear regression model, with data $D = \{(x_i, y_i)\}_{i=1}^{n}$, $x_i \in \mathbb{R}^m$, $y_i \in \mathbb{R}$, and generative model $y_i \sim \mathcal{N}(w^{\top} x_i, \beta^{-1})$ for (known) precision $\beta$ (which is just the reciprocal of the variance). Given this, the likelihood of the data is $p(y \mid X, w) = \mathcal{N}(y \mid Xw, \beta^{-1} I)$. Consider the special case of an isotropic (spherical) prior on the weights, $p(w) = \mathcal{N}(w \mid 0, \alpha^{-1} I)$.

4. Justify when you might use this prior in practice.
5. Using the method in lecture of taking logs, expanding and pushing terms that don't depend on $w$ into a constant, and finally collecting terms and completing the square, confirm that the posterior on the weights after observing data $D$ is $w \sim \mathcal{N}(w \mid m_n, S_n)$, where
$$S_n = (\alpha I + \beta X^{\top} X)^{-1}, \qquad m_n = \beta S_n X^{\top} y.$$
6. Derive the special case of the MAP estimator for this problem as the isotropic prior becomes arbitrarily weak. What does the MAP estimator reduce to?
7. What did we observe in lecture about this estimator for the case where the prior is neither weak nor strong?
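For reference, here is a minimal (and entirely optional) Python sketch of the computation in parts 1 and 2, using the standard closed-form Beta-Bernoulli results: the MLE is the empirical fraction of ones, the MAP estimate is the posterior mode, and the posterior predictive probability of a one is the posterior mean. The output file name is only illustrative.

import numpy as np
import matplotlib.pyplot as plt

# Beta(4, 2) prior on theta and the data sequence from part 1
a, b = 4.0, 2.0
D = np.array([0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0])

ns = np.arange(1, len(D) + 1)      # number of observations seen so far
heads = np.cumsum(D)               # number of ones seen so far

mle = heads / ns                               # MLE: empirical fraction of ones
map_est = (heads + a - 1) / (ns + a + b - 2)   # MAP: mode of Beta(a + heads, b + tails)
pred = (heads + a) / (ns + a + b)              # posterior predictive p(x_new = 1 | D)

plt.plot(ns, mle, label="MLE")
plt.plot(ns, map_est, label="MAP")
plt.plot(ns, pred, label="posterior predictive")
plt.xlabel("number of observations")
plt.ylabel("estimate of theta")
plt.legend()
plt.savefig("p1_estimates.png")    # illustrative output file name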
Problem 2 (Properties of Softmax, 8pts) We have explored logistic regression, which is a discriminative probabilistic model over two classes. For each input x, logistic regression outputs a probability of the class output y using the logistic sigmoid function.
The softmax transformation is an important generalization of the logistic sigmoid to the case of c classes. It takes as input a vector, and outputs a transformed vector of the same size,
$$\text{softmax}(z)_k = \frac{\exp(z_k)}{\sum_{\ell=1}^{c} \exp(z_\ell)}, \quad \text{for all } k.$$
Multiclass logistic regression uses the softmax transformation over vectors of size $c$. Let $\{w_\ell\} = \{w_1, \ldots, w_c\}$ denote the parameter vectors for each class. In particular, multiclass logistic regression defines the probability of class $k$ as
$$p(y = C_k \mid x; \{w_\ell\}) = \text{softmax}([w_1^{\top} x, \ldots, w_c^{\top} x])_k = \frac{\exp(w_k^{\top} x)}{\sum_{\ell=1}^{c} \exp(w_\ell^{\top} x)}.$$
As above, we are using $y = C_k$ to indicate the output vector that represents class $k$. Assuming data $D = \{(x_i, y_i)\}_{i=1}^{n}$, the negated log-likelihood can be written in the standard form as
$$\mathcal{L}(\{w_\ell\}) = -\sum_{i=1}^{n} \ln p(y_i \mid x_i; \{w_\ell\}).$$
Softmax is an important function in the context of machine learning, and you will see it again in other models, such as neural networks. In this problem, we aim to gain intuition about the properties of softmax and multiclass logistic regression.
Show that:
1. The output of the softmax function is a vector with non-negative components that are at most 1.
2. The output of the softmax function defines a probability distribution: in addition to being non-negative, the components sum to 1.
3. Softmax preserves order: if $z_k < z_\ell$ for elements of $z$, then $\text{softmax}(z)_k < \text{softmax}(z)_\ell$, for any $k, \ell$.
4. Show that
$$\frac{\partial\, \text{softmax}(z)_k}{\partial z_j} = \text{softmax}(z)_k \left( I_{kj} - \text{softmax}(z)_j \right) \quad \text{for any } k, j,$$
where the indicator $I_{kj} = 1$ if $k = j$ and $I_{kj} = 0$ otherwise.
5. Using your answer to the previous question, show that
$$\frac{\partial}{\partial w_k} \mathcal{L}(\{w_\ell\}) = \sum_{i=1}^{n} \left( p(y_i = C_k \mid x_i; \{w_\ell\}) - y_{ik} \right) x_i.$$
By the way, this may be useful for Problem 3!
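As an optional sanity check (not part of the required derivations), the softmax transformation and the gradient from part 5 can be written in a few lines of numpy. The array shapes and variable names here (W, X, Y, eta, lam) are illustrative conventions only, not part of any distribution code.

import numpy as np

def softmax(z):
    # Numerically stable softmax: subtracting max(z) does not change the output.
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def nll_gradient(W, X, Y):
    # Gradient of the negated log-likelihood from part 5.
    # W: (c, m) parameter vectors, one row per class
    # X: (n, m) inputs; Y: (n, c) one-hot targets
    # Returns a (c, m) matrix whose k-th row is sum_i (p(y_i = C_k | x_i) - y_ik) x_i.
    P = np.apply_along_axis(softmax, 1, X @ W.T)  # (n, c) class probabilities
    return (P - Y).T @ X

# One plain gradient-descent step, of the kind used in Problem 4's logistic regression
# (eta is the learning rate, lam an optional L2 regularization strength):
#   W = W - eta * (nll_gradient(W, X, Y) + lam * W)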
Problem 3 (Return of matrix calculus, 10pts) Consider now a generative $c$-class model. We adopt class prior $p(y = C_k; \pi) = \pi_k$ for all $k \in \{1, \ldots, c\}$ (where $\pi_k$ is a parameter of the prior). Let $p(x \mid y = C_k)$ denote the class-conditional density of features $x$ (in this case for class $C_k$). Consider the data set $D = \{(x_i, y_i)\}_{i=1}^{n}$ where, as above, $y_i \in \{C_k\}_{k=1}^{c}$ is encoded as a one-hot target vector.
1. Write out the negated log-likelihood of the data set, $-\ln p(D; \pi)$.
2. Since the prior forms a distribution, it has the constraint that $\sum_k \pi_k - 1 = 0$. Using the hint on Lagrange multipliers below, give the expression for the maximum-likelihood estimator for the prior class-membership probabilities, i.e. $\hat{\pi}_k$. Make sure to write out the intermediary equation you need to solve to obtain this estimator. Double-check your answer: the final result should be very intuitive!
For the remaining questions, let the class-conditional probabilities be Gaussian distributions with the same covariance matrix,
$$p(x \mid y = C_k) = \mathcal{N}(x \mid \mu_k, \Sigma) \quad \text{for } k \in \{1, \ldots, c\},$$
and a different mean $\mu_k$ for each class.
3. Derive the gradient of the negative log-likelihood with respect to the vector $\mu_k$. Write the expression in matrix form as a function of the variables defined throughout this exercise. Simplify as much as possible for full credit.
4. Derive the maximum-likelihood estimator for vector µk. Once again, your final answer should seem intuitive.
5. Derive the gradient for the negative log-likelihood with respect to the covariance matrix Σ (i.e., looking to find an MLE for the covariance). Since you are differentiating with respect to a matrix, the resulting expression should be a matrix!
6. Derive the maximum likelihood estimator of the covariance matrix.
[Hint: Lagrange Multipliers. Lagrange Multipliers are a method for optimizing a function f with respect to an equality constraint, i.e.
$$\min_x f(x) \quad \text{s.t.} \quad g(x) = 0.$$
This can be turned into an unconstrained problem by introducing a Lagrange multiplier $\lambda$ and constructing the Lagrangian function
$$L(x, \lambda) = f(x) + \lambda g(x).$$
It can be shown that a necessary condition for the optimum is that it be a critical point of this new function. We can find this point by solving two equations:
$$\frac{\partial L(x, \lambda)}{\partial x} = 0 \quad \text{and} \quad \frac{\partial L(x, \lambda)}{\partial \lambda} = 0.$$
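For example (a generic illustration, separate from the likelihood in this problem): to minimize $f(x) = x_1^2 + x_2^2$ subject to $g(x) = x_1 + x_2 - 1 = 0$, form $L(x, \lambda) = x_1^2 + x_2^2 + \lambda(x_1 + x_2 - 1)$. Setting $\partial L / \partial x_1 = 2x_1 + \lambda = 0$, $\partial L / \partial x_2 = 2x_2 + \lambda = 0$, and $\partial L / \partial \lambda = x_1 + x_2 - 1 = 0$ gives $x_1 = x_2 = 1/2$ (with $\lambda = -1$).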
Cookbook formulas. Here are some formulas you might want to consider using to compute difficult gradients. You can use them in the homework without proof. If you are looking to hone your matrix calculus skills, try to find different ways to prove these formulas yourself (this will not be part of the evaluation of this homework). In general, you can use any formula from the matrix cookbook, as long as you cite it. We opt for the following common notation: $X^{-\top} := (X^{\top})^{-1}$.
$$\frac{\partial\, a^{\top} X^{-1} b}{\partial X} = -X^{-\top} a b^{\top} X^{-\top} \qquad\qquad \frac{\partial \ln |\det(X)|}{\partial X} = X^{-\top}$$
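If you want to convince yourself of the second formula, a quick finite-difference check in numpy might look like the following (purely optional; the test matrix and tolerance are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 3)) + 3 * np.eye(3)   # an arbitrary, well-conditioned test matrix

def f(M):
    return np.log(abs(np.linalg.det(M)))

analytic = np.linalg.inv(X).T                  # cookbook formula: d ln|det(X)| / dX = X^{-T}

eps = 1e-6
numeric = np.zeros_like(X)
for i in range(3):
    for j in range(3):
        E = np.zeros_like(X)
        E[i, j] = eps
        numeric[i, j] = (f(X + E) - f(X - E)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))      # should be tiny (around 1e-9)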
4. Classifying Fruit [15pts]
You’re tasked with classifying three different kinds of fruit, based on their heights and widths. Figure 1 is a plot of the data. Iain Murray collected these data and you can read more about this on his website at http://homepages.inf.ed.ac.uk/imurray2/teaching/oranges_and_lemons/. We have made a slightly simplified (collapsing the subcategories together) version of this available as fruit.csv, which you will find in the Github repository. The file has three columns: type (1=apple, 2=orange, 3=lemon), width, and height. The first few lines look like this:
fruit,width,height
1,8.4,7.3
1,8,6.8
1,7.4,7.2
1,7.1,7.8
...
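For instance, one minimal way to load these data (assuming fruit.csv sits in the working directory and has exactly the three columns shown above) is:

import numpy as np

# Skip the header row; columns are fruit label, width (cm), height (cm)
data = np.genfromtxt("fruit.csv", delimiter=",", skip_header=1)
X = data[:, 1:3]             # width and height features
y = data[:, 0].astype(int)   # class labels: 1 = apple, 2 = orange, 3 = lemon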
Figure 1: Heights and widths of apples, oranges, and lemons (width in cm on the horizontal axis, height in cm on the vertical axis). These fruit were purchased and measured by Iain Murray: http://homepages.inf.ed.ac.uk/imurray2/teaching/oranges_and_lemons/.
Problem 4 (Classifying Fruit, 15pts) You should implement the following:
• The three-class generalization of logistic regression, also known as softmax regression, for these data. You will do this by implementing gradient descent on the negative log-likelihood. You will need to find good values for the learning rate η and regularization strength λ.
• A generative classifier with Gaussian class-conditional densities, as in Problem 3. In particular, make two implementations of this: one with a shared covariance matrix across all of the classes, and one with a separate covariance matrix learned for each class. Note that the staff implementation can switch between these two with the addition of just a few lines of code. In the separate covariance matrix case, the MLE for the covariance matrix of each class is simply the covariance of the data points assigned to that class, without combining them as in the shared case. (A rough sketch of this generative fit appears after this problem statement.)
You may use anything in numpy or scipy, except for scipy.optimize. That being said, if you happen to find a function in numpy or scipy that seems like it is doing too much for you, run it by a staff member on Piazza. In general, linear algebra and random variable functions are fine. The controller file is problem4.py, in which you will specify hyperparameters. The actual implementations you will write will be in LogisticRegression.py and GaussianGenerativeModel.py.
You will be given class interfaces for GaussianGenerativeModel and LogisticRegression in the distribution code, and the code will indicate certain lines that you should not change in your final submission. Naturally, don't change these. These classes will allow the final submissions to have consistency. There will also be a few hyperparameters that are set to irrelevant values at the moment; you may need to modify these to get your methods to work. The classes you implement follow the same pattern as scikit-learn, so they should be familiar to you. The distribution code currently outputs nonsense predictions just to show what the high-level interface should be, so you should completely remove the given predict() implementations and replace them with your own.
• The visualize() method for each classifier will save a plot showing the decision boundaries. You should include these plots in this assignment.
• Which classifiers model the distributions well?
• What explains the differences?
In addition to comparing the decision boundaries of the three models visually:
• For logistic regression, report the negative log-likelihood loss for several configurations of hyperparameters. Why are your final choices of learning rate (η) and regularization strength (λ) reasonable? Plot the loss during training for the best of these configurations, with iterations on the x-axis and loss on the y-axis (one way to do this is to add a method to the LogisticRegression class that displays the loss).
• For both Gaussian generative models, report the likelihood. In the separate covariance matrix case, be sure to use the covariance matrix that matches the true class of each data point.
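As a rough sketch of the generative fit referenced above (this is not the staff implementation; the constructor flag is_shared_covariance and the attribute names are assumptions made purely for illustration), the MLE quantities from Problem 3 could be computed along these lines:

import numpy as np

class GaussianGenerativeModel:
    def __init__(self, is_shared_covariance=False):
        # Hypothetical flag: True -> one covariance shared by all classes,
        # False -> a separate covariance per class.
        self.is_shared_covariance = is_shared_covariance

    def fit(self, X, y):
        self.classes = np.unique(y)
        n = len(y)
        self.pi = np.array([np.mean(y == k) for k in self.classes])          # MLE class priors
        self.mu = np.array([X[y == k].mean(axis=0) for k in self.classes])   # MLE class means

        # Per-class covariances of the centered points
        covs = []
        for idx, k in enumerate(self.classes):
            Xk = X[y == k] - self.mu[idx]
            covs.append(Xk.T @ Xk / len(Xk))

        if self.is_shared_covariance:
            # Shared case: pool the per-class scatter, weighted by class counts
            self.sigma = sum((y == k).sum() * covs[i]
                             for i, k in enumerate(self.classes)) / n
        else:
            self.sigma = covs
        return self

Here the shared covariance is simply the class-count-weighted average of the per-class covariances, while the separate-covariance case keeps the per-class estimates as they are.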
Calibration [1pt]
Approximately how long did this homework take you to complete?
