
CSC311 
Homework 4

Submission: You need to submit the following files through MarkUs [1]:
• Your solutions to Questions 1, 2, 3 as a PDF file, hw4_writeup.pdf.
• Your completed Python code for Question 2, as q2.py.
Neatness Point: One point will be given for neatness. You will receive this point as long as we
don’t have a hard time reading your solutions or understanding the structure of your code.
Late Submission: 10% of the marks will be deducted for each day late, up to a maximum of 3
days. After that, no submissions will be accepted.
Computing: To install Python and required libraries, see the instructions on the course web page.
Homeworks are individual work. See the Course Information handout [2] for detailed policies.
1. [2pts] Marginalization and Log-Sum-Exp. In this question you will learn how to compute log marginal probabilities in a numerically stable way. Suppose that you have a generative model p(x, i) for labeled data (x, i), where i is a label that can be one of 0, 1, 2, ..., k.
Recall that the marginal probability p(x) can be computed using the following formula:
\[ p(x) = \sum_{i=0}^{k} p(x, i) \tag{0.1} \]
When x is a high-dimensional data point, it is typical for the marginal p(x) and the joint
p(x, i) to be extremely small numbers that cannot be represented in floating point. For this
reason, we usually report and compute with log probabilities.
If we want to compute log(p(x)) given access only to log(p(x, i)), then we can use what is called a log-sum-exp:
\[ \log\left( \sum_{i=0}^{k} \exp(a_i) \right) \tag{0.2} \]
where $a_i \in \mathbb{R}$ are real numbers. In our example, if $a_i = \log(p(x, i))$, then expression (0.2) corresponds to the log marginal probability log(p(x)) of x.
Unfortunately, computing log-sum-exp naively can lead to numerical instabilities. These instabilities arise when computing exponentials with floating point numbers. Two things can go wrong:
(a) Underflow. If a[i] is very small (a large negative number), then np.exp(a[i]) will evaluate to 0.
(b) Overflow. If a[i] is very large, then np.exp(a[i]) will evaluate to inf.
The cause of underflow and overflow is that floating point numbers cannot represent values arbitrarily close to 0 or arbitrarily large values.
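To make these failure modes concrete, here is a minimal sketch of a naive and a max-shifted log-sum-exp, together with an input that overflows the naive version. This is illustration only, not the code provided in q1.py; the function names are made up for the example.

    import numpy as np

    def logsumexp_naive(a):
        # Direct translation of expression (0.2); breaks for extreme inputs.
        return np.log(np.sum(np.exp(a)))

    def logsumexp_stable(a):
        # Identity (0.3): subtract the maximum before exponentiating, add it back after.
        a = np.asarray(a, dtype=float)
        m = np.max(a)
        return m + np.log(np.sum(np.exp(a - m)))

    big = np.array([1000.0, 1000.0, 1000.0])
    print(logsumexp_naive(big))   # inf, because np.exp(1000.0) overflows
    print(logsumexp_stable(big))  # about 1001.0986, i.e. 1000 + log(3)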
[1] https://markus.teach.cs.toronto.edu/csc311-2020-09
[2] http://www.cs.toronto.edu/~rgrosse/courses/csc311_f20/syllabus.pdf
(a) [1pt] We have provided code in q1.py with two implementations of log-sum-exp: a naive, numerically unstable implementation and a numerically stable one. Modify the elements of a so that logsumexp_unstable returns -inf, and modify the elements of b so that logsumexp_unstable returns inf. Report the two vectors, a and b, in your write-up.
(b) [1pt] Prove that our numerically stable implementation is correct by proving that
\[ \log\left( \sum_{i=0}^{k} \exp(a_i) \right) = \log\left( \sum_{i=0}^{k} \exp\left( a_i - \max_{0 \le j \le k} \{a_j\} \right) \right) + \max_{0 \le j \le k} \{a_j\} \tag{0.3} \]
Briefly explain why the numerically stable version is robust to underflow and overflow.
Report your answers to the above questions.
2. [3pts] Gaussian Discriminant Analysis. For this question you will build classifiers to
label images of handwritten digits. Each image is 8 by 8 pixels and is represented as a vector
of dimension 64 by listing all the pixel values in raster scan order. The images are grayscale
and the pixel values are between 0 and 1. The labels y are 0, 1, 2, . . . , 9 corresponding to
which character was written in the image. There are 700 training cases and 400 test cases for
each digit; they can be found in a4digits.zip.
A skeleton (q2.py) is provided for this question; use it to structure your code. Starter code to help you load the data is provided (data.py). Note: the get_digits_by_label function in data.py returns the subset of digits that belong to a given class.
Using maximum likelihood, fit a set of 10 class-conditional Gaussians with a separate, full
covariance matrix for each class. Remember that the conditional multivariate Gaussian probability density is given by,
\[ p(x \mid y = k, \mu, \Sigma_k) = (2\pi)^{-d/2} \, |\Sigma_k|^{-1/2} \exp\left( -\frac{1}{2} (x - \mu_k)^{T} \Sigma_k^{-1} (x - \mu_k) \right) \tag{0.4} \]
You should take $p(y = k) = \frac{1}{10}$. You will compute parameters $\mu_{kj}$ and $\Sigma_k$ for $k \in \{0, \dots, 9\}$, $j \in \{1, \dots, 64\}$. You should implement the covariance computation yourself (i.e., without the aid of np.cov). Hint: To ensure numerical stability you may have to add a small multiple of the identity to each covariance matrix. For this assignment you should add 0.01I to each matrix.
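As a rough sketch of this fitting step, the maximum likelihood fit with the 0.01I regularizer might look like the following. The function name fit_class_conditionals and the array shapes are assumptions for illustration; the q2.py skeleton defines its own structure.

    import numpy as np

    def fit_class_conditionals(X, y, n_classes=10, reg=0.01):
        # X: (N, 64) training images; y: (N,) integer labels in {0, ..., 9}.
        # Returns per-class means of shape (10, 64) and covariances of shape (10, 64, 64).
        d = X.shape[1]
        means = np.zeros((n_classes, d))
        covs = np.zeros((n_classes, d, d))
        for k in range(n_classes):
            Xk = X[y == k]                                 # training cases for digit k
            means[k] = Xk.mean(axis=0)
            centered = Xk - means[k]
            covs[k] = centered.T @ centered / Xk.shape[0]  # MLE covariance (divide by N_k)
            covs[k] += reg * np.eye(d)                     # add 0.01 * I for numerical stability
        return means, covs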
(a) [1pt] Using the parameters you fit on the training set and Bayes rule, compute the average conditional log-likelihood, i.e., $\frac{1}{N} \sum_{i=1}^{N} \log p(y^{(i)} \mid x^{(i)}, \theta)$, on both the train and test set and report it. Hint: you will want to use the log-sum-exp discussed in Question 1 in your code.
(b) [1pt] Select the most likely posterior class for each training and test data point as your
prediction, and report your accuracy on the train and test set.
(c) [1pt] Compute the leading eigenvectors (largest eigenvalue) for each class covariance matrix (you can use np.linalg.eig) and plot them side by side as 8 by 8 images; a minimal sketch of this step appears after this question.
Report your answers to the above questions, and submit your completed Python code for
q2.py.
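The following is a minimal sketch of the eigen-decomposition and plotting step for part (c). It assumes matplotlib is available and that covs holds the ten fitted 64-dimensional class covariances (as in the earlier sketch); the helper names are hypothetical.

    import numpy as np
    import matplotlib.pyplot as plt

    def leading_eigenvector(sigma):
        # np.linalg.eig on a symmetric matrix may return tiny imaginary parts,
        # so keep only the real components.
        vals, vecs = np.linalg.eig(sigma)
        vals, vecs = np.real(vals), np.real(vecs)
        return vecs[:, np.argmax(vals)]  # column for the largest eigenvalue

    def plot_leading_eigenvectors(covs):
        # covs: (10, 64, 64) array of fitted class covariances.
        fig, axes = plt.subplots(1, 10, figsize=(15, 2))
        for k, ax in enumerate(axes):
            ax.imshow(leading_eigenvector(covs[k]).reshape(8, 8), cmap="gray")
            ax.set_title(str(k))
            ax.axis("off")
        plt.show()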
3. [3pts] Categorical Distribution. In this problem you will consider a Bayesian approach to
modelling categorical outcomes. Let’s consider fitting the categorical distribution, which is
a discrete distribution over K outcomes, which we’ll number 1 through K. The probability
of each category is explicitly represented with parameter θk. For it to be a valid probability
distribution, we clearly need $\theta_k \geq 0$ and $\sum_k \theta_k = 1$. We'll represent each observation x as a 1-of-K encoding, i.e., a vector where one of the entries is 1 and the rest are 0. Under this
model, the probability of an observation can be written in the following form:
\[ p(x \mid \theta) = \prod_{k=1}^{K} \theta_k^{x_k}. \]
Suppose you observe a dataset,
\[ D = \{x^{(i)}\}_{i=1}^{N}. \]
Denote the count for outcome k as $N_k = \sum_{i=1}^{N} x_k^{(i)}$. Recall that each data point is in the 1-of-K encoding, i.e., $x_k^{(i)} = 1$ if the ith datapoint represents outcome k and $x_k^{(i)} = 0$ otherwise. In the previous assignment, you showed that the maximum likelihood estimate of the parameters is:
\[ \hat{\theta}_k = \frac{N_k}{N}. \]
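As a small sanity check of the counts and the maximum likelihood estimate recalled above, here is a toy numpy example; the data are made up for illustration, with K = 3 and N = 5.

    import numpy as np

    # Toy data: K = 3 outcomes, N = 5 observations in 1-of-K encoding.
    X = np.array([[1, 0, 0],
                  [0, 1, 0],
                  [0, 1, 0],
                  [0, 0, 1],
                  [0, 1, 0]])

    N_k = X.sum(axis=0)            # counts per outcome: [1, 3, 1]
    theta_ml = N_k / X.shape[0]    # maximum likelihood estimate: [0.2, 0.6, 0.2]
    print(N_k, theta_ml)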
(a) [1pt] For the prior, we'll use the Dirichlet distribution, which is defined over the set of probability vectors (i.e., vectors that are nonnegative and whose entries sum to 1). Its PDF is as follows:
\[ p(\theta) \propto \theta_1^{a_1 - 1} \cdots \theta_K^{a_K - 1}. \]
What is the posterior distribution $p(\theta \mid D)$?
(b) [1pt] Still assuming the Dirichlet prior distribution, determine the MAP estimate of the parameter vector $\theta$. For this question, you may assume each $a_k > 1$.
(c) [1pt] Now, suppose that your friend said that they had a hidden (N+1)st outcome, $x^{(N+1)}$, drawn from the same distribution as the previous N outcomes. Your friend does not want to reveal the value of $x^{(N+1)}$ to you. So, you want to use your Bayesian model to predict what you think $x^{(N+1)}$ is likely to be. The "proper" Bayesian predictor is the so-called posterior predictive distribution:
\[ p(x^{(N+1)} \mid D) = \int p(x^{(N+1)} \mid \theta) \, p(\theta \mid D) \, d\theta \]
What is the probability that the (N+1)st outcome was k, i.e., the probability that $x_k^{(N+1)} = 1$, under your posterior predictive distribution? Hint: A useful fact is that if $\theta \sim \mathrm{Dirichlet}(a_1, \dots, a_K)$, then
\[ \mathbb{E}[\theta_k] = \frac{a_k}{\sum_{k'} a_{k'}}. \]
Report your answers to the above questions.