Bayesian Classification, Naive Bayes

Submit a PDF of your answers to Canvas.
1. Univariate Gaussian. Let $X \sim \mathcal{N}(\mu, \sigma^2)$.
a) For any $a, b \in \mathbb{R}$, $Y = aX + b$ is also Gaussian. If $X \sim \mathcal{N}(\mu, \sigma^2)$, what are the mean and variance of $Y$?
b) Given $X \sim \mathcal{N}(\mu, \sigma^2)$, find $a, b \in \mathbb{R}$ so that $aX + b \sim \mathcal{N}(0, 1)$.
c) Use the result above to generate 10,000 independent samples of a random variable $\sim \mathcal{N}(-5, 7)$. You may want to use the NumPy function np.random.randn(). Plot a histogram of the results.
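For reference, here is a minimal sketch of one way to do this in NumPy (assuming $\mathcal{N}(-5, 7)$ denotes variance 7, so the standard deviation is $\sqrt{7}$, and assuming matplotlib is available for the histogram):

    import numpy as np
    import matplotlib.pyplot as plt

    # N(0, 1) samples, scaled by sqrt(7) for variance 7 and shifted by -5 for the mean
    samples = np.sqrt(7) * np.random.randn(10_000) - 5

    plt.hist(samples, bins=50)
    plt.xlabel("value")
    plt.title("10,000 samples of N(-5, 7)")
    plt.show()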
2. Consider a random vector $x \in \mathbb{R}^{n \times 1}$ with mean $\mu_x \in \mathbb{R}^{n \times 1}$ and covariance $\Sigma_x$, a deterministic matrix $A \in \mathbb{R}^{m \times n}$, and a deterministic vector $c \in \mathbb{R}^{n \times 1}$.
a) Consider the random vector $y = A(x + c)$. Find an expression for $\mathbb{E}[y]$.
b) Find an expression for $\Sigma_y$ (the covariance matrix of $y$) in terms of $\Sigma_x$.
3. Naïve Bayes on MNIST. In this problem we will classify MNIST handwritten digits using MAP and Naïve Bayes. The class $y \in \{0, 1, \ldots, 9\}$ represents the true digit, while $x \in \{0, 1\}^{784}$ is the $28 \times 28$ black-and-white image that we'd like to classify.
Recall that if we know the posterior (i.e., the distribution of the class given the data, $p(y|x)$), we can use the Maximum A Posteriori (MAP) estimate as a classification rule:
$$\hat{y} = \arg\max_y \; p(y|x),$$
where $\hat{y}$ is our estimate of the class. Equivalently,
$$\hat{y} = \arg\max_y \; p(x|y)\,p(y),$$
which follows from Bayes' rule, since $p(y|x) = p(x|y)p(y)/p(x)$, and we can drop $p(x)$ from the denominator because it doesn't depend on $y$ and so doesn't change the arg max. The MAP classifier is very intuitive, as it aims to find the most likely class $y$ given the data $x$.
Before we can use the classifier, we need to (approximately) learn $p(x|y)$ and $p(y)$. Learning $p(y)$ is easy: we simply count the occurrences of each class in the training data.
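For example, a minimal sketch of this counting step (assuming the training labels are available as an integer NumPy array named train_labels, a hypothetical name):

    import numpy as np

    # Estimate p(y) for y = 0, ..., 9 by counting label occurrences in the training set.
    # train_labels is assumed to be an integer array of shape (60000,) with values 0-9.
    priors = np.bincount(train_labels, minlength=10) / len(train_labels)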
Learning $p(x|y)$ in general is hard, and even with 60k training examples, we don't seem to have enough data. Notice that when the images have binary values, there are $2^{784}$ possible images. It's difficult to estimate the probability of one out of $2^{784}$ possible images when we only have 60k training examples. There are in essence $2^{784}$ different parameters that we must estimate.
Naïve Bayes makes a very naïve assumption: all elements of $x$ are independent given the class. More precisely, Naïve Bayes assumes $p(x|y) \approx \prod_{i=1}^{n} p(x_i|y)$.
a) How many different parameters are needed to describe the distribution $p(x|y)$ if we assume $p(x|y) \approx \prod_{i=1}^{n} p(x_i|y)$? Assume each pixel $x_i$ only takes on values in $\{0, 1\}$.
b) Is the independence assumption $p(x|y) \approx p(x_1|y) \times p(x_2|y) \times \cdots$ a reasonable assumption for the MNIST dataset? Why or why not?
c) Note that $p(x_i|y)$ is the (marginal) probability that pixel $x_i$ is 0 or 1 given the class $y$. We can estimate $p(x_i = 0 \mid y = 9)$, for example, by counting the number of times pixel $i$ is equal to 0, vs. the number of times it is equal to 1, among the images that are labeled $y = 9$:
$$p(x_i = 0 \mid y = 9) = \frac{\text{count of examples of class 9 with pixel } i \text{ equal to 0}}{\text{count of examples of class 9}}. \quad (1)$$
Write code that estimates $p(x_i = 0|y)$ and $p(x_i = 1|y)$ by counting the occurrences of $x_i = 0$ and $x_i = 1$ for $y = 0, 1, \ldots, 9$ and for each pixel $i$. Store your estimates of $p(x_i = 0|y)$ and $p(x_i = 1|y)$, each in ten $28 \times 28$ matrices. Two tricks to help (see the sketch after this list):
• Sometimes the numerator in (1) will be zero. Add a small number (for example, 0.7) to both the numerator and denominator. This is called additive or Laplace smoothing.
• Store the log of the quantity above (to avoid multiplying small numbers when we multiply the marginal probabilities of each pixel in the next step).
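A possible sketch of this counting-with-smoothing step (assuming the binarized training images are in a NumPy array train_images of shape (60000, 784) and the labels in train_labels; both names are hypothetical):

    import numpy as np

    alpha = 0.7                        # additive (Laplace) smoothing constant from above
    log_p1 = np.zeros((10, 28, 28))    # log p(x_i = 1 | y), one 28x28 matrix per class
    log_p0 = np.zeros((10, 28, 28))    # log p(x_i = 0 | y), one 28x28 matrix per class

    for y in range(10):
        imgs = train_images[train_labels == y]   # all training images of class y
        n = imgs.shape[0]                        # count of examples of class y
        count1 = imgs.sum(axis=0)                # times each pixel equals 1
        count0 = n - count1                      # times each pixel equals 0
        # add alpha to numerator and denominator as in (1), then store the log
        log_p1[y] = np.log((count1 + alpha) / (n + alpha)).reshape(28, 28)
        log_p0[y] = np.log((count0 + alpha) / (n + alpha)).reshape(28, 28)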
d) For numerical stability, we can compute $\sum_i \log(p(x_i|y))$ instead of $\prod_i p(x_i|y)$, without changing the problem. For a new test image $x$, write code to output an estimate
$$\hat{y} = \arg\max_y \sum_{i=1}^{n} \log(p(x_i|y)).$$
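One minimal sketch of this step, continuing with the hypothetical log_p0 and log_p1 arrays from part (c) and a binarized test image x of shape (784,):

    import numpy as np

    def classify(x, log_p0, log_p1):
        """Return the MAP estimate of the digit for a binary test image x of shape (784,)."""
        img = x.reshape(28, 28)
        scores = np.zeros(10)
        for y in range(10):
            # sum_i log p(x_i | y): pick log p(x_i = 1 | y) where the pixel is 1,
            # and log p(x_i = 0 | y) where the pixel is 0
            scores[y] = np.sum(np.where(img == 1, log_p1[y], log_p0[y]))
        return int(np.argmax(scores))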
e) What is the classification error rate of your classifier on the 10k test images?
f) (optional) The digits are close to equally likely (i.e., $p(y = 0) \approx p(y = 1) \approx \cdots$), so we didn't incorporate the prior probability. Incorporate the prior probabilities, as estimated from the training data, into your classifier. How does the classification error rate change?
g) (optional) We now have a probabilistic model given by $p(x|y)$ that we can use to generate random MNIST images. Display a few random images using the Naïve Bayes model you derived above. Do the images look like real MNIST images?
h) (optional) Randomly permute the pixels of a few images. Are the permuted-pixel images recognizable by a human (display a few)? If you randomly permute the pixels and retrain your Naïve Bayes classifier, will the performance change? What does this say about our model?