MMSE, Conditional Independence and MNIST
Submit a PDF of your answers to Canvas.
1. Bayes for continuous feature vectors. Let $x \in \mathbb{R}^n \sim f(x)$ be a vector of continuous random variables, and let $y$ be a discrete random variable. Show from first principles that
$$p(y \mid x) = \frac{f(x \mid y)\,p(y)}{f(x)}.$$
Hint: Start with the definition of conditional probability,
$$P(A \mid B) = \frac{P(A, B)}{P(B)},$$
and use the definition of a multivariate pdf,
$$f(x) = \lim_{\Delta \to 0} \frac{P(x_1 < X_1 \le x_1 + \Delta,\ x_2 < X_2 \le x_2 + \Delta,\ \ldots)}{\Delta^n}.$$
2. Regression with MMSE. Consider a feature vector $x \in \mathbb{R}^2$ and $y \in \mathbb{R}$ with joint pdf
$$p(x, y) = \begin{cases} 6 & \text{for } x_1, x_2, y \ge 0 \text{ and } x_1 + x_2 + y \le 1 \\ 0 & \text{otherwise.} \end{cases}$$
You aim to find a function $f(x)$ to estimate $y$.
a) Find the function $f(x)$ that minimizes $E[\ell(f(x), y)]$ under the squared error loss function.
b) What is the minimum true risk $E[\ell(f(x), y)]$ under the squared error loss function? You can leave your answer as an integral.
3. Bayes Nets on MNIST. Note that Activity 12 will be helpful for completing this problem. Recall the MAP classification rule:
$$\hat{y} = \arg\max_y\ p(x \mid y)\,p(y).$$
By repeated application of the product rule of probability, a distribution $p(x \mid y)$ can be written as
$$p(x \mid y) = p(x_1 \mid y)\,p(x_2 \mid x_1, y)\,p(x_3 \mid x_1, x_2, y)\,p(x_4 \mid x_1, x_2, x_3, y) \cdots p(x_n \mid x_1, \ldots, x_{n-1}, y).$$
Last time we used Naïve Bayes to estimate $p(x \mid y)$. Recall that the class $y \in \{0, 1, \ldots, 9\}$ represents the true digit, while $x \in \{0, 1\}^{784}$ is the $28 \times 28$ black and white image that we'd like to classify. Naïve Bayes makes the often poor assumption that
$$p(x \mid y) \approx \prod_{i=1}^{n} p(x_i \mid y).$$
In this problem we will make a compromise in order to model relationships between pixels. More specifically, we will estimate $p(x \mid y)$ by making a conditional independence assumption: we will assume that each pixel, conditioned on the values of the pixel immediately to its left and the pixel immediately above it, is independent of all other pixels with a lower index in the image.
a) Imagine the pixels are enumerated so that $x_1$ is the pixel in the upper left corner of the image, $x_2$ is the pixel below $x_1$ (in the first column and second row), and so on. Simplify the expression
$$p(x_{34} \mid x_1, x_2, \ldots, x_{33})$$
if the pixels are conditionally independent given their neighbors (the pixel immediately to the left and the pixel immediately above).
b) Find an expression for $p(x \mid y)$ given the conditional independence assumptions. You can ignore any discrepancy for edge cases (when $x_i$ corresponds to a pixel in the first column or first row).
c) How many parameters do we need to estimate? How does this compare to the Naïve Bayes case and to the general case?
d) We can estimate $p(x_i = 0 \mid x_j = 0, x_k = 0, y = 9)$ (for example) empirically by counting the number of times pixel $i$ is equal to 0 when $x_j = 0, x_k = 0$, vs. the number of times it is equal to 1 when $x_j = 0, x_k = 0$, among the images that are labeled $y = 9$:
$$p(x_i = 0 \mid x_j = 0, x_k = 0, y = 9) = \frac{1 + \text{count of examples with pixel } i \text{ equal to } 0 \text{ where } x_j = 0, x_k = 0 \text{ from class } 9}{2 + \text{count of examples where } x_j = 0, x_k = 0 \text{ from class } 9}.$$
The 1 in the numerator and the 2 in the denominator implement 'Laplace smoothing', which deals with cases where there are no corresponding examples by assigning them an uncommitted value of 1/2.
Write code that estimates $p(x_i = 0 \mid x_j, x_k, y)$ and $p(x_i = 1 \mid x_j, x_k, y)$ by counting the occurrences. Note that for each pixel, you will have to consider 80 cases: 2 values for the pixel itself, $x_i \in \{0, 1\}$; 4 cases for the conditionals, $(x_j, x_k) \in \{(0, 0), (0, 1), (1, 0), (1, 1)\}$; times 10 cases for $y = 0, 1, \ldots, 9$. Store your estimates of $p(x_i \mid x_j, x_k, y)$ in a $2 \times 10 \times 4 \times 28 \times 28$ array.
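As a reference point, here is a minimal sketch of one possible counting approach in NumPy. The array names train_images (shape (N, 28, 28), binary) and train_labels (shape (N,)) are assumptions, as is the choice to encode the neighbor configuration as 2*left + above and to treat missing neighbors in the first row or first column as 0, per the edge-case note in part b. It is an illustrative sketch, not the required solution.

import numpy as np

def estimate_conditionals(train_images, train_labels):
    # counts[v, y, c, r, col] = number of class-y training images in which the
    # pixel at (r, col) equals v and its (left, above) neighbors have
    # configuration c = 2*left + above.
    counts = np.zeros((2, 10, 4, 28, 28))

    for y in range(10):
        imgs = (train_images[train_labels == y]).astype(int)  # class-y images, 0/1 entries
        left = np.zeros_like(imgs)
        left[:, :, 1:] = imgs[:, :, :-1]              # pixel immediately to the left
        above = np.zeros_like(imgs)
        above[:, 1:, :] = imgs[:, :-1, :]             # pixel immediately above
        config = 2 * left + above                     # neighbor configuration in {0, 1, 2, 3}
        for c in range(4):
            mask = (config == c)
            counts[0, y, c] = (mask & (imgs == 0)).sum(axis=0)
            counts[1, y, c] = (mask & (imgs == 1)).sum(axis=0)

    # Laplace smoothing: add 1 to each numerator and 2 to each denominator.
    probs = (counts + 1) / (counts.sum(axis=0, keepdims=True) + 2)
    return probs                                      # shape (2, 10, 4, 28, 28)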
e) Write code that computes the log likelihood of a new image for each class $y = 0, 1, \ldots, 9$. Recall that the log likelihood is given by $\log p(x \mid y)$ when provided with a test image $x$.
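Continuing the sketch above (same assumed array names and the same 2*left + above neighbor encoding), the log likelihood of a single binary test image can be computed for all ten classes roughly as follows.

import numpy as np

def log_likelihoods(image, probs):
    # image: one binary 28x28 array; probs: the (2, 10, 4, 28, 28) array of
    # estimated conditionals from the previous part.
    img = np.asarray(image).astype(int)               # ensure 0/1 integer entries
    left = np.zeros_like(img)
    left[:, 1:] = img[:, :-1]                         # pixel immediately to the left
    above = np.zeros_like(img)
    above[1:, :] = img[:-1, :]                        # pixel immediately above
    config = 2 * left + above                         # neighbor configuration in {0, 1, 2, 3}
    rows, cols = np.indices(img.shape)

    lls = np.zeros(10)
    for y in range(10):
        # log p(x | y) = sum over pixels of log p(x_i | x_left, x_above, y)
        per_pixel = probs[img, y, config, rows, cols]
        lls[y] = np.log(per_pixel).sum()
    return lls                                        # lls[y] = log p(x | y)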
f) Use maximum likelihood and your estimate of the log likelihood to classify the test
images. What is the classification error rate of the maximum likelihood classifier
on the 10k test images?
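One way to finish, under the same assumptions (hypothetical test_images of shape (M, 28, 28), binary, and test_labels of shape (M,)), is sketched below: predict the class with the largest log likelihood and report the fraction of mistakes.

import numpy as np

def evaluate(test_images, test_labels, probs):
    # Maximum likelihood prediction: pick the class with the largest log p(x | y).
    predictions = np.array([np.argmax(log_likelihoods(img, probs))
                            for img in test_images])
    error_rate = np.mean(predictions != test_labels)  # fraction misclassified
    return predictions, error_rate

Adding log p(y) to each log likelihood before taking the argmax would give the MAP rule from the start of the problem; since the MNIST classes are roughly balanced, the two rules behave similarly here.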