
ISYE 6740, Homework 3
100 points + 15 bonus points

1. Density estimation: Psychological experiments. (45 points)
We will use this data to study whether or not the two brain regions are likely to be independent of each other, also considering the different types of political views. For this question, you can use a standard package for the histogram and KDE; there is no need to write your own.
The data set n90pol.csv contains information on 90 university students who participated in
a psychological experiment designed to look for relationships between the size of different
regions of the brain and political views. The variables amygdala and acc indicate the volume
of two particular brain regions known to be involved in emotions and decision-making, the
amygdala and the anterior cingulate cortex; more exactly, these are residuals from the predicted volume, after adjusting for height, sex, and similar body-type variables. The variable
orientation gives the students’ locations on a five-point scale from 1 (very conservative) to 5
(very liberal). Note that in the dataset, we only have observations for orientation from 2 to
5.
(a) (10 points) Form 1-dimensional histograms and KDEs to estimate the distributions of amygdala and acc, respectively. For this question, you can ignore the variable orientation.
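For reference, a minimal sketch of this step in Python, using the file name n90pol.csv and the column names amygdala and acc from the description above (the bin count and scipy's default bandwidth are arbitrary choices you should tune):

```python
# Minimal sketch for part (a): 1-D histogram and KDE for amygdala and acc.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

df = pd.read_csv("n90pol.csv")
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, col in zip(axes, ["amygdala", "acc"]):
    x = df[col].to_numpy()
    ax.hist(x, bins=20, density=True, alpha=0.5, label="histogram")  # bin count: tune by eye
    grid = np.linspace(x.min(), x.max(), 200)
    ax.plot(grid, gaussian_kde(x)(grid), label="KDE")                # bandwidth: scipy's default rule
    ax.set_title(col)
    ax.legend()
plt.show()
```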
(b) (10 points) Form a 2-dimensional histogram for the pair of variables (amygdala, acc). Decide on a suitable number of bins so you can see the shape of the distribution clearly. Also use kernel density estimation (KDE) to estimate the 2-dimensional density function of (amygdala, acc). Use a simple multi-dimensional Gaussian kernel: for
$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \in \mathbb{R}^2,$$
where $x_1$ and $x_2$ are the two dimensions respectively,
$$K(x) = \frac{1}{2\pi} \, e^{-\frac{x_1^2 + x_2^2}{2}}.$$
Recall in this case, the kernel density estimator (KDE) for a density is given by
$$p(x) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{h} K\left(\frac{x^i - x}{h}\right),$$
where the $x^i$ are two-dimensional vectors and $h > 0$ is the kernel bandwidth. Set an appropriate $h$ so you can see the shape of the distribution clearly. Plot a contour plot (like the ones in the slides) for your estimated density. For this question, you can ignore the variable orientation.
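A possible sketch for the 2-D histogram and the KDE contour plot is below. Note that scipy's gaussian_kde uses a data-scaled Gaussian kernel rather than exactly the isotropic kernel $K$ above, so its bw_method argument only plays the role of the bandwidth $h$:

```python
# Sketch for part (b): 2-D histogram plus a Gaussian-KDE contour plot.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

df = pd.read_csv("n90pol.csv")
a, c = df["amygdala"].to_numpy(), df["acc"].to_numpy()

plt.hist2d(a, c, bins=15, density=True)               # choose bins so the shape is visible
plt.xlabel("amygdala"); plt.ylabel("acc"); plt.colorbar(); plt.show()

kde = gaussian_kde(np.vstack([a, c]), bw_method=0.3)  # bw_method acts like the bandwidth h
gx, gy = np.meshgrid(np.linspace(a.min(), a.max(), 100),
                     np.linspace(c.min(), c.max(), 100))
density = kde(np.vstack([gx.ravel(), gy.ravel()])).reshape(gx.shape)
plt.contour(gx, gy, density)
plt.xlabel("amygdala"); plt.ylabel("acc"); plt.show()
```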
(c) (10 points) Using (a) and (b) and your KDE estimators, verify whether or not the variables amygdala and acc are independent. You can tell this by checking whether we approximately have p(amygdala, acc) = p(amygdala)p(acc). To verify this, please show three plots: the map of p(amygdala, acc), the map of p(amygdala)p(acc), and the error map |p(amygdala, acc) − p(amygdala)p(acc)|. Comment on your results and on whether this helps us find out whether the two parts of the brain (for emotions and decision-making) function independently or are related.
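One way to produce the three maps is to evaluate the joint KDE and the product of the marginal KDEs on a common grid; the sketch below uses scipy's default bandwidths and a 100-by-100 grid, both arbitrary choices:

```python
# Sketch for part (c): joint KDE vs. product of marginal KDEs, plus the absolute error map.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

df = pd.read_csv("n90pol.csv")
a, c = df["amygdala"].to_numpy(), df["acc"].to_numpy()
x_lin = np.linspace(a.min(), a.max(), 100)
y_lin = np.linspace(c.min(), c.max(), 100)
gx, gy = np.meshgrid(x_lin, y_lin)

joint = gaussian_kde(np.vstack([a, c]))(np.vstack([gx.ravel(), gy.ravel()])).reshape(gx.shape)
product = np.outer(gaussian_kde(c)(y_lin), gaussian_kde(a)(x_lin))  # p(acc) * p(amygdala) on the grid

for title, z in [("p(amygdala, acc)", joint),
                 ("p(amygdala) p(acc)", product),
                 ("|joint - product|", np.abs(joint - product))]:
    plt.figure()
    plt.contourf(gx, gy, z)
    plt.colorbar(); plt.title(title)
    plt.xlabel("amygdala"); plt.ylabel("acc")
plt.show()
```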
(d) (5 points) Now we will consider the variable orientation. We will estimate the conditional distribution of the volume of the amygdala, conditioning on political orientation: p(amygdala|orientation = c), c = 2, . . . , 5. Do the same for the volume of the acc: plot p(acc|orientation = c), c = 2, . . . , 5. You will use KDE to achieve the goal. (Note that the conditional distribution can be understood as fitting a distribution to the data with the same (fixed) orientation. Thus, for each variable, there should be four one-dimensional conditional distributions to show for this question.)
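A minimal sketch of this step simply fits one 1-D KDE per orientation value:

```python
# Sketch for part (d): p(amygdala | orientation = c) and p(acc | orientation = c), c = 2,...,5.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

df = pd.read_csv("n90pol.csv")
for col in ["amygdala", "acc"]:
    plt.figure()
    grid = np.linspace(df[col].min(), df[col].max(), 200)
    for c in [2, 3, 4, 5]:
        x = df.loc[df["orientation"] == c, col].to_numpy()  # rows with this orientation only
        plt.plot(grid, gaussian_kde(x)(grid), label=f"orientation = {c}")
    plt.title(f"p({col} | orientation = c)")
    plt.legend()
plt.show()
```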
(e) (5 points) Again we will consider the variable orientation. We will estimate the conditional joint distribution of the volumes of the amygdala and acc, conditioning on political orientation: p(amygdala, acc|orientation = c), c = 2, . . . , 5. You will use two-dimensional KDE to achieve the goal.
(f) (5 points) Using (d) and (e), evaluate whether or not the two variables are likely to be conditionally independent. To verify this, please show three plots: the map of
p(amygdala, acc|orientation = c),
the map of
p(amygdala|orientation = c) p(acc|orientation = c),
and the error map
|p(amygdala, acc|orientation = c) − p(amygdala|orientation = c) p(acc|orientation = c)|,
for c = 2, . . . , 5. Comment on your results and on whether this helps us find out whether the two parts of the brain (for emotions and decision-making) function independently or are related, conditionally on political orientation (i.e., considering the different types of personality).
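A sketch covering parts (e) and (f) together: for each orientation value, fit the conditional joint KDE and the two conditional marginal KDEs on a common grid, and plot the three maps listed above:

```python
# Sketch for parts (e)-(f): conditional joint KDE vs. product of conditional marginals, per orientation.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

df = pd.read_csv("n90pol.csv")
x_lin = np.linspace(df["amygdala"].min(), df["amygdala"].max(), 100)
y_lin = np.linspace(df["acc"].min(), df["acc"].max(), 100)
gx, gy = np.meshgrid(x_lin, y_lin)

for c in [2, 3, 4, 5]:
    sub = df[df["orientation"] == c]
    a, v = sub["amygdala"].to_numpy(), sub["acc"].to_numpy()
    joint = gaussian_kde(np.vstack([a, v]))(np.vstack([gx.ravel(), gy.ravel()])).reshape(gx.shape)
    product = np.outer(gaussian_kde(v)(y_lin), gaussian_kde(a)(x_lin))
    fig, axes = plt.subplots(1, 3, figsize=(14, 4))
    for ax, title, z in zip(axes,
                            ["joint", "product of marginals", "abs error"],
                            [joint, product, np.abs(joint - product)]):
        im = ax.contourf(gx, gy, z)
        fig.colorbar(im, ax=ax)
        ax.set_title(f"{title}, orientation = {c}")
        ax.set_xlabel("amygdala"); ax.set_ylabel("acc")
plt.show()
```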
2. Implementing EM for MNIST dataset, with PCA for dimensionality reduction. (55 points)
Implement the EM algorithm for fitting a Gaussian mixture model for the MNIST dataset.
We reduce the dataset to only two classes, the digits "2" and "6". Thus, you will fit a GMM with C = 2. Use the data file data.mat or data.dat. True labels of the data are also provided in label.mat and label.dat.
The matrix images is of size 784-by-1990, i.e., there are 1990 images in total, and each column of the matrix corresponds to one image of size 28-by-28 pixels (the image is vectorized; the original image can be recovered by reshaping the vector into a matrix).
First use PCA to reduce the dimensionality of the data before applying EM. We will put all "6" and "2" digits together and project the original data into 5-dimensional vectors. Now implement the EM algorithm for the projected data (with 5 dimensions).
(a) (5 points) Select one raw image of "2" and one of "6" from the data and visualize them.
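A minimal sketch of the loading, visualization, and PCA steps is below; the variable keys inside data.mat and label.mat ("data" and "trueLabel") are assumptions, so inspect the dictionary returned by scipy.io.loadmat and adjust them to your copy of the files:

```python
# Sketch: load the data, show one raw "2" and one raw "6", and project to 5 PCA dimensions.
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import loadmat

images = loadmat("data.mat")["data"]                  # assumed key "data", shape 784 x 1990
labels = loadmat("label.mat")["trueLabel"].ravel()    # assumed key "trueLabel", values 2 and 6

# Part (a): reshape one column of each digit back to 28 x 28 and display it.
for i, digit in enumerate([2, 6]):
    idx = np.where(labels == digit)[0][0]
    plt.subplot(1, 2, i + 1)
    plt.imshow(images[:, idx].reshape(28, 28).T, cmap="gray")  # transpose if the image looks rotated
    plt.axis("off")
plt.show()

# PCA to 5 dimensions on all "2" and "6" digits together.
X = images.T.astype(float)                            # 1990 x 784, one image per row
mean = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
Z = (X - mean) @ Vt[:5].T                             # 1990 x 5 projected data used by EM below
```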
(b) (10 points) Write down the detailed expressions of the E-step and M-step in the EM algorithm (hint: when computing $\tau_k^i$, you can drop the $(2\pi)^{n/2}$ factor from the numerator and denominator expressions, since it will be canceled out; this can help avoid some numerical issues in computation).
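For reference, the standard GMM form of these updates is sketched below (your write-up should still derive them in your own notation), where $\tau_k^i$ is the responsibility of component $k$ for point $x^i$ and $\pi_k$ are the mixing weights:
$$\tau_k^i = \frac{\pi_k \, \mathcal{N}(x^i \mid \mu_k, \Sigma_k)}{\sum_{k'=1}^{C} \pi_{k'} \, \mathcal{N}(x^i \mid \mu_{k'}, \Sigma_{k'})} \qquad \text{(E-step)}$$
$$\pi_k = \frac{1}{m} \sum_{i=1}^{m} \tau_k^i, \qquad \mu_k = \frac{\sum_{i=1}^{m} \tau_k^i \, x^i}{\sum_{i=1}^{m} \tau_k^i}, \qquad \Sigma_k = \frac{\sum_{i=1}^{m} \tau_k^i \, (x^i - \mu_k)(x^i - \mu_k)^T}{\sum_{i=1}^{m} \tau_k^i} \qquad \text{(M-step)}$$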
(c) (15 points) Implement the EM algorithm yourself. Use the following initialization:
• initialization for the means: random Gaussian vectors with zero mean;
• initialization for the covariances: generate two Gaussian random matrices of size n-by-n, $S_1$ and $S_2$, and initialize the covariance matrices for the two components as $\Sigma_1 = S_1 S_1^T + I_n$ and $\Sigma_2 = S_2 S_2^T + I_n$, where $I_n$ is the identity matrix of size n-by-n.
Plot the log-likelihood function versus the number of iterations to show your algorithm
is converging.
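A sketch of one way to implement this loop, assuming Z is the m-by-5 matrix of PCA-projected points from the earlier sketch; the log-sum-exp step is one reasonable way to keep the responsibilities numerically stable:

```python
# Sketch of EM for a C-component GMM with the initialization described above.
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(Z, C=2, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    m, n = Z.shape
    pi = np.full(C, 1.0 / C)                          # mixing weights
    mu = rng.normal(size=(C, n))                      # means: random Gaussian vectors, zero mean
    Sigma = []
    for _ in range(C):                                # covariances: S S^T + I_n
        S = rng.normal(size=(n, n))
        Sigma.append(S @ S.T + np.eye(n))
    log_lik = []
    for _ in range(n_iter):
        # E-step: responsibilities tau[i, k], computed via log-sum-exp for stability
        log_p = np.column_stack([np.log(pi[k]) + multivariate_normal.logpdf(Z, mu[k], Sigma[k])
                                 for k in range(C)])
        log_norm = np.logaddexp.reduce(log_p, axis=1, keepdims=True)
        tau = np.exp(log_p - log_norm)
        log_lik.append(log_norm.sum())                # log-likelihood at the current parameters
        # M-step: re-estimate weights, means, and covariances
        Nk = tau.sum(axis=0)
        pi = Nk / m
        mu = (tau.T @ Z) / Nk[:, None]
        for k in range(C):
            D = Z - mu[k]
            Sigma[k] = (tau[:, k][:, None] * D).T @ D / Nk[k]
    return pi, mu, Sigma, tau, log_lik
```

Calling pi, mu, Sigma, tau, ll = em_gmm(Z) and plotting ll against the iteration index gives the requested convergence plot.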
(d) (15 points) Report the fitted GMM model when EM has terminated in your algorithm, as follows. Make sure to report the weight of each component and the mean of each component (you can reshape the vectors into 28-by-28 matrices and show them as images). Ideally, you should be able to see that these means correspond to "average" images. Report the two 784-by-784 covariance matrices by visualizing their intensities.
(e) (10 points) Use the $\tau_k^i$ to infer the labels of the images, and compare them with the true labels. Report the mis-classification rate for digits "2" and "6" respectively. Perform K-means clustering with K = 2 (you may call a package or use the code from your previous homework). Find the mis-classification rate for digits "2" and "6" respectively, and compare with the GMM. Which one achieves the better performance?
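A sketch of the label-inference and comparison step; matching each cluster to its majority digit is one simple convention, and the commented usage lines assume the tau, Z, and labels arrays from the sketches above:

```python
# Sketch: assign each image to the component with the larger responsibility,
# match components to digits by majority vote, and report per-digit error rates.
import numpy as np
from sklearn.cluster import KMeans

def misclassification_rates(assignments, labels, digits=(2, 6)):
    assignments = np.asarray(assignments)
    labels = np.asarray(labels)
    cluster_to_digit = {}
    for c in np.unique(assignments):
        values, counts = np.unique(labels[assignments == c], return_counts=True)
        cluster_to_digit[c] = values[np.argmax(counts)]      # majority digit of this cluster
    predicted = np.array([cluster_to_digit[c] for c in assignments])
    return {d: float(np.mean(predicted[labels == d] != d)) for d in digits}

# GMM: cluster assignments from the responsibilities tau (m x 2) of the EM sketch above
# gmm_rates = misclassification_rates(np.argmax(tau, axis=1), labels)
# K-means on the same 5-dimensional projected data Z, for comparison
# km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(Z)
# km_rates = misclassification_rates(km.labels_, labels)
```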
3. (Bonus) Implementing EM for MNIST dataset, with low-rank
approximation. (15 points)
In this question, we will implement the EM algorithm for fitting a Gaussian mixture model for
the same MNIST dataset, but use a different approach to deal with the high-dimensionality of
data. We will use the so-called low-rank approximation, which can be viewed as performing
“PCA” for each Gaussian component separately in the iterations. This question aims to
demonstrate the principle of low-rank approximation for high-dimensional data, which is a
very general principle in analyzing high-dimensional data.
Note that the pdf of the multivariate normal distribution is given by
$$\mathcal{N}(X \mid \mu, \Sigma) = \frac{1}{|\Sigma|^{1/2} (2\pi)^{d/2}} \exp\left(-\frac{1}{2} (X - \mu)^T \Sigma^{-1} (X - \mu)\right).$$
Low-rank approximation of the covariance matrix Σ means that we will represent it using only a relatively small number of eigenvectors (this corresponds to projecting the data into a low-dimensional subspace). For instance, the rank-r approximation to the covariance matrix is given by
$$\Sigma \approx U \Lambda U^T,$$
where $U \in \mathbb{R}^{n \times r}$ consists of the $r$ largest eigenvectors and $\Lambda \in \mathbb{R}^{r \times r}$ is a diagonal matrix consisting of the corresponding $r$ largest eigenvalues $\{\lambda_1, \ldots, \lambda_r\}$.
Using this approximation, we can write the factors in the pdf of the multivariate normal as follows:
$$|\Sigma| \approx \prod_{i=1}^{r} \lambda_i,$$
where the $\lambda_i$'s are the eigenvalues ranked from the largest to the smallest, and
$$(X - \mu)^T \Sigma^{-1} (X - \mu) = \left\| \Lambda^{-1/2} U^T (X - \mu) \right\|^2.$$
Thus, evaluating this quadratic term is easy: we first compute the projection $U^T (X - \mu)$ of the mean-subtracted data point onto the subspace formed by $U$, then evaluate $\Lambda^{-1/2} = \mathrm{diag}\{\lambda_1^{-1/2}, \ldots, \lambda_r^{-1/2}\}$, and finally multiply them together and compute the squared $\ell_2$ norm of the resulting vector.
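A sketch of this computation, assuming each covariance estimate is symmetric so that numpy's eigh returns its eigen-decomposition; the $(2\pi)^{d/2}$ constant cancels in the responsibilities, so its exact treatment matters little:

```python
# Sketch: log N(x | mu, Sigma) for each row of X, using only the r largest eigenpairs of Sigma.
import numpy as np

def lowrank_log_gauss(X, mu, Sigma, r=5):
    eigvals, eigvecs = np.linalg.eigh(Sigma)             # eigenvalues in ascending order
    lam = eigvals[-r:]                                   # r largest eigenvalues
    U = eigvecs[:, -r:]                                  # corresponding eigenvectors
    proj = (X - mu) @ U                                  # U^T (X - mu), one row per data point
    quad = np.sum((proj / np.sqrt(lam)) ** 2, axis=1)    # ||Lambda^{-1/2} U^T (X - mu)||^2
    log_det = np.sum(np.log(lam))                        # log|Sigma| ~ sum of the top-r log-eigenvalues
    return -0.5 * (quad + log_det + r * np.log(2 * np.pi))  # constant term cancels in tau anyway
```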
The point is that the low-rank approximation will (i) avoid numerical errors, by working with a low-rank approximation of the original covariance matrix, and (ii) speed up computation, because the matrix operations are now done at a much smaller scale.
The intuition behind the low-rank structure is the following: high-dimensional data usually lie close to a low-dimensional structure because there is "similarity" across the data points. In this example, all the hand-written digits "1" have similarities, and that is why they can be identified as 1. High-dimensional data rarely span the whole space (in that case, they tend to be meaningless or noise-like). So it is critical to find the low-dimensional space that they actually lie on and perform the projection. If we do not account for the low-dimensional structure in the data, we will run into numerical issues: the covariance matrix $\Sigma_k$ in the EM iterations will be rank-deficient and cannot be inverted (because it will have nearly zero eigenvalues and a large condition number, in the numerical linear algebra sense). This is also the reason that, in Question 2, we perform PCA as a pre-processing step to reduce the dimensionality and avoid this issue. The low-rank approximation you will use for this question will achieve better performance because you will perform an individualized low-rank approximation for each Gaussian component, which more precisely accounts for the data structure. You will show this point in a numerical study.
(a) (10 points) Implement the low-rank approximation (with r = 5) and perform the EM algorithm. Use a similar initialization method as in the last question. Plot the log-likelihood function versus the number of iterations to show your algorithm is converging.
(b) (5 points) Use the $\tau_k^i$ to infer the labels of the images, and compare them with the true labels. Report the mis-classification rate for digits "2" and "6" respectively. Compare with the GMM using PCA that you implemented in Question 2. Which one achieves better performance?