EECS 445: Introduction to Machine Learning
Department of Electrical Engineering and Computer Science

Homework 4: Expectation Maximization

1 Expectation Maximization [15 pts]
Suppose we have three cities A, B, and C, each with an unknown probability of sunny weather: $\theta_A$, $\theta_B$, and $\theta_C$, respectively. To predict the weather of each city, we conduct an experiment. The experiment consists of $N$ trials. For each trial, we choose one city from A, B, and C uniformly at random, detect the weather of the chosen city over $M$ days, and record the sequence of sunny/rainy outcomes. On the $i$-th trial, we define:
• $z^{(i)}$ is the city chosen for this trial, $z^{(i)} \in \{A, B, C\}$
• $x_j^{(i)}$ is the $j$-th detection for this trial, $x_j^{(i)} \in \{S, R\}$
• $s_i$ is the number of sunny days recorded on this trial.
The data are generated as follows:

for trial $i \in \{1, \ldots, N\}$:
    $z^{(i)} \sim \text{Uniform}\{A, B, C\}$
    for detection $j \in \{1, \ldots, M\}$:
        $x_j^{(i)} \sim \text{Bernoulli}(\theta_{z^{(i)}})$, where $\theta_{z^{(i)}}$ is the probability of sunny weather.
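For concreteness, here is a minimal NumPy sketch of this generative process; the $\theta$ values in it are invented for illustration only and are unrelated to the unknown true parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 6, 10
cities = ["A", "B", "C"]
theta = {"A": 0.3, "B": 0.8, "C": 0.5}                   # hypothetical values, for illustration only

for i in range(1, N + 1):
    z = rng.choice(cities)                               # z^(i) ~ Uniform{A, B, C}
    x = "".join("S" if rng.random() < theta[z] else "R"  # x_j^(i) ~ Bernoulli(theta_{z^(i)})
                for _ in range(M))
    print(i, z, x)
```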
The results of our experiment (with $N = 6$ trials and $M = 10$ detections per trial) are shown below.

Trial $i$    City $z^{(i)}$    Detections $\bar{x}^{(i)} = [x_1^{(i)}, \ldots, x_M^{(i)}]$
    1             A            SRRRRSRRSR
    2             B            SSRSRSSSSR
    3             C            RSSSSRRSRS
    4             B            SSSSSSSSRS
    5             C            SRRSSRRSSR
    6             A            RRSRRSRRRR
1.0 Visualizing Bayesian Networks
(a) (1 pt) Draw the associated Bayesian Network describing this generative model (using plate notation).
(b) (1 pt) What are the learnable parameters associated with this generative model?
(c) (1 pt) What are the latent (unobserved) variables of this generative model? What do they represent?
1.1 Complete-Data Case
In this section, we consider the scenario in which we know both the resulting sequences and which city was
chosen on each trial.
(a) (2 pt) Write down an expression for the complete-data log-likelihood $\log p(\bar{x}^{(i)}, z^{(i)}; \bar{\theta})$ of a single trial, in terms of $s_i$, $M$, and $\theta_{z^{(i)}}$.
(b) (1 pt) Given the data from all $N$ trials, $\mathcal{X} = \{\bar{x}^{(i)}\}_{i=1}^{N}$ and $\mathcal{Z} = \{z^{(i)}\}_{i=1}^{N}$, write down an expression for the complete-data log-likelihood $\log p(\mathcal{X}, \mathcal{Z}; \bar{\theta})$.
(c) (2 pt) Using the complete-data log-likelihood, derive expressions for the maximum-likelihood estimates of $\theta_A$, $\theta_B$, and $\theta_C$, and report their values for the sample data. In words, describe what these estimates correspond to.
1.2 Classification-Maximization
Now, assume the city indicators $z^{(i)}$ are unobserved (i.e., the city column in the table at the beginning of this problem is not available). That is, assume our experiment works as follows: you are a student stuck in BBB doing the 445 project. Your teammate decides to go to the three cities A, B, and C with equal probability, but he will not tell you which city he goes to. On each trial, he randomly chooses a city, detects its weather $M$ times, and records the sequence. The data are now “incomplete.”
The Classification-Maximization algorithm estimates the model parameters by alternating between two steps: 1) determining the city assignment for each trial, and 2) maximizing the likelihood. More precisely, after randomly initializing the parameters $\bar{\theta}^{(0)}$, it alternates between the following two steps:

C-step: Compute $\hat{z}^{(i)}_{(t+1)} = \arg\max_{k \in \{A,B,C\}} p(z^{(i)} = k \mid \bar{x}^{(i)}; \bar{\theta}^{(t)})$ for $i = 1, \ldots, N$ (ties broken arbitrarily).

M-step: Update $\bar{\theta}^{(t+1)} = \arg\max_{\bar{\theta}} \prod_{i=1}^{N} p(\bar{x}^{(i)}, \hat{z}^{(i)}_{(t+1)}; \bar{\theta})$ (maximum-likelihood estimation given the fixed city assignments).
For this section, your task is to derive a Classification-Maximization procedure for the city detection experiment.
(a) (2 pt) Write down an expression for $p(z^{(i)} = k \mid \bar{x}^{(i)}; \bar{\theta}^{(t)})$, the probability that trial $i$ took place in city $k \in \{A, B, C\}$ given the data $\bar{x}^{(i)}$ and the current parameter estimate $\bar{\theta}^{(t)}$. Express your answer in terms of $s_i$, $M$, and the model parameters $\theta_A^{(t)}$, $\theta_B^{(t)}$, $\theta_C^{(t)}$.
(b) (2 pt) Perform three iterations of Classification-Maximization for the sample data (either with code or by hand) and report your values of $p(z^{(i)} = k \mid \bar{x}^{(i)}; \bar{\theta}^{(t)})$, $\hat{z}^{(i)}_{(t)}$, and $\bar{\theta}^{(t)}$ at each step. Initialize $\bar{\theta}^{(0)} = (0.1, 0.8, 0.5)$. In the case of a tie, choose the lexicographically smaller city, i.e., A < B < C.
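If you choose the code route for part (b), a minimal sketch of one possible implementation is given below, assuming NumPy; the variable names and loop structure are our own, while the data, the uniform city prior, and the initialization $\bar{\theta}^{(0)} = (0.1, 0.8, 0.5)$ come from the problem statement.

```python
import numpy as np

sequences = ["SRRRRSRRSR", "SSRSRSSSSR", "RSSSSRRSRS",
             "SSSSSSSSRS", "SRRSSRRSSR", "RRSRRSRRRR"]
s = np.array([seq.count("S") for seq in sequences])    # sunny counts s_i
M = 10
theta = np.array([0.1, 0.8, 0.5])                      # theta^(0) = (theta_A, theta_B, theta_C)

for t in range(3):
    # C-step: posterior over cities (the uniform 1/3 prior cancels), then a hard
    # assignment; argmax breaks ties by taking the first index, which matches the
    # lexicographic rule A < B < C.
    lik = theta[None, :] ** s[:, None] * (1 - theta[None, :]) ** (M - s[:, None])
    post = lik / lik.sum(axis=1, keepdims=True)        # p(z^(i) = k | xbar^(i); theta^(t))
    z_hat = post.argmax(axis=1)                        # 0 = A, 1 = B, 2 = C

    # M-step: maximum-likelihood estimate of theta_k from the trials assigned to k.
    for k in range(3):
        assigned = (z_hat == k)
        if assigned.any():                             # leave theta_k unchanged if no trial is assigned to k
            theta[k] = s[assigned].sum() / (M * assigned.sum())
    print(f"t = {t + 1}: z_hat = {z_hat}, theta = {np.round(theta, 3)}")
```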
1.3 Expectation-Maximization
As you may have noticed, Classification-Maximization struggles to make a decision when there is more than one plausible assignment for an observed sequence, and it has no way of expressing its confidence about which city was chosen for a sequence. The Expectation-Maximization algorithm overcomes this limitation. Instead of using fixed assignments, it computes the probability $p(z^{(i)} = k \mid \bar{x}^{(i)}; \bar{\theta})$ that each point $\bar{x}^{(i)}$ belongs to each distribution $z^{(i)} = k \in \{A, B, C\}$.
In discussion and in the above questions, we saw it was possible to write the complete-data log-likelihood
with indicator variables that indicate which point belongs to which distribution. Now we can replace the
indicator variables in the log-likelihood function with the probabilities above. This is analogous to going
from hard-assignments to soft-assignments.
In this case, the Expectation-Maximization algorithm repeats the following steps until convergence.
E-Step. Evaluate $p(z^{(i)} = k \mid \bar{x}^{(i)}; \bar{\theta}^{(t)})$ for each $i = 1, \ldots, N$ and each $k \in \{A, B, C\}$.

M-Step. Update
$$\bar{\theta}^{(t+1)} = \arg\max_{\bar{\theta}} \sum_{i=1}^{N} \sum_{k \in \{A,B,C\}} p(z^{(i)} = k \mid \bar{x}^{(i)}; \bar{\theta}^{(t)}) \log\left[\frac{1}{3}\,(\theta_k)^{s_i} (1 - \theta_k)^{M - s_i}\right]$$
Now, you will derive an instance of the Expectation-Maximization algorithm for our city detection experiment.
(a) (2 pt) Using the equations above, show that the update for $\theta_k$ is as follows:
$$\theta_k^{(t+1)} = \frac{\sum_{i=1}^{N} p(z^{(i)} = k \mid \bar{x}^{(i)}; \bar{\theta}^{(t)})\, s_i}{M \sum_{i=1}^{N} p(z^{(i)} = k \mid \bar{x}^{(i)}; \bar{\theta}^{(t)})}$$
(b) (1 pt) Describe in words how the parameters above are updated; that is, what are these equations
doing? Do they correspond to your intuition?
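For reference, the sketch below adapts the Classification-Maximization sketch from Section 1.2 to these EM updates, again assuming NumPy; it differs only in the M-step, where the hard assignments are replaced by the soft responsibilities and the closed-form update from part (a). The initialization here is arbitrary, not prescribed by the assignment.

```python
import numpy as np

sequences = ["SRRRRSRRSR", "SSRSRSSSSR", "RSSSSRRSRS",
             "SSSSSSSSRS", "SRRSSRRSSR", "RRSRRSRRRR"]
s = np.array([seq.count("S") for seq in sequences])    # sunny counts s_i
M = 10
theta = np.array([0.1, 0.8, 0.5])                      # an arbitrary initialization

for t in range(100):
    # E-step: soft assignments p(z^(i) = k | xbar^(i); theta^(t)); the 1/3 prior cancels.
    lik = theta[None, :] ** s[:, None] * (1 - theta[None, :]) ** (M - s[:, None])
    post = lik / lik.sum(axis=1, keepdims=True)        # shape (N, 3)

    # M-step: the closed-form update from part (a),
    # theta_k <- sum_i post[i, k] * s_i / (M * sum_i post[i, k]).
    theta = (post * s[:, None]).sum(axis=0) / (M * post.sum(axis=0))

print(np.round(theta, 3))
```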
2 Gaussian Mixture Models [5 pts]
Given a training set of one-dimensional examples $x \in \mathbb{R}$, we aim to estimate the parameters of a mixture of two Gaussians. The mixture distribution over $x$ is given by:
$$P(x; \theta) = \gamma_1 \mathcal{N}(x; \mu_1, \sigma_1^2) + \gamma_2 \mathcal{N}(x; \mu_2, \sigma_2^2)$$
We would like to estimate the parameters $\theta = \{\gamma_j, \mu_j, \sigma_j^2\}$ for $j = 1$ and $j = 2$. This would be easy if we knew which points in our dataset were sampled from which Gaussian. However, we do not have this information; the data are incomplete, so we turn once again to Expectation-Maximization (EM). Given a dataset
of 5 points $\{2, 3, 4, 6, 8\}$, we run EM for two iterations and plot $\gamma_1 \mathcal{N}(x; \mu_1, \sigma_1^2)$ as a function of $x$ with a solid line and $\gamma_2 \mathcal{N}(x; \mu_2, \sigma_2^2)$ with a dashed line. But we ran so many experiments that we've mixed up the figures. We have narrowed it down to three figures. We need your help to narrow it down further and order them.
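As a point of reference before examining the figures, here is a small sketch, assuming NumPy and Matplotlib, of how the two weighted components could be evaluated and plotted; the parameter values in it are placeholders, not the ones behind Figures I-III.

```python
import numpy as np
import matplotlib.pyplot as plt

def weighted_normal(x, gamma, mu, var):
    """gamma * N(x; mu, var) for scalar parameters and an array of x values."""
    return gamma * np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

x = np.linspace(0, 10, 400)
gamma1, mu1, var1 = 0.5, 3.0, 1.0                      # placeholder parameters for component 1
gamma2, mu2, var2 = 0.5, 7.0, 2.0                      # placeholder parameters for component 2

plt.plot(x, weighted_normal(x, gamma1, mu1, var1), "-")    # solid line: gamma_1 N(x; mu_1, sigma_1^2)
plt.plot(x, weighted_normal(x, gamma2, mu2, var2), "--")   # dashed line: gamma_2 N(x; mu_2, sigma_2^2)
plt.xlabel("x")
plt.show()
```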
[Figure I: the two mixture components plotted over $x \in [0, 10]$; vertical axis 0.00-0.22]
[Figure II: the two mixture components plotted over $x \in [0, 10]$; vertical axis 0.00-0.22]
[Figure III: the two mixture components plotted over $x \in [0, 10]$; vertical axis 0.00-0.22]
(a) (2 pt) Assign two of the figures above to the correct steps. Use the following tags: “Step 0: initialization,” “Step 1: after one EM iteration,” “erroneous.”
Figure I : Figure II : Figure III :
(b) (2 pt) Explain how the mixture for “Step 1” follows from the mixture in “Step 0”.
(c) (1 pt) In EM we consider soft assignments, i.e., we evaluate $p(z^{(i)} = j \mid x^{(i)}; \theta)$ for each $i = 1, \ldots, n$ and each $j \in \{1, 2\}$, where $z^{(i)}$ is the latent cluster assignment. If we were to reconsider hard assignments, how would the algorithm change? Specifically, how would we estimate $z^{(i)}$?
3 Bayesian Networks [5 pts]
We would like to modify the Hidden Markov Model by changing how the variables influence each other.
Figure 1 shows a typical example of an HMM.
Figure 1: Typical HMM
Instead of using HMMs, we will look at an alternative Bayesian Network that tries to solve the same problem of hidden variables and observations. The network is as follows:
Figure 2: Alternative to HMM
(a) (2 pt) Check the following independence conditions of the graph:
• Are $X_1$, $X_2$, and $X_3$ all independent of each other given $Z_1$ and $Z_2$ (i.e., $X_i \perp X_j \mid \{Z_1, Z_2\}$ for all $i$ and $j$, where $i \neq j$)?
• Is $Z_3$ independent of $Z_1$ given $Z_2$?
(b) (1 pt) Write down the factored joint probability distribution associated with the Bayesian Network
described in part a.
(c) (1 pt) If the random variables are all binary, how many parameters are there in the joint probability
distribution implied by your graph?
(d) (1 pt) Suppose we learned the parameters for the above model, in addition to the parameters for an HMM (in which the Z's represent hidden states and the X's represent observations). Applied to the training data, both models achieve the same log-likelihood. Which model should we use? Briefly explain your answer.
4 Another Bayesian Network [3 pts]
Consider the following Bayesian Network.
Figure 3: Bayesian network for humidity
(a) (1 pt) Write down the factored joint probability distribution associated with this Bayesian Network.
(b) (2 pt) Suppose we observe that it is humid (i.e., H = T). What is the probability that it is raining?
REMEMBER to submit your completed assignment to Gradescope by 11:59pm ET on Sunday,
December 8. Include all code as an appendix following your write-up.
