CS 446 / ECE 449 — Homework 5
Version 1.0
Instructions.
• Everyone must submit individually on Gradescope under hw5 and hw5code.
• The “written” submission at hw5 must be typed, and submitted in any format Gradescope accepts (to
be safe, submit a PDF). You may use LaTeX, markdown, Google Docs, MS Word, whatever you like; but
it must be typed!
• When submitting at hw5, Gradescope will ask you to mark out boxes around each of your answers;
please do this precisely!
• Please make sure your NetID is clear and large on the first page of the homework.
• Your solution must be written in your own words.
• When submitting to hw5code, only upload hw5.py. Additional files will be ignored.
1. Expectation Maximization
In this question, we expect you to do some computation related to the EM algorithm covered in the
lecture.
Background. On the xy-plane, we have a rigid object, and our sensor can capture N key points, whose
coordinates are P = {p^(1), p^(2), ..., p^(N)}. (N is a sufficiently large integer.) An unknown force then causes
a translation T to this object, where T is encoded by T_x and T_y, meaning how far the object has moved
along the x-axis and the y-axis. To estimate the parameter T, we use our sensor to capture the key points on the
rigid object one more time, acquiring a set of N key points Q = {q^(1), q^(2), ..., q^(N)}. (See Fig. 1 for an
intuitive demonstration: the “L” is our rigid body, N = 3, and the blue and red dots are the key points.)

Assumption. During this process, we assume that a bijection between P and Q exists (see
Fig. 1; the key points are exactly the corners of the “L”). However, this bijection cannot be obtained directly
from the sensors, as the order of the points may be shuffled during perception.
Figure 1: The “L”-shaped rigid body with key points p^(1), p^(2), p^(3) (before the translation) and q^(1), q^(2), q^(3) (after the translation).
Objective. Use the EM algorithm to estimate the translation parameter and the corresponding pairs of key
points, by introducing binary hidden variables Z ∈ R^(N×N), where Z_ij = 1 means p^(i) and q^(j) are a match.
Remark. We have the following remarks on this question.
• This question set is meant to help you understand the process of EM through an intuitive example, without
the annoying mathematical equations. However, to make it easy to understand, we use additional
assumptions and simplifications. Please note the differences between this example and the rigorous EM
algorithm when you take more advanced machine learning courses in the future.
• You may find EM overkill for this simple question, and you are right about that. This question originates
from a classical problem in computer vision called “point cloud registration,” which can be much more
difficult when the sensor is noisy and the bijection between P and Q does not exist. You may look up
“iterative closest point” (ICP) if you are interested.
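To make the setup concrete, here is a small illustrative NumPy sketch (not part of the graded question) that builds a toy P and Q related by an unknown translation and a shuffled point order; the coordinates, seed, and variable names below are our own choices and are not given in the problem.

    import numpy as np

    rng = np.random.default_rng(0)

    # Three corners of an "L"-shaped rigid body (toy coordinates, N = 3).
    P = np.array([[0.0, 0.0],
                  [1.0, 0.0],
                  [0.0, 2.0]])

    T_true = np.array([0.5, -1.0])   # the unknown translation (Tx, Ty)

    # The sensor captures the translated key points again, but in a shuffled order,
    # so the bijection between P and Q is not directly observed.
    perm = rng.permutation(len(P))
    Q = (P + T_true)[perm]

    print("P:\n", P)
    print("Q:\n", Q)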
(a) Joint Probability. If we know a pair of matched points (p^(i), q^(j)), think of it as a single data
sample, with the corresponding hidden state Z_ij = 1. Intuitively, based on this single pair, how likely
the parameter T is depends on ∥p^(i) + T − q^(j)∥₂. To make the math easier, we assume ∥p + T − q∥₂
follows the Gaussian distribution N(0, σ), where σ is known, i.e.,

    P_T((p^(i), q^(j)) | Z_ij = 1) = 1 / (√(2π) σ) · exp( −∥p^(i) + T − q^(j)∥₂² / (2σ²) )        (1)

(To avoid confusion in the notation, please use P to represent probability.)
But we actually don’t know which points are a match; that is, we do not know the hidden states Z,
and we want to use EM to help. We define the matching to be optimized from P to Q, and not the
reverse. In this way, the matching of each point p ∈ P is independent, and two points p ∈ P
can match to the same q ∈ Q, but not the reverse.

Let P(Z_ij = 1) := Π_ij denote the prior probability of p^(i) and q^(j) being a pair, where Π ∈ R^(N×N)
are unknown parameters. Under our assumption, 0 ≤ Π_ij ≤ 1 and Σ_{k=1}^{N} Π_ik = 1 for all i ∈ {1, 2, ..., N}.
Let us denote the overall parameters (T, Π) by ψ, as in class.
Given a point p^(i), it has N possible matches q^(j), j = 1, ..., N. The probability contributed by p^(i)
in the complete-data likelihood is

    P(p^(i), Q, Z_{i,:}) = P(p^(i), Q | Z_{i,:}) P(Z_{i,:}) = ∏_{j=1}^{N} [ P_ψ((p^(i), q^(j)) | Z_ij = 1) P(Z_ij = 1) ]^{Z_ij}        (2)
Please write the log-likelihood log P_ψ(P, Q, Z) under the following two cases.

i. General Case: Write out the general formula for log P_ψ(P, Q, Z), following the “Log-likelihood of
complete data” on Slide 7/27 of Lecture 17.

ii. Special Case: Z_ii = 1 for i ∈ {1, 2, ..., N}. To get full credit for this question, please fully expand
the Gaussian distributions.
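As a quick numerical illustration of Eqs. (1) and (2) (not a substitute for the written answers above), the sketch below evaluates the Gaussian pair likelihood of Eq. (1) for one candidate translation; the specific values of σ, T, p^(i), and q^(j) are made up for this example.

    import numpy as np

    sigma = 0.5
    T = np.array([0.5, -1.0])        # a candidate translation
    p_i = np.array([0.0, 0.0])       # one key point from P
    q_j = np.array([0.4, -0.9])      # one key point from Q

    # Eq. (1): P_T((p_i, q_j) | Z_ij = 1), a Gaussian density in the residual p_i + T - q_j.
    resid = p_i + T - q_j
    lik = np.exp(-np.sum(resid ** 2) / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    print(lik)

    # Eq. (2) then multiplies such terms (together with the priors P(Z_ij = 1)) over the
    # candidate matches j, with the exponent Z_ij selecting the matched pair.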
(b) Expectation. (1/2) Having done the preparation work, we now start to apply the EM algorithm to
this question. In the E-step, our first objective is to compute R = [r_ij] ∈ R^(N×N), where
r_ij = P_{ψ^(t)}(Z_ij | p^(i), Q).

Following the procedure of “the first line of the E-step on Slide 8/27 of Lecture 17,” answer the
following questions.

i. General Case: Derive the formula for r_ij. For simplicity of notation, you may also use N(·|·, ·),
as on the slides, to denote the Gaussian distribution.

ii. Special Case: For the N points in P, the first [N/2] points are matched with q^(1), and the rest are
matched with q^(2). Write the values of R.
(c) Expectation. (2/2) This question computes the value of Q(ψ | ψ^(t)). Please answer the following
questions.

i. General Case: Derive the formula for Q(ψ | ψ^(t)), following the procedure on Slide 8/27 of Lecture 17.
For simplicity of notation, you may also use N(·|·, ·) to denote the Gaussian distribution.

ii. Special Case: Under the same special case as in problem “Expectation (1/2),” fully expand
the formula for Q(ψ | ψ^(t)). To get full credit, your answer cannot contain the variables r_ij or the
Gaussian-distribution notation N(·|·, ·). (Use 0 × log 0 = 0 in your calculation.)
(d) Maximization. On the basis of the previous derivations, complete the maximization step and the update
rules. Similar to Slide 9/27 of Lecture 17, write out the formulas for Π^(t+1) and T^(t+1).

Hint. σ is fixed and you do not need to solve for it.

i. General Case: Write the formulas for Π^(t+1) and T^(t+1). You may use N, R, and the points in
P and Q.

ii. Special Case: Write the formula for T^(t+1) under the special case from question “Expectation (1/2).”
You may use N and the points in P and Q.
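For intuition only, and not as a substitute for the derivations requested in (a)-(d), here is a rough NumPy sketch of the E-step/M-step loop for this toy problem. It makes extra simplifying assumptions that are not part of the question: the prior Π is held fixed at the uniform value 1/N, and the M-step over T uses a few damped gradient-style steps rather than a closed-form update; all names and constants are our own.

    import numpy as np

    def em_translation(P, Q, sigma=0.5, n_outer=30, lr=0.5):
        """Toy E/M loop estimating a 2-D translation T aligning P to Q.

        Simplifications vs. the question: Pi is held uniform (1/N), and the
        M-step over T uses damped gradient-style steps, not a closed form.
        """
        N = len(P)
        T = np.zeros(2)
        for _ in range(n_outer):
            # E-step: responsibilities over candidate matches j for each point i,
            # proportional to the Gaussian likelihood of the residual p_i + T - q_j.
            d2 = ((P[:, None, :] + T - Q[None, :, :]) ** 2).sum(-1)   # (N, N)
            logits = -d2 / (2 * sigma ** 2)
            R = np.exp(logits - logits.max(axis=1, keepdims=True))
            R /= R.sum(axis=1, keepdims=True)

            # M-step (numerical): nudge T toward the responsibility-weighted residuals.
            for _ in range(20):
                resid = Q[None, :, :] - P[:, None, :] - T             # (N, N, 2)
                T = T + lr * (R[:, :, None] * resid).sum(axis=(0, 1)) / N
        return T, R

    # Tiny usage example with a known translation (toy data):
    P = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
    Q = P[[2, 0, 1]] + np.array([0.5, -1.0])   # translated by (0.5, -1.0), then shuffled
    T_hat, R = em_translation(P, Q)
    print(T_hat)                               # should land near [0.5, -1.0]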
Solution.
2. Variational Auto-Encoders
We are training a variational auto-encoder (VAE). It contains the following parts: the inputs are vectors x,
the latent vector is z, the encoder models the probability q_ϕ(z|x), and the decoder models p_θ(x|z). Based
on this notation, we will first look at several problems related to the structure of the variational auto-encoder.
(a) We assume the latent vector z ∈ R² follows a multivariate Gaussian distribution N. Please compute
the output dimension of the encoder q_ϕ(·) under the following cases and briefly explain why. (If
“output dimension” is not clear enough for you, think of it as “how many real numbers r ∈ R need
to be output for the sampling of latent vectors.”)
• We assume N is a multivariate Gaussian distribution with the identity matrix as the
covariance matrix.
• We assume N is a multivariate Gaussian distribution with a diagonal matrix as the
covariance matrix.
(b) We then consider problems related to the understanding of KL divergence.

i. Using the inequality log(x) ≤ x − 1, prove that D_KL(p(x), q(x)) ≥ 0 holds for two arbitrary
distributions p(x) and q(x).

ii. Consider a binary classification problem with input vectors x and labels y ∈ {0, 1}. The
distribution of the ground-truth label is denoted P(y). The expression for P(y) is given in Eq. 3,
where y_gt is the ground-truth label.

    P(y = y_gt) = 1,    P(y = 1 − y_gt) = 0        (3)

Suppose we are trying to predict the label of x with a linear model w and a sigmoid function; then
the distribution of y is denoted Q(y) and computed as in Eq. 4.

    Q(y = 0 | x) = 1 / (1 + exp(−w⊤x)),    Q(y = 1 | x) = exp(−w⊤x) / (1 + exp(−w⊤x))        (4)

With the above information, compute the KL divergence between the distributions P(y) and
Q(y|x), specifically D_KL(P(y), Q(y|x)) = E_{y∼P(y)}[ log( P(y) / Q(y|x) ) ].

Expand your solution to the clearest form. To get full credit, you may only use y_gt, w, x, and
related constants in your expression.
(c) VAE is a special branch of generative methods in that it samples the latent vector z_e from q_ϕ(z|x) instead of
directly regressing the values of z. Read an example implementation of a VAE at https://github.com/
AntixK/PyTorch-VAE/blob/master/models/vanilla_vae.py and answer the following questions:

i. Find the functions and lines related to the sampling of z_e from q_ϕ(z|x). Specifying the names of
the functions and the related lines is enough for full credit. Please note that if your range is too
broad (in the extreme case, covering every line in the file), we cannot give you full credit.
ii. Suppose our latent variable z ∈ R² is sampled from a Gaussian distribution with mean µ ∈ R²
and a diagonal covariance matrix Σ = Diag{σ₁², σ₂²}. Another random variable v ∈ R² is
sampled from the Gaussian distribution N(0, I). Show that V = [σ₁, σ₂]⊤ ◦ v + µ follows the same
distribution as z. (◦ denotes the Hadamard product, i.e., element-wise product; N(0, I)
denotes the multivariate Gaussian with zero mean and the identity matrix as covariance. A short
illustrative code sketch of this sampling appears after part iv.)
iii. Under the same setting as Question ii, we can sample the latent vector z_e by the process
z_e = [σ₁, σ₂]⊤ ◦ v_e + µ, where v_e is a random variable sampled from N(0, I). Consider the training
process, in which we apply back-propagation to train the neural networks. Given the gradient
on z_e as g_e ∈ R², which can be written as [g_e1, g_e2], what are the gradients of the outputs of
the encoder, µ, σ₁, σ₂? (Assume the KL-divergence loss is not considered in this part.)

Note: To get full credit, you may use any constants and the variables v_e = [v_e1, v_e2], g_e = [g_e1, g_e2],
and µ, σ₁, σ₂.
iv. While reading the code, you might have felt confused about why we sample z_e in this way,
instead of generating a random value directly. By now, you should have some clues. Please briefly
explain why we sample z_e with N(0, I) instead of directly generating the values.
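Below is a minimal PyTorch sketch of the reparameterized sampling described in Questions ii and iii, for illustration only; the specific tensor values and the use of a log-variance head are our own assumptions (mirroring common VAE implementations), not something specified in the question.

    import torch

    # Hypothetical 2-D encoder outputs (values chosen arbitrarily for illustration).
    mu = torch.tensor([0.3, -1.2], requires_grad=True)
    log_var = torch.tensor([0.1, -0.5], requires_grad=True)   # encodes log(sigma^2)
    sigma = torch.exp(0.5 * log_var)

    # Reparameterization: z_e = [sigma_1, sigma_2] ∘ v_e + mu, with v_e ~ N(0, I).
    v_e = torch.randn(2)
    z_e = sigma * v_e + mu

    # A downstream (decoder) loss can now back-propagate into mu and log_var,
    # because the randomness is isolated in v_e.
    loss = z_e.sum()          # stand-in for a reconstruction loss
    loss.backward()
    print(mu.grad, log_var.grad)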
Solution.
3. Generative Adversarial Networks
Let’s implement a Generative Adversarial Network (GAN) to create images of hand-written digits!

A GAN consists of two parts: a generator network G and a discriminator network D. G is expected to
generate a fake image from a random latent variable z, and D is expected to distinguish fake images from
real images. G and D are trained jointly with a minimax objective. In this question, we will use training
data from MNIST to train our GAN and let it produce some fake images that look like hand-written
digits.
(a) First, let’s implement the Discriminator network. It should take 32 × 32 gray-scale images as input,
and output the probability of each image being a real one. Its architecture is summarized in Table 1.
Table 1: Discriminator Architecture

    Index  Layer Type  In Channels  Out Channels  Kernel Size  Stride  Padding
    1      Conv2d      1            16            3            1       1
    2      LeakyReLU   -            -             -            -       -
    3      MaxPool     -            -              2            2       0
    4      Conv2d      16           32            3            1       1
    5      LeakyReLU   -            -             -            -       -
    6      MaxPool     -            -              2            2       0
    7      Conv2d      32           64            3            1       1
    8      LeakyReLU   -            -             -            -       -
    9      MaxPool     -            -              2            2       0
    10     Conv2d      64           128           3            1       1
    11     LeakyReLU   -            -             -            -       -
    12     MaxPool     -            -              4            4       0
    13     Linear      128          1             -            -       -
    14     Sigmoid     -            -             -            -       -
A few notes:
• All Conv2d and Linear layers have bias terms. You do not have to explicitly set Conv2d(...,
bias=True), since that is the default in PyTorch.
• Also, you do not need to explicitly initialize the weights in the Conv2d and Linear layers. The default
initialization by PyTorch is good enough.
• LeakyReLU is a variant of the ReLU activation that has a smaller gradient for negative inputs.
Set negative_slope=0.2 for all LeakyReLU layers. More information about LeakyReLU is available at
https://pytorch.org/docs/stable/generated/torch.nn.LeakyReLU.html.
• You will need to reshape the tensor at some points in the forward pass.
• Given a batch of images with shape (batch_size, 1, 32, 32), the output of this network
should be a tensor with shape (batch_size), and the values in it should be floats in (0, 1).
Our autograder will only be able to check the shape and range of the output, so be careful even
if you have passed the test.
(b) Next, we can implement the Generator network. It should take 128-d vectors (sampled from a
Gaussian distribution) as input, and output fake images. Its architecture is summarized in Table 2.
We will make use of transposed convolutional layers. Given an input, a transposed convolutional
layer can produce an output with a higher resolution. Thus, we can generate a 32 × 32 image from a
vector by stacking such layers. A visualization of how transposed convolutional layers work can be
found at https://github.com/vdumoulin/conv_arithmetic/blob/master/README.md.
Table 2: Generator Architecture

    Index  Layer Type       In Channels  Out Channels  Kernel Size  Stride  Padding
    1      ConvTranspose2d  128          64            4            1       0
    2      LeakyReLU        -            -             -            -       -
    3      ConvTranspose2d  64           32            4            2       1
    4      LeakyReLU        -            -             -            -       -
    5      ConvTranspose2d  32           16            4            2       1
    6      LeakyReLU        -            -             -            -       -
    7      ConvTranspose2d  16           1             4            2       1
    8      Tanh             -            -             -            -       -
A few notes:
• Again, all ConvTranspose2d layers have bias terms and are initialized by the default setup.
• Use the same LeakyReLU as above, with negative_slope=0.2 for all LeakyReLU layers.
• You will need to reshape the tensor at some points in the forward pass.
• Given a batch of latent vectors with shape (batch_size, 128), the output of this network
should be a tensor with shape (batch_size, 1, 32, 32), and the values in it should be floats
in (−1, 1). Our autograder will only be able to check the shape and range of the output, so be
careful even if you have passed the test.
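To see how the stacked transposed convolutions in Table 2 grow a 1 × 1 input into a 32 × 32 image, here is a small shape-check sketch. It only illustrates the spatial sizes implied by the table; it is not the required Generator (activations, reshaping, and the final Tanh are omitted).

    import torch
    import torch.nn as nn

    # Latent vectors are viewed as (batch, 128, 1, 1) before the transposed convolutions.
    x = torch.randn(4, 128, 1, 1)
    layers = [
        nn.ConvTranspose2d(128, 64, kernel_size=4, stride=1, padding=0),  # 1x1   -> 4x4
        nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),   # 4x4   -> 8x8
        nn.ConvTranspose2d(32, 16, kernel_size=4, stride=2, padding=1),   # 8x8   -> 16x16
        nn.ConvTranspose2d(16, 1, kernel_size=4, stride=2, padding=1),    # 16x16 -> 32x32
    ]
    for layer in layers:
        x = layer(x)
        print(x.shape)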
(c) In class we have learned that to jointly train the generator and discriminator, we optimize them with
a minimax objective:
    V(G, D) := (1/N) Σ_{i=1}^{N} log D(x_i) + (1/N) Σ_{j=1}^{N} log(1 − D(G(z_j)))

    min_G max_D V(G, D)

Here N is the batch size (set to 64 in our implementation), x_i is a real image, z_j is a random latent
variable sampled from a Gaussian distribution, and G(z_j) is a fake image generated from it. Note
that we take an average to approximate the expectation, since we are using SGD to optimize G
and D.
Please complete the function calculate_V() in GAN. You may (but are not required to) use the binary
cross-entropy loss (see https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html)
to simplify the implementation, but be careful about the sign and the reduction method of BCELoss.
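If you go the BCELoss route, the short sketch below (with made-up probability values) illustrates how BCELoss relates to the two averaged log terms in V(G, D), including the sign flip and the default "mean" reduction.

    import torch
    import torch.nn as nn

    bce = nn.BCELoss(reduction="mean")

    p = torch.tensor([0.9, 0.2, 0.7])               # stand-in for D(x_i) or D(G(z_j))
    ones, zeros = torch.ones_like(p), torch.zeros_like(p)

    # BCELoss(p, 1) = -mean(log p) and BCELoss(p, 0) = -mean(log(1 - p)),
    # i.e., the negatives of the averaged terms appearing in V(G, D).
    print(bce(p, ones), -torch.log(p).mean())
    print(bce(p, zeros), -torch.log(1 - p).mean())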
(d) We are ready to start training our GAN. The training pipeline is already provided in train(), and
there is a visualize() function for your convenience. Train our GAN for 10 epochs, and include
the generated images after training in your PDF submission.
Notes from TA:
• Training 10 epochs takes me about an hour on my laptop without GPU support. I can see
interesting images after two or three epochs.
• You can make use of Google Colab (https://colab.research.google.com/), where you can
access GPUs for free to accelerate training. Remember to set Runtime -> Change runtime
type -> Hardware accelerator.
• Some random seeds may lead to degenerate results. It’s OK to try a few and manually set your
random seed (torch.manual_seed()).
Solution.