HW3: Variational Autoencoders
1 VAEs in 2D
1.1 Part A
In this problem, you will investigate when a VAE places information into its latent code. Run this
snippet to generate two datasets:
import numpy as np

def sample_data_1():
    # Axis-aligned Gaussian centered at (1, 2): std 5.0 along x, 1.0 along y.
    count = 100000
    rand = np.random.RandomState(0)
    return [[1.0, 2.0]] + rand.randn(count, 2) * [[5.0, 1.0]]

def sample_data_2():
    # The same anisotropic Gaussian noise, rotated by 45 degrees before the shift to (1, 2).
    count = 100000
    rand = np.random.RandomState(0)
    return [[1.0, 2.0]] + (rand.randn(count, 2) * [[5.0, 1.0]]).dot(
        [[np.sqrt(2) / 2, np.sqrt(2) / 2], [-np.sqrt(2) / 2, np.sqrt(2) / 2]])
Train one VAE in this configuration on both datasets:
• 2D latent variables z with a standard normal prior p(z) = N(z; 0, I).
• An approximate posterior qθ(z|x) = N(z; µθ(x), Σθ(x)), where µθ(x) is a mean vector and
Σθ(x) is a diagonal covariance matrix.
• A decoder pθ(x|z) = N(x; µθ(z), Σθ(z)), where µθ(z) is a mean vector and Σθ(z) is a
diagonal covariance matrix.
and train another in this configuration, also on both datasets:
• Prior and approximate posterior identical to above
• A decoder pθ(x|z) = N(x; µθ(z), σθ²(z)I), where µθ(z) is a mean vector and σθ²(z) is a scalar.
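
For concreteness, here is a minimal sketch of the first configuration, assuming PyTorch as the
framework (any framework works; the module and helper names are illustrative, not prescribed by
the assignment):

import math
import torch
import torch.nn as nn

class GaussianMLP(nn.Module):
    # Maps an input to the mean and log-std of a diagonal Gaussian.
    def __init__(self, in_dim, out_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * out_dim))

    def forward(self, x):
        mu, log_std = self.net(x).chunk(2, dim=-1)
        return mu, log_std

class VAE2D(nn.Module):
    def __init__(self, x_dim=2, z_dim=2):
        super().__init__()
        self.encoder = GaussianMLP(x_dim, z_dim)   # q(z|x)
        self.decoder = GaussianMLP(z_dim, x_dim)   # p(x|z)

    def elbo_terms(self, x):
        mu_z, log_std_z = self.encoder(x)
        z = mu_z + log_std_z.exp() * torch.randn_like(mu_z)       # reparameterization trick
        mu_x, log_std_x = self.decoder(z)
        # Reconstruction term: -log N(x; mu_x, diag(exp(log_std_x)^2)), summed over dimensions.
        recon = (0.5 * ((x - mu_x) / log_std_x.exp()) ** 2
                 + log_std_x + 0.5 * math.log(2 * math.pi)).sum(-1)
        # KL(q(z|x) || N(0, I)), closed form for diagonal Gaussians.
        kl = 0.5 * (mu_z ** 2 + (2 * log_std_z).exp() - 2 * log_std_z - 1).sum(-1)
        return recon.mean(), kl.mean()

The negative ELBO to minimize is recon + kl; the scalar-variance decoder of the second configuration
only changes the decoder head so that it outputs a single log-std shared across the output dimensions.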
Deliverables for all 4 choices of datasets and models:
1. Loss curves and final loss, in bits per dimension. Include separate curves and numbers for the
full ELBO, the KL term Ex[KL(q(z|x) ‖ p(z))], and the decoder term Ex Ez∼q(z|x)[− log p(x|z)].
(A sketch of the nats-to-bits/dim conversion appears after this list.)
2. Samples from the full generation path (z ∼ p(z), x ∼ p(x|z)) and without decoder noise
(z ∼ p(z), x = µ(z)). Draw both in the same plot.
3. Is the VAE using the latent code? How do you know? If it does, what about the data does it
capture qualitatively?
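
A common convention for the bits-per-dimension numbers above: take the average negative ELBO
(and likewise its two terms) in nats and divide by the data dimensionality and by ln 2. A minimal
sketch of the conversion (the helper name is illustrative):

import numpy as np

def nats_to_bits_per_dim(nats_per_example, n_dims=2):
    # e.g. an average negative ELBO of 3.1 nats on 2D data is
    # 3.1 / (2 * ln 2) ≈ 2.24 bits/dim.
    return nats_per_example / (n_dims * np.log(2))

Apply the same conversion to the KL and decoder terms so that all three curves share the same units.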
1.2 Part B
Run the following code to generate data for training. It will generate a dataset of labeled samples:
the first element of the returned tuple is the data in R², and the second element is the label in {0, 1, 2}.
You will train VAEs on unlabeled samples, and you will use the labels for visualization purposes
only. Take the first 80% of the samples as a training set and the remaining 20% as a test set.
import numpy as np

def sample_data_3():
    # Three clusters: two isotropic Gaussian blobs (labels 0 and 1) and a
    # noisy half-ellipse arc (label 2), shuffled together.
    count = 100000
    rand = np.random.RandomState(0)
    a = [[-1.5, 2.5]] + rand.randn(count // 3, 2) * 0.2
    b = [[1.5, 2.5]] + rand.randn(count // 3, 2) * 0.2
    c = np.c_[2 * np.cos(np.linspace(0, np.pi, count // 3)),
              -np.sin(np.linspace(0, np.pi, count // 3))]
    c += rand.randn(*c.shape) * 0.2
    data_x = np.concatenate([a, b, c], axis=0)
    data_y = np.array([0] * len(a) + [1] * len(b) + [2] * len(c))
    perm = rand.permutation(len(data_x))
    return data_x[perm], data_y[perm]
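
For example, the 80/20 split described above can be taken directly off the returned arrays (a short
usage sketch; variable names are illustrative):

data_x, data_y = sample_data_3()
n_train = int(0.8 * len(data_x))
train_x, test_x = data_x[:n_train], data_x[n_train:]
train_y, test_y = data_y[:n_train], data_y[n_train:]   # labels are used only for plotting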
Train a VAE with a 2-dimensional z ∼ p(z) = N(z; 0, I) and a Gaussian encoder and decoder
(with diagonal covariance matrices).
Provide these deliverables:
1. Loss curves and final loss, in bits per dimension. Include separate curves and numbers for the
full ELBO, the KL term, and the decoder term; do so on the train and validation sets. What
is the final test set performance (all three numbers)?
2. Display samples.
3. Display latents for the data. To do so, take labeled training data (x, y) and plot z ∼ qθ(z|x),
colored by the label y. Comment on the appearance of the latent space.
4. Pick the first 100 points in the test set. Evaluate and report the IWAE objective using 100
samples, and compare to the standard ELBO (i.e. IWAE with 1 sample).
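
For deliverable 4, one way to estimate the IWAE bound is to draw K posterior samples per data
point and average the importance weights inside the log. A minimal sketch, assuming the VAE2D
module sketched in Part A (the helper names are illustrative):

import math
import torch

def gaussian_log_prob(x, mu, log_std):
    # log N(x; mu, diag(exp(log_std)^2)), summed over the last dimension.
    return (-0.5 * ((x - mu) / log_std.exp()) ** 2
            - log_std - 0.5 * math.log(2 * math.pi)).sum(-1)

def iwae_bound(model, x, k=100):
    # x: (B, 2). Returns the average K-sample IWAE bound over the batch (in nats).
    mu_z, log_std_z = model.encoder(x)
    z = mu_z.unsqueeze(1) + log_std_z.exp().unsqueeze(1) * torch.randn(
        x.shape[0], k, mu_z.shape[-1])                                  # (B, K, zdim)
    log_q = gaussian_log_prob(z, mu_z.unsqueeze(1), log_std_z.unsqueeze(1))
    log_p_z = gaussian_log_prob(z, torch.zeros_like(z), torch.zeros_like(z))
    mu_x, log_std_x = model.decoder(z)
    log_p_x_given_z = gaussian_log_prob(x.unsqueeze(1), mu_x, log_std_x)
    log_w = log_p_x_given_z + log_p_z - log_q                           # (B, K)
    return (torch.logsumexp(log_w, dim=1) - math.log(k)).mean()

With k=1 this reduces to a single-sample estimate of the standard ELBO, which is the comparison
the deliverable asks for.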
2 High-dimensional data
In this question, you will train a Variational Autoencoder on the SVHN dataset. The SVHN dataset
is split into train, validation, and test sets of 65931, 7326, and 26032 samples, respectively.
Here is the recommended architecture (all convolutions use 'same' padding):
Residual_stack():
    for _ in range(5):
        relu(),
        conv2d(n_filters=64, kernel_size=(3, 3), strides=(1, 1)),
        relu(),
        conv2d(n_filters=64 * 2, kernel_size=(3, 3), strides=(1, 1)),
        gated_shortcut_connection(),  # https://arxiv.org/pdf/1612.08083.pdf
    relu()

Encoder():
    conv2d(n_filters=128, kernel_size=(4, 4), strides=(2, 2)),
    relu(),
    conv2d(n_filters=256, kernel_size=(4, 4), strides=(2, 2)),
    relu(),
    conv2d(n_filters=256, kernel_size=(3, 3), strides=(1, 1)),
    Residual_stack()

Decoder():
    conv2d(n_filters=256, kernel_size=(3, 3), strides=(1, 1)),
    Residual_stack(),
    conv2d_transpose(n_filters=128, kernel_size=(4, 4), strides=(2, 2)),
    relu(),
    conv2d_transpose(n_filters=3, kernel_size=(4, 4), strides=(2, 2))
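
The only non-standard piece above is gated_shortcut_connection(). One reading, following the cited
paper, is a gated linear unit on the residual branch: split the second convolution's output channels
into a value half and a gate half, multiply them, and add the result back to the block input. A
minimal sketch, assuming PyTorch (the channel counts here are adjusted so the gated output matches
the residual input; the listing's filter counts would otherwise need an extra projection):

import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedResidualBlock(nn.Module):
    # One block of Residual_stack(), reading gated_shortcut_connection() as a GLU.
    def __init__(self, in_channels=256, hidden=64):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, hidden, kernel_size=3, padding=1)      # 'same'
        self.conv2 = nn.Conv2d(hidden, 2 * in_channels, kernel_size=3, padding=1)  # 'same'

    def forward(self, x):
        h = self.conv2(F.relu(self.conv1(F.relu(x))))
        value, gate = h.chunk(2, dim=1)                   # split channels for the gate
        return F.relu(x + value * torch.sigmoid(gate))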
a. Train a VAE with a diagonal covariance Gaussian decoder. The architecture above outputs one
value per color channel; you will have to change it to output both the mean and variance of each
pixel. Choose the prior to be a standard normal Gaussian and the approximate posterior to be a
diagonal covariance Gaussian.

b. With the same encoder and decoder as in a., modify the prior to be an autoregressive flow. This
is explained precisely in [1] (Equations 12, 13, and 14). For reference, here is an implementation
recipe that is stable: model the autoregressive flow on the prior as a 2D PixelCNN over the 8x8
spatial dimensions of the latent. A shallow PixelCNN with a type-A 3x3 mask in the first layer,
followed by 3 residual layers with type-B masks (as in Homework 1), is used to produce the scale
and translate terms of the flow at each spatial location. Using an Adam optimizer with a learning
rate of 2e-4 and applying a tanh nonlinearity to the log-stds are recommended practices.
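
Concretely, with an autoregressive flow prior the only change to the ELBO is the log p(z) term:
invert the flow to get ε from z and add the log-determinant of the (triangular) Jacobian. A hedged
sketch, assuming a masked PixelCNN ar_net (type-A first layer, type-B residual layers) whose outputs
at each latent location depend only on previously ordered locations; ar_net and its interface are
illustrative, not specified by the assignment:

import math
import torch

def af_prior_log_prob(z, ar_net):
    # z: (B, C, 8, 8) latents. Returns log p(z) under the AF prior, shape (B,).
    mu, log_sigma = ar_net(z)                  # each (B, C, 8, 8), autoregressive in z
    log_sigma = torch.tanh(log_sigma)          # the recommended stabilization
    eps = (z - mu) * torch.exp(-log_sigma)     # invert the flow: eps = (z - mu) / sigma
    log_base = -0.5 * (eps ** 2 + math.log(2 * math.pi))   # log N(eps; 0, I), elementwise
    log_det = -log_sigma                       # d(eps)/d(z) contributes -log sigma per element
    return (log_base + log_det).flatten(1).sum(dim=1)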
For both of these models, report the following (and point out whether the autoregressive prior makes
a difference and, if it does, why):
1. Variational Lower Bound on train, validation and test (units: bits/dim) over the course of
training.
2. 100 samples from the trained models.
3. Interpolations between 5 pairs of samples from the test set.
3 Bonus
Implement a limited-receptive-field autoregressive PixelCNN decoder with a discretized mixture of
logistics output distribution. You will find [1] highly relevant.
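
For the mixture-of-logistics output, the log-likelihood of a pixel integrates each logistic component
over one quantization bin. A hedged sketch for a single channel, assuming targets scaled to [-1, 1]
with 256 bins (the per-channel conditioning of the full PixelCNN++ parameterization is omitted;
names are illustrative):

import torch
import torch.nn.functional as F

def discretized_logistic_mixture_logprob(x, logit_pi, mu, log_scale, num_bins=256):
    # x: targets in [-1, 1]; logit_pi, mu, log_scale carry a trailing mixture dimension K.
    x = x.unsqueeze(-1)                                      # broadcast targets over K
    inv_s = torch.exp(-log_scale)
    half_bin = 1.0 / (num_bins - 1)
    cdf_plus = torch.sigmoid(inv_s * (x - mu + half_bin))    # logistic CDF at the bin's upper edge
    cdf_minus = torch.sigmoid(inv_s * (x - mu - half_bin))   # logistic CDF at the bin's lower edge
    # Edge bins absorb all probability mass below -1 and above 1.
    prob = torch.where(x < -0.999, cdf_plus,
                       torch.where(x > 0.999, 1.0 - cdf_minus, cdf_plus - cdf_minus))
    log_prob = torch.log(prob.clamp(min=1e-12))
    return torch.logsumexp(F.log_softmax(logit_pi, dim=-1) + log_prob, dim=-1)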
References
[1] Xi Chen et al. “Variational lossy autoencoder”. In: arXiv preprint arXiv:1611.02731 (2016).