EECS E6720: Bayesian Models for Machine Learning
Homework 3
Please read these instructions to ensure you receive full credit on your homework.
Submit the written portion of your homework as a single PDF file through Courseworks (less
than 5MB). In addition to your PDF write-up, submit all code written by you with its original
file extension through Courseworks (e.g., .m, .r, .py, etc.). Any coding language is acceptable, but
your code should be your own. Do not submit Jupyter or other notebooks; submit only the original
source code. Do not wrap your files in .rar, .zip, or .tar, and do not submit your write-up in .doc or
any other file type. Your grade will be based on the contents of one PDF file and the original source
code. Additional files will be ignored. We will not run your code, so everything you are asked to
show should be put in the PDF file. Show all work for full credit.
Late submission policy: Late homeworks will have 0.1% deducted from the final grade for
each minute late. Your homework submission time will be based on the time of your last submission to Courseworks. Therefore, do not re-submit after midnight on the due date unless you are
confident that the improvement will more than compensate for the points lost. You can resubmit
as many times as you like, but each time you resubmit, be sure to upload all files you want
graded! Submission time is non-negotiable and will be based on the time you submitted your last
file to Courseworks. The number of points deducted will be rounded to the nearest integer.
Problem 1. (50 points)
We have a data set of the form $\{(x_i, y_i)\}_{i=1}^{N}$, where $y \in \mathbb{R}$ and $x \in \mathbb{R}^d$. We assume $d$ is large and
not all dimensions of $x$ are informative in predicting $y$. Consider the following regression model
for this problem:
$$y_i \stackrel{ind}{\sim} \mathrm{Normal}(x_i^T w, \lambda^{-1}), \qquad w \sim \mathrm{Normal}(0, \mathrm{diag}(\alpha_1, \ldots, \alpha_d)^{-1}),$$
$$\alpha_k \stackrel{iid}{\sim} \mathrm{Gamma}(a_0, b_0), \qquad \lambda \sim \mathrm{Gamma}(e_0, f_0).$$
Use the density function $\mathrm{Gamma}(\eta \mid \tau_1, \tau_2) = \frac{\tau_2^{\tau_1}}{\Gamma(\tau_1)}\, \eta^{\tau_1 - 1} e^{-\tau_2 \eta}$. In this homework, you will derive
a variational inference algorithm for approximating the posterior distribution with
$$q(w, \alpha_1, \ldots, \alpha_d, \lambda) \approx p(w, \alpha_1, \ldots, \alpha_d, \lambda \mid y, x).$$
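Note that this is the shape–rate parameterization of the gamma distribution, which differs from the shape–scale convention used by most software. As a sanity check on the parameterization, here is a minimal sketch (in Python, assuming NumPy) of drawing one joint sample from the model's prior; `numpy.random.Generator.gamma` takes a scale argument, so the rate must be inverted:

```python
import numpy as np

def sample_from_prior(X, a0, b0, e0, f0, rng=np.random.default_rng(0)):
    """Draw one joint sample (alpha, lambda, w, y) from the model's prior.

    Gamma(a, b) in this homework is shape-rate, so pass scale = 1/b to NumPy.
    """
    N, d = X.shape
    alpha = rng.gamma(shape=a0, scale=1.0 / b0, size=d)  # alpha_k ~ Gamma(a0, b0)
    lam = rng.gamma(shape=e0, scale=1.0 / f0)            # lambda  ~ Gamma(e0, f0)
    w = rng.normal(0.0, 1.0 / np.sqrt(alpha))            # w_k ~ Normal(0, alpha_k^{-1})
    y = rng.normal(X @ w, 1.0 / np.sqrt(lam))            # y_i ~ Normal(x_i^T w, lambda^{-1})
    return alpha, lam, w, y
```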
a) Using the factorization $q(w, \alpha_1, \ldots, \alpha_d, \lambda) = q(w)\, q(\lambda) \prod_{k=1}^{d} q(\alpha_k)$, derive the optimal form
of each $q$ distribution. Use these optimal $q$ distributions to derive a variational inference
algorithm for approximating the posterior.
b) Summarize the algorithm derived in Part (a) using pseudo-code in a way similar to how
algorithms are presented in the notes for the class. (A hedged implementation sketch follows
this list for comparison.)
c) Using these q distributions, calculate the variational objective function. You will need to
evaluate this function in the next problem to show the convergence of your algorithm.
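As a reference point only, the following is a minimal Python sketch of one possible coordinate-ascent loop for this factorization, using the standard conjugate updates for this model. Treat it as something to check your own Part (a) derivation against, not as the graded answer; it also omits the variational objective required in Part (c).

```python
import numpy as np

def cavi(X, y, a0, b0, e0, f0, iters=500):
    """Coordinate-ascent variational inference for the sparse regression model.

    The update equations follow standard conjugate results; verify each one
    against your own derivation before relying on it.
    """
    N, d = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    E_alpha = np.full(d, a0 / b0)  # initialize with prior means
    E_lam = e0 / f0
    for _ in range(iters):
        # q(w) = Normal(mu, Sigma)
        Sigma = np.linalg.inv(np.diag(E_alpha) + E_lam * XtX)
        mu = E_lam * (Sigma @ Xty)
        # q(alpha_k) = Gamma(a0 + 1/2, b0 + E[w_k^2]/2)
        E_w2 = mu**2 + np.diag(Sigma)
        E_alpha = (a0 + 0.5) / (b0 + 0.5 * E_w2)
        # q(lambda) = Gamma(e0 + N/2, f0 + sum_i E[(y_i - x_i^T w)^2] / 2)
        E_resid = np.sum((y - X @ mu)**2) + np.trace(XtX @ Sigma)
        E_lam = (e0 + N / 2) / (f0 + 0.5 * E_resid)
    return mu, Sigma, E_alpha, E_lam
```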
Problem 2. (50 points)
Implement the algorithm derived in Problem 1 and run it on the three data sets provided. Set
the prior parameters $a_0 = b_0 = 10^{-16}$ and $e_0 = f_0 = 1$. We will not discuss sparsity-promoting
“ARD” priors in detail in this course, but setting $a_0$ and $b_0$ in this way will encourage only a few
dimensions of $w$ to be significantly non-zero, since many $\alpha_k$ should be extremely large according
to $q(\alpha_k)$.
For each of the three data sets provided, show the following:
a) Run your algorithm for 500 iterations and plot the variational objective function.
b) Using the final iteration, plot $1/\mathbb{E}_q[\alpha_k]$ as a function of $k$.
c) Give the value of $1/\mathbb{E}_q[\lambda]$ for the final iteration.
d) Using $\hat{w} = \mathbb{E}_{q(w)}[w]$, calculate $\hat{y}_i = x_i^T \hat{w}$ for each data point. Using the $z_i$ associated with
$y_i$ (see below), plot $\hat{y}_i$ vs. $z_i$ as a solid line. On the same plot, show $(z_i, y_i)$ as a scatter plot.
Also show the function $(z_i, 10\,\mathrm{sinc}(z_i))$ as a solid line in a different color.
Hint about Part (d): $z$ is the horizontal axis and $y$ the vertical axis. Both solid lines should
look like a function that smoothly passes through the data. The second line is ground truth.
(A plotting sketch follows this hint.)
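As a rough guide, here is a minimal matplotlib sketch of the Part (d) figure, assuming arrays `z`, `y`, and `y_hat` have already been computed; sorting by `z` keeps the solid lines from zig-zagging. One caveat: NumPy's `np.sinc` is the normalized sinc, $\sin(\pi x)/(\pi x)$, so if the intended function is $\sin(z)/z$, evaluate `np.sinc(z / np.pi)`.

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumed already available: z (locations), y (responses), y_hat (predictions x_i^T w_hat).
order = np.argsort(z)                      # sort so the lines are drawn left to right
plt.scatter(z, y, s=10, label="data $(z_i, y_i)$")
plt.plot(z[order], y_hat[order], "b-", label=r"prediction $\hat{y}_i$")
# np.sinc is sin(pi x)/(pi x); divide the argument by pi if sin(z)/z is intended.
plt.plot(z[order], 10 * np.sinc(z[order] / np.pi), "r-", label="ground truth")
plt.xlabel("z"); plt.ylabel("y"); plt.legend(); plt.show()
```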
Details about the data
The data was generated by sampling $z \sim \mathrm{Uniform}(-5, 5)$ independently $N$ times for $N =
100, 250, 500$ (giving a total of three data sets). For each $z_n$ in a given data set, the response is
$y_n = 10\,\mathrm{sinc}(z_n) + \epsilon_n$, where $\epsilon_n \sim N(0, 1)$.
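For reference, a minimal sketch of this generating process (hypothetical; the provided data sets should be used as-is). The same `np.sinc` normalization caveat from the plotting sketch applies:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100                                            # also 250 and 500 for the other two sets
z = rng.uniform(-5, 5, size=N)                     # z_n ~ Uniform(-5, 5)
y = 10 * np.sinc(z / np.pi) + rng.normal(size=N)   # y_n = 10 sinc(z_n) + eps_n, eps_n ~ N(0, 1)
```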
We use $z_n$ to construct a “kernel matrix” $X$. This is a mapping of $z_n$ into a higher-dimensional
space (see Bishop for more details). For our purposes, it's just important to know that the $n$th
row (or column, depending on which data set you use) of $X$ corresponds to the location $z_n$. We
let $X_{n,1} = 1$ and use the Gaussian kernel for the remaining dimensions, $X_{n,i+1} = \exp\{-(z_n - z_i)^2\}$
for $i = 1, \ldots, N$. Therefore, the dimensionality of each $x_i$ is one greater than the number of data
points. The sparse model picks out the relevant locations within the data for performing the
regression.
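To make the construction concrete, here is a minimal NumPy sketch of building $X$ from the location vector $z$ (only needed if you want to reproduce the matrix; the data sets already include $X$):

```python
import numpy as np

def build_kernel_matrix(z):
    """Build the N x (N+1) design matrix: a constant column, then Gaussian kernels."""
    N = len(z)
    X = np.ones((N, N + 1))                            # X[n, 0] = 1 (intercept dimension)
    X[:, 1:] = np.exp(-(z[:, None] - z[None, :])**2)   # X[n, i+1] = exp{-(z_n - z_i)^2}
    return X
```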
Each data set contains the vector $y$, the matrix $X$, and the vector of original locations $z$. This
last vector will be useful for plotting.