Problem 1.
Consider the training objective $J = \|Xw - t\|^2$ subject to $\|w\|^2 \le C$ for some constant $C$.
How would the hypothesis class capacity, overfitting/underfitting, and bias/variance vary according to $C$?
                                 Larger $C$               Smaller $C$
Model capacity (large/small?)    _____                    _____
Overfitting/Underfitting?        __fitting                __fitting
Bias/variance (high/low?)        __ bias / __ variance    __ bias / __ variance
Note: No proof is needed.
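One way to experiment with the effect of $C$ (an illustration, not a proof) is through the standard Lagrangian correspondence between the constrained problem and ridge regression: for some $\lambda \ge 0$, minimizing $\|Xw - t\|^2 + \lambda\|w\|^2$ matches the constrained solution, with a small $\lambda$ playing the role of a large $C$. In the sketch below, the data, the polynomial degree, and the $\lambda$ values are all arbitrary choices for the demo, not part of the problem.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D data lifted to degree-9 polynomial features (a high-capacity class).
x = rng.uniform(-1, 1, size=30)
t = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(30)
X = np.vander(x, 10, increasing=True)

# A held-out set to expose over/underfitting.
x_val = rng.uniform(-1, 1, size=200)
t_val = np.sin(2 * np.pi * x_val)
X_val = np.vander(x_val, 10, increasing=True)

# Small lam <-> weak constraint (large C); large lam <-> tight constraint (small C).
for lam in [1e-6, 1e-2, 1e2]:
    w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ t)
    train_mse = np.mean((X @ w - t) ** 2)
    val_mse = np.mean((X_val @ w - t_val) ** 2)
    print(f"lam={lam:8.0e}  ||w||^2={w @ w:10.3f}  "
          f"train MSE={train_mse:.3f}  val MSE={val_mse:.3f}")
```

Comparing $\|w\|^2$ and the train/validation gap across the three runs is a quick empirical way to fill in the table above.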
Problem 2.
Consider a one-dimensional linear regression model $t^{(m)} \sim \mathcal{N}(w x^{(m)}, \sigma_\epsilon^2)$ with a Gaussian prior $w \sim \mathcal{N}(0, \sigma^2)$. Show that the posterior of $w$ is also a Gaussian distribution, i.e., $w \mid x^{(1)}, t^{(1)}, \cdots, x^{(M)}, t^{(M)} \sim \mathcal{N}(\mu_{\mathrm{post}}, \sigma_{\mathrm{post}}^2)$. Give the formulas for $\mu_{\mathrm{post}}$ and $\sigma_{\mathrm{post}}^2$.
Hint: Work with $P(w \mid D) \propto P(w)\,P(D \mid w)$. You do not need to handle the normalizing term.
Note: If a prior has the same functional form as the posterior (typically with different parameters), it is known as a conjugate prior. The conjugacy above also applies to multi-dimensional Gaussians, but the formulas for the mean vector and the covariance matrix are more complicated.
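A derived answer can be sanity-checked numerically without knowing the closed form. The sketch below follows the hint exactly: it evaluates the unnormalized log posterior $\log P(w) + \log P(D \mid w)$ on a grid, reads off the posterior mean and variance, and checks that the log posterior is quadratic in $w$, which is precisely what makes the posterior Gaussian. All numeric values ($w_{\text{true}}$, $\sigma_\epsilon$, $\sigma$, $M$) are assumed for the demo.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative parameter choices (not part of the problem statement).
w_true, sigma_eps, sigma_prior, M = 1.5, 0.5, 1.0, 20
x = rng.uniform(-1, 1, size=M)
t = w_true * x + sigma_eps * rng.standard_normal(M)

# Following the hint: log P(w|D) = log P(w) + log P(D|w) + const.
w_grid = np.linspace(-5, 5, 20001)
dw = w_grid[1] - w_grid[0]
log_prior = -w_grid**2 / (2 * sigma_prior**2)
log_lik = -np.sum((t[None, :] - w_grid[:, None] * x[None, :])**2,
                  axis=1) / (2 * sigma_eps**2)
log_post = log_prior + log_lik

# Normalize on the grid and read off the posterior mean and variance.
post = np.exp(log_post - log_post.max())
post /= post.sum() * dw
mu_post = (w_grid * post).sum() * dw
var_post = ((w_grid - mu_post)**2 * post).sum() * dw

# A Gaussian has an exactly quadratic log density; if a quadratic fit
# reproduces log_post up to rounding error, the posterior is Gaussian.
coeffs = np.polyfit(w_grid, log_post, deg=2)
residual = np.max(np.abs(np.polyval(coeffs, w_grid) - log_post))
print(f"grid mu_post={mu_post:.4f}  var_post={var_post:.6f}")
print(f"max deviation of log posterior from a quadratic: {residual:.2e}")
```

The grid estimates of $\mu_{\mathrm{post}}$ and $\sigma_{\mathrm{post}}^2$ should match whatever closed-form expressions the derivation produces.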
Problem 3.
Give the prior distribution of $w$ for linear regression such that the maximum a posteriori estimate is equivalent to the $\ell_1$-penalized mean squared loss.
Note: Such a prior is known as the Laplace distribution. Also, you are not required to work out the normalization factor of the distribution.
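Since the note already names the distribution family, here is a quick numerical check (with arbitrary, assumed values for the noise level $\sigma_\epsilon$ and the Laplace scale $b$) that the negative log posterior under a Laplace prior $p(w) \propto \exp(-|w|/b)$ and the $\ell_1$-penalized squared loss differ only by a positive constant factor, and therefore share the same minimizer. A scalar weight keeps the grid search trivial.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative setup: scalar-weight linear regression (assumed values).
sigma_eps, b, M = 0.5, 0.2, 30      # noise std, Laplace scale, sample size
x = rng.uniform(-1, 1, size=M)
t = 0.8 * x + sigma_eps * rng.standard_normal(M)

w = np.linspace(-2, 2, 40001)
sq_err = np.sum((t[None, :] - w[:, None] * x[None, :])**2, axis=1)

# Negative log posterior with a Laplace prior p(w) ∝ exp(-|w|/b),
# dropping w-independent constants (the normalizing terms).
neg_log_post = sq_err / (2 * sigma_eps**2) + np.abs(w) / b

# l1-penalized squared loss with the matching penalty lam = 2*sigma_eps^2 / b.
lam = 2 * sigma_eps**2 / b
l1_obj = sq_err + lam * np.abs(w)

# Both objectives differ by the factor 2*sigma_eps^2, so the argmin agrees.
print("argmin neg log posterior:", w[np.argmin(neg_log_post)])
print("argmin l1-penalized loss:", w[np.argmin(l1_obj)])
```

The matching penalty strength $\lambda = 2\sigma_\epsilon^2 / b$ falls out of multiplying the negative log posterior by $2\sigma_\epsilon^2$, which is the relationship the problem asks you to derive in general.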
END OF W5