CS 189 Introduction to Machine Learning
HW3

Deliverables:
1. Submit your predictions for the test sets to Kaggle as early as possible. Include your Kaggle
scores in your write-up (see below). The Kaggle competition for this assignment can be
found at:
• https://www.kaggle.com/c/cs189-hw3-mnist
• https://www.kaggle.com/c/cs189-hw3-spam
2. Submit a PDF of your homework, with an appendix listing all your code, to the Gradescope assignment entitled “HW3 Write-Up”. You may typeset your homework in LaTeX or
Word (submit PDF format, not .doc/.docx format) or submit neatly handwritten and scanned
solutions. Please start each question on a new page. If there are graphs, include those
graphs in the correct sections. Do not put them in an appendix. We need each solution to be
self-contained on pages of its own.
• In your write-up, please state with whom you worked on the homework.
• In your write-up, please copy the following statement and sign your signature next to it.
(Mac Preview and FoxIt PDF Reader, among others, have tools to let you sign a PDF
file.) We want to make it extra clear so that no one inadvertently cheats. “I certify
that all solutions are entirely in my own words and that I have not looked at another
student’s solutions. I have given credit to all external sources I consulted.”
3. Submit all the code needed to reproduce your results to the Gradescope assignment entitled
“HW3 Code”. Yes, you must submit your code twice: once in your PDF write-up (above) so
the readers can easily read it, and once in compilable/interpretable form so the readers can
easily run it. Do NOT include any data files we provided. Please include a short file named
README listing your name, student ID, and instructions on how to reproduce your results.
Please take care that your code doesn’t take up inordinate amounts of time or memory. If
your code cannot be executed, your solution cannot be verified.
4. The assignment covers concepts on Gaussian distributions and classifiers. Some of the material may not have been covered in lecture; you are responsible for finding resources to
understand it.
1 Gaussian Classification
Let P(x | Ci) ∼ N(µi, σ²) for a two-category, one-dimensional classification problem with classes
C1 and C2, P(C1) = P(C2) = 1/2, and µ2 > µ1.
(a) Find the Bayes optimal decision boundary and the corresponding Bayes decision rule.
(b) The Bayes error is the probability of misclassification,
Pe = P((misclassified as C1) | C2) P(C2) + P((misclassified as C2) | C1) P(C1).
Show that the Bayes error associated with this decision rule is

    Pe = (1/√(2π)) ∫_a^∞ e^(−z²/2) dz,   where a = (µ2 − µ1)/(2σ).
2 Isocontours of Normal Distributions
Let f(µ, Σ) be the probability density function of a normally distributed random variable in R².
Write code to plot the isocontours of the following functions, each on its own separate figure.
You’re free to use any plotting libraries available in your programming language; for instance, in
Python you can use Matplotlib. (A minimal plotting sketch follows part (e).)
(a) f(µ, Σ), where µ = [1 1]ᵀ and Σ = [1 0; 0 2].

(b) f(µ, Σ), where µ = [−1 2]ᵀ and Σ = [2 1; 1 3].

(c) f(µ1, Σ1) − f(µ2, Σ2), where µ1 = [0 2]ᵀ, µ2 = [2 0]ᵀ, and Σ1 = Σ2 = [2 1; 1 1].

(d) f(µ1, Σ1) − f(µ2, Σ2), where µ1 = [0 2]ᵀ, µ2 = [2 0]ᵀ, Σ1 = [2 1; 1 1], and Σ2 = [2 1; 1 3].

(e) f(µ1, Σ1) − f(µ2, Σ2), where µ1 = [1 1]ᵀ, µ2 = [−1 −1]ᵀ, Σ1 = [2 0; 0 1], and Σ2 = [2 1; 1 2].

(Here [a b]ᵀ denotes a column vector and [a b; c d] a 2 × 2 matrix written row by row.)
3 Eigenvectors of the Gaussian Covariance Matrix
Consider two one-dimensional random variables X1 ∼ N(3, 9) and X2 ∼ (1/2)X1 + N(4, 4), where
N(µ, σ²) is a Gaussian distribution with mean µ and variance σ². Write a program that draws
n = 100 random two-dimensional sample points from (X1, X2) such that the ith value sampled
from X2 is calculated based on the ith value sampled from X1. In your code, make sure to specify
the random number generator seed that was used so your simulation is reproducible. For each of
the following parts, include the corresponding output of your program.
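A minimal sketch of the sampling step, assuming NumPy (the seed value 189 is an arbitrary choice; note that N(3, 9) and N(4, 4) have standard deviations 3 and 2):

import numpy as np

rng = np.random.default_rng(seed=189)                    # fix the RNG seed for reproducibility
n = 100
x1 = rng.normal(loc=3.0, scale=3.0, size=n)              # X1 ~ N(3, 9)
x2 = 0.5 * x1 + rng.normal(loc=4.0, scale=2.0, size=n)   # X2 = X1/2 + N(4, 4), computed per sample
samples = np.column_stack([x1, x2])                      # n x 2 array of (X1, X2) points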
(a) Compute the mean (in R²) of the sample.
(b) Compute the 2 × 2 covariance matrix of the sample.
(c) Compute the eigenvectors and eigenvalues of this covariance matrix.
(d) On a two-dimensional grid with a horizontal axis for X1 with range [−15, 15] and a vertical
axis for X2 with range [−15, 15], plot
(i) all n = 100 data points, and
(ii) arrows representing both covariance eigenvectors. The eigenvector arrows should originate at the mean and have magnitudes equal to their corresponding eigenvalues.
(e) Let U = [v1 v2] be a 2 × 2 matrix whose columns are the eigenvectors of the covariance
matrix, where v1 is the eigenvector with the larger eigenvalue. We use Uᵀ as a rotation
matrix to rotate each sample point from the (X1, X2) coordinate system to a coordinate system
aligned with the eigenvectors. (As Uᵀ = U⁻¹, the matrix U reverses this rotation, moving
back from the eigenvector coordinate system to the original coordinate system.) Center your
sample points by subtracting the mean µ from each point; then rotate each point by Uᵀ,
giving x_rotated = Uᵀ(x − µ). Plot these rotated points on a new two-dimensional grid, again
with both axes having range [−15, 15].
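A minimal sketch of the centering, eigendecomposition, and rotation described in part (e), continuing from the samples array in the previous sketch (variable names are illustrative):

mu = samples.mean(axis=0)                  # sample mean in R^2
cov = np.cov(samples, rowvar=False)        # 2 x 2 sample covariance (divides by n - 1)
eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns eigenvalues in ascending order
U = eigvecs[:, ::-1]                       # reorder columns so v1 has the larger eigenvalue
rotated = (samples - mu) @ U               # each row equals U^T (x - mu)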
4 Classification
Suppose we have a classification problem with classes labeled 1, . . . , c and an additional “doubt”
category labeled c + 1. Let r : Rᵈ → {1, . . . , c + 1} be a decision rule. Define the loss function

    L(r(x) = i, y = j) =  0    if i = j and i, j ∈ {1, . . . , c},
                          λr   if i = c + 1,
                          λs   otherwise,

where λr ≥ 0 is the loss incurred for choosing doubt and λs ≥ 0 is the loss incurred for making a
misclassification. Hence the risk of classifying a new data point x as class i ∈ {1, 2, . . . , c + 1} is

    R(r(x) = i | x) = ∑_{j=1}^{c} L(r(x) = i, y = j) P(Y = j | x).
(a) Show that the following policy obtains the minimum risk when λr ≤ λs.

    (1) Choose class i if P(Y = i|x) ≥ P(Y = j|x) for all j and P(Y = i|x) ≥ 1 − λr/λs;

    (2) Choose doubt otherwise.
(b) What happens if λr = 0? What happens if λr > λs? Explain why this is consistent with what
one would expect intuitively.
5 Maximum Likelihood Estimation
Let X1, . . . , Xn ∈ Rᵈ be n sample points drawn independently from a multivariate normal distribution N(µ, Σ).
(a) Suppose the normal distribution has an unknown diagonal covariance matrix

    Σ = diag(σ1², σ2², σ3², . . . , σd²)

and an unknown mean µ. Derive the maximum likelihood estimates, denoted µ̂ and σ̂i, for µ
and σi. Show all your work.
(b) Suppose the normal distribution has a known covariance matrix Σ and an unknown mean Aµ,
where Σ and A are known d × d matrices, Σ is positive definite, and A is invertible. Derive
the maximum likelihood estimate, denoted ˆµ, for µ.
6 Covariance Matrices and Decompositions
As described in lecture, the covariance matrix Var(R) ∈ R^(d×d) for a random variable R ∈ Rᵈ with
mean µ is

    Var(R) = Cov(R, R) = E[(R − µ)(R − µ)ᵀ]

              [ Var(R1)       Cov(R1, R2)   · · ·   Cov(R1, Rd) ]
            = [ Cov(R2, R1)   Var(R2)                Cov(R2, Rd) ]
              [    ⋮                           ⋱         ⋮       ]
              [ Cov(Rd, R1)   Cov(Rd, R2)   · · ·   Var(Rd)     ],

where Cov(Ri, Rj) = E[(Ri − µi)(Rj − µj)] and Var(Ri) = Cov(Ri, Ri).

If the random variable R is sampled from the multivariate normal distribution N(µ, Σ) with the
PDF

    f(x) = exp(−(x − µ)ᵀ Σ⁻¹ (x − µ)/2) / √((2π)ᵈ |Σ|),

then Var(R) = Σ.

Given n points X1, X2, . . . , Xn sampled from N(µ, Σ), we can estimate Σ with the maximum likelihood
estimator

    Σ̂ = (1/n) ∑_{i=1}^{n} (Xi − µ)(Xi − µ)ᵀ,

which is also known as the covariance matrix of the sample.
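As a concrete illustration of the estimator above (not something the problem asks for), assuming X is an n × d NumPy array of sample points and mu is the mean vector used in the estimator:

import numpy as np

def mle_covariance(X, mu):
    # (1/n) * sum_i (X_i - mu)(X_i - mu)^T, the maximum likelihood covariance estimate.
    centered = X - mu                          # subtract mu from every row
    return centered.T @ centered / X.shape[0]  # note: np.cov divides by n - 1 instead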
(a) The estimate Σ̂ makes sense as an approximation of Σ only if Σ̂ is invertible. Under what
circumstances is Σ̂ not invertible? Make sure your answer is complete; i.e., it includes all
cases in which the covariance matrix of the sample is singular. Express your answer in terms
of the geometric arrangement of the sample points Xi.

(b) Suggest a way to fix a singular covariance matrix estimator Σ̂ by replacing it with a similar but
invertible matrix. Your suggestion may be a kludge, but it should not change the covariance
matrix too much. Note that infinitesimal numbers do not exist; if your solution uses a very
small number, explain how to calculate a number that is sufficiently small for your purposes.
(c) Consider the normal distribution N(0, Σ) with mean µ = 0. Consider all vectors of length
1; i.e., any vector x for which |x| = 1. Which vector(s) x of length 1 maximize the PDF
f(x)? Which vector(s) x of length 1 minimize f(x)? (Your answers should depend on the
properties of Σ.) Explain your answer.
7 Gaussian Classifiers for Digits and Spam
In this problem, you will build classifiers based on Gaussian discriminant analysis. Unlike Homework 1, you are NOT allowed to use any libraries for out-of-the-box classification (e.g. sklearn).
You may use anything in numpy and scipy.
The training and test data can be found in the post corresponding to this homework. Don’t use
the training/test data from Homework 1, as they have changed for this homework. Submit your
predicted class labels for the test data on the Kaggle competition website and be sure to include
your Kaggle display name and scores in your writeup. Also be sure to include an appendix of your
code at the end of your writeup.
(a) Taking pixel values as features (no new features yet, please), fit a Gaussian distribution to
each digit class using maximum likelihood estimation. This involves computing a mean and a
covariance matrix for each digit class, as discussed in lecture.
Hint: You may, and probably should, contrast-normalize the images before using their pixel
values. One way to normalize is to divide the pixel values of an image by the l2-norm of its
pixel values. (A minimal sketch of this normalization appears at the end of this problem.)
(b) (Written answer) Visualize the covariance matrix for a particular class (digit). How do the
diagonal terms compare with the off-diagonal terms? What do you conclude from this?
(c) Classify the digits in the test set on the basis of posterior probabilities with two different approaches.
(1) Linear discriminant analysis (LDA). Model the class conditional probabilities as Gaussians
N(µC, Σ) with different means µC (for class C) and the same covariance matrix Σ, which
you compute by averaging the 10 covariance matrices from the 10 classes.
To implement LDA, you will sometimes need to compute a matrix-vector product of the
form Σ⁻¹x for some vector x. You should not try to compute the inverse of Σ (nor the
determinant of Σ). Instead, you should find a way to solve the implied linear system
without computing the inverse (see the sketch at the end of this problem).
Hold out 10,000 randomly chosen training points for a validation set. Classify each image
in the validation set into one of the 10 classes (with a 0-1 loss function). Compute the error
rate and plot it over the following numbers of randomly chosen training points: [100, 200,
500, 1,000, 2,000, 5,000, 10,000, 30,000, 50,000]. (Expect some variance in your error
rate when few training points are used.)
(2) Quadratic discriminant analysis (QDA). Model the class conditionals as Gaussians N(µC, ΣC),
where ΣC is the estimated covariance matrix for class C. (If any of these covariance matrices turn out singular, implement the trick you described in Q6.(b). You are welcome to
use k-fold cross validation to choose the right constant(s) for that trick.) Repeat the same
tests and error rate calculations you did for LDA.
(3) (Written answer.) Which of LDA and QDA performed better? Why?
(4) Using the mnist data.mat, train your best classifier for the training data and classify
the images in the test data. Submit your labels to the online Kaggle competition. Record
your optimum prediction rate in your submission. You are welcome to compute extra
features for the Kaggle competition. If you do so, please describe your implementation in
your assignment. Please use extra features only for the Kaggle portion of the assignment.
In your submission, include plots of error rate versus number of training examples for both
LDA and QDA. Similarly, include a plot of training and test error for each digit. Which
digit is easiest to classify? Include written answers where indicated.
(d) Next, apply LDA or QDA (your choice) to spam. Submit your test results to the online Kaggle
competition. Record your optimum prediction rate in your submission. If you use additional
features (or omit features), please describe them.
Optional: If you use the defaults, expect relatively low classification rates. The TAs suggest
using a bag-of-words model. You may use third-party packages to implement that if you wish.
Also, normalizing your vectors might help.
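Below is a minimal, non-authoritative sketch of the pieces referenced above: contrast normalization (part (a) hint), per-class Gaussian fitting, and computing Σ⁻¹x by solving a linear system instead of inverting Σ (part (c)(1)). It assumes the training images are an n × d NumPy array X with integer labels y; the function names and the regularization constant eps are illustrative, and eps stands in for whatever fix you chose in Q6(b).

import numpy as np
from scipy.linalg import solve

def contrast_normalize(X):
    # Divide each image (row) by its l2-norm so overall brightness is factored out.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, 1e-12)              # guard against all-zero rows

def fit_class_gaussians(X, y, num_classes=10, eps=1e-6):
    # MLE mean and covariance per class; eps * I keeps each covariance matrix invertible.
    d = X.shape[1]
    means, covs, priors = [], [], []
    for c in range(num_classes):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        centered = Xc - mu
        covs.append(centered.T @ centered / Xc.shape[0] + eps * np.eye(d))
        means.append(mu)
        priors.append(Xc.shape[0] / X.shape[0])
    return means, covs, priors

# Computing Sigma^{-1} x without forming the inverse: solve the linear system Sigma v = x.
# For LDA, Sigma would be the average of the per-class covariance matrices. Example:
# v = solve(Sigma, x, assume_a='pos')    # Cholesky-based solve for symmetric positive definite Sigma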