CSE512 – Machine Learning
Homework 2
Version 1
This homework contains 3 questions. The last question requires programming. The maximum number of
points is 100 plus 10 bonus points. Bonus points will not be used to fit the grade curve.
For Question 2 (and it is a good idea in general), we will use the following notation. Bold uppercase letters denote matrices (e.g., $\mathbf{D}$), and bold lowercase letters denote column vectors (e.g., $\mathbf{d}$). $\mathbf{d}_i$ represents the $i$-th column of the matrix $\mathbf{D}$. $d_{ij}$ denotes the scalar in the $j$-th row and $i$-th column of the matrix $\mathbf{D}$; it is also the $j$-th entry of the column vector $\mathbf{d}_i$. Non-bold letters represent scalar variables. $\mathbf{1}_k \in \mathbb{R}^{k \times 1}$ is a column vector of ones. $\mathbf{0}_k \in \mathbb{R}^{k \times 1}$ is a column vector of zeros. $\mathbf{I}_k \in \mathbb{R}^{k \times k}$ is the identity matrix. We use a comma to denote horizontal concatenation of two vectors (or matrices) (e.g., $[\mathbf{a}, \mathbf{b}]$). We use a semicolon to denote vertical concatenation of two vectors (e.g., $[\mathbf{a}; \mathbf{b}]$).
1 Question 1 – Parameter estimation (30 points)
This question uses a discrete probability distribution known as the Poisson distribution. A discrete random
variable X follows a Poisson distribution with parameter λ if
\[
p(X = k \mid \lambda) = \frac{\lambda^k}{k!}\, e^{-\lambda}, \qquad k \in \{0, 1, 2, \ldots\}
\]
The Poisson distribution is a useful discrete distribution which can be used to model the number of occurrences of something per unit time. For example, it can be used to model the number of customers arriving at
a bank per hour, or the number of calls handled by a telephone operator per minute.
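As a quick illustration (optional, not part of the graded questions), the pmf above can be evaluated directly in Matlab; summing it over a large enough range of k should give approximately 1:

% Sanity check: evaluate the Poisson pmf and confirm it sums to ~1.
lambda = 5;                                      % example rate parameter
k = 0:30;                                        % covers essentially all the mass
pmf = lambda.^k ./ factorial(k) .* exp(-lambda);
fprintf('sum of pmf over k = 0..30: %.6f\n', sum(pmf));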
1.1 Question 1.1 – MLE (10 points)
Consider catching the LIRR train from Stony Brook to Penn Station. Of course, the train is always late; assume the amount of delay in minutes is Poisson distributed (i.i.d.) with parameter $\lambda$. Since you like Times Square so much, you have visited NYC several times and recorded the train delay on each trip.
Trip              1  2  3  4  5  6   7
Delay in minutes  4  5  3  5  6  9  10
Let $X = (X_1, \ldots, X_n)$ be a random vector where $X_i$ is the number of delay minutes for your $i$-th trip:
1. (4 points) Give the log-likelihood function of X given λ.
2. (4 points) Compute the MLE for λ in the general case.
3. (2 points) Compute the MLE for $\lambda$ using the observed $X$.
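A hedged way to sanity-check your answers numerically (optional; this assumes the delays from the table above): maximize the log-likelihood over a grid of candidate $\lambda$ values and compare the result with your closed-form MLE.

% Grid-search check of the Poisson MLE on the observed delays.
X = [4 5 3 5 6 9 10];                   % observed delays from the table
lambdas = 0.01:0.01:20;                 % grid of candidate parameters
loglik = zeros(size(lambdas));
for j = 1:numel(lambdas)
    lam = lambdas(j);
    loglik(j) = sum(X .* log(lam) - lam - log(factorial(X)));
end
[~, best] = max(loglik);
fprintf('grid-search MLE of lambda: %.2f\n', lambdas(best));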
1.2 Question 1.2 – MAP (10 points)
Now let’s be Bayesian and put a prior over the parameter λ. You talk to your party-goer friends, who tell
you about the expected delays that they experience. You plot the distribution of the expected delays and your
extensive experience in statistics tells you that the plot resembles a Gamma distribution pdf. So you believe
a good prior distribution for λ may be a Gamma distribution. The Gamma distribution has pdf:
\[
p(\lambda \mid \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, \lambda^{\alpha - 1} e^{-\beta\lambda}, \qquad \lambda > 0. \tag{1}
\]
$\Gamma(\alpha)$ is the Gamma function, the normalization factor introduced to make this a proper probability density function (i.e., integrate to 1). Do not worry if you don't know the explicit form of $\Gamma(\alpha)$; it is not important for this question. Also, if $\lambda \sim \text{Gamma}(\alpha, \beta)$, then it has mean $\alpha/\beta$, and its mode is $(\alpha - 1)/\beta$ for $\alpha > 1$. Assume the prior distribution of $\lambda$ is $\text{Gamma}(\lambda \mid \alpha, \beta)$.
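If you want to visualize the prior (optional), Matlab's gampdf from the Statistics Toolbox can help. Note that gampdf is parameterized by shape and scale, so the rate $\beta$ in Equation (1) corresponds to a scale of $1/\beta$. A minimal sketch, with hypothetical hyperparameter values:

alpha = 3; beta = 0.5;                  % hypothetical prior hyperparameters
lam = linspace(0.01, 30, 500);
prior = gampdf(lam, alpha, 1/beta);     % gampdf uses shape/scale, hence 1/beta
plot(lam, prior); xlabel('\lambda'); ylabel('p(\lambda|\alpha,\beta)');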
1. (5 points) Compute the posterior distribution over $\lambda$. Hint:
\[
\lambda^{\sum_{i=1}^{n} X_i + \alpha - 1}\, e^{-\lambda(n + \beta)}
\]
looks like a Gamma distribution! Is the rest of the expression constant with respect to $\lambda$? Working out a messy integral can lead to the answer, but it is unnecessary.
2. (5 points) Derive an analytic expression for the maximum a posteriori (MAP) estimate of $\lambda$.
1.3 Question 1.3 – Estimator Bias (10 points)
In class, we saw that the maximum likelihood estimator for the variance of a Gaussian distribution:
\[
\hat{\sigma}^2_{\text{MLE}} = \frac{1}{N}\sum_{i=1}^{N}(x_i - \hat{\mu})^2
\]
is biased, and that an unbiased estimator of variance is:
\[
\hat{\sigma}^2_{\text{unbiased}} = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \hat{\mu})^2
\]
For the Gaussian distribution, these estimators give similar results for large enough N, and it is unclear
whether one estimator is preferable to the other. In this problem, we will explore an example in which the
maximum likelihood estimate is dramatically superior to any unbiased estimator.
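For intuition (not required for this question), a small Monte Carlo simulation makes the bias of the $1/N$ estimator visible; a small $N$ exaggerates the effect:

% Compare the averages of the two variance estimators over many trials.
rng(0); trials = 20000; N = 5; sigma2 = 4;
mleEst = zeros(trials, 1); unbEst = zeros(trials, 1);
for t = 1:trials
    x = sqrt(sigma2) * randn(N, 1);     % Gaussian sample with variance sigma2
    mu = mean(x);
    mleEst(t) = sum((x - mu).^2) / N;
    unbEst(t) = sum((x - mu).^2) / (N - 1);
end
fprintf('true: %.2f, MLE avg: %.2f, unbiased avg: %.2f\n', ...
    sigma2, mean(mleEst), mean(unbEst));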
Suppose we are not actually interested in estimating $\lambda$; we are more interested in estimating a quantity that is a nonlinear function of $\lambda$, namely $\eta = e^{-2\lambda}$. Suppose we want to estimate $\eta$ from a single observation $X \sim \text{Poisson}(\lambda)$.
1. [4pts] Let $\hat{\eta} = e^{-2X}$. Show that $\hat{\eta}$ is the maximum likelihood estimate of $\eta$.
2. [3pts] Show that the bias of $\hat{\eta}$ is $e^{-(1 - 1/e^2)\lambda} - e^{-2\lambda}$. The following Taylor expansion may be useful:
\[
e^x = \sum_{k=0}^{\infty} \frac{x^k}{k!}
\]
3. [3pts] It turns out that $(-1)^X$ is the only unbiased estimate of $\eta$. Prove that it is indeed unbiased and briefly explain why this is a bad estimator to use.
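As a hedged numeric check of part 3 (a simulation only suggests unbiasedness, it does not prove it; poissrnd requires the Statistics Toolbox):

% E[(-1)^X] for X ~ Poisson(lambda) should match eta = exp(-2*lambda),
% even though each individual estimate is only +1 or -1.
rng(0); lambda = 1.5; trials = 200000;
X = poissrnd(lambda, trials, 1);
est = (-1).^X;
fprintf('mean of (-1)^X: %.4f vs exp(-2*lambda): %.4f\n', ...
    mean(est), exp(-2*lambda));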
2 Question 2 – Ridge Regression and LOOCV (30 points)
In class, you learned about using cross validation as a way to estimate the true error of a learning algorithm.
The preferred solution is Leave-One-Out Cross Validation (LOOCV), which provides an almost unbiased
estimate of this true error, but it can take a really long time to compute. In this problem, you will derive a
formula for efficiently computing the LOOCV error for ridge regression.
Given a set of $n$ data points and associated labels $\{\mathbf{x}_i, y_i \mid \mathbf{x}_i \in \mathbb{R}^k, y_i \in \mathbb{R}\}_{i=1}^{n}$, ridge regression finds the weight vector $\mathbf{w}$ and a bias term $b$ that optimize the following:
\[
\underset{\mathbf{w}, b}{\text{minimize}} \quad \lambda\|\mathbf{w}\|^2 + \sum_{i=1}^{n} (\mathbf{w}^T \mathbf{x}_i + b - y_i)^2. \tag{2}
\]
2.1 (5 points)
Let $\bar{\mathbf{w}} = [\mathbf{w}; b]$, $\bar{\mathbf{X}} = [\mathbf{X}; \mathbf{1}_n^T]$, $\bar{\mathbf{I}} = [\mathbf{I}_k, \mathbf{0}_k; \mathbf{0}_k^T, 0]$, $\mathbf{C} = \bar{\mathbf{X}}\bar{\mathbf{X}}^T + \lambda\bar{\mathbf{I}}$, and $\mathbf{d} = \bar{\mathbf{X}}\mathbf{y}$. Show that the solution of ridge regression is:
\[
\bar{\mathbf{w}} = \mathbf{C}^{-1}\mathbf{d} \tag{3}
\]
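Equation (3) translates directly into Matlab. Below is a minimal sketch (not a definitive implementation), assuming X is the k x n data matrix, y is the n x 1 label vector, and lambda is a scalar; it uses backslash rather than an explicit inverse:

[k, n] = size(X);
Xbar = [X; ones(1, n)];           % append a row of ones for the bias term
Ibar = blkdiag(eye(k), 0);        % identity with a zero entry for the bias
C = Xbar * Xbar' + lambda * Ibar;
d = Xbar * y;
wbar = C \ d;                     % solves C * wbar = d, i.e., Equation (3)
w = wbar(1:k); b = wbar(k+1);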
2.2 (5 points)
Now suppose we remove $\mathbf{x}_i$ from the training data; let $\mathbf{C}_{(i)}$, $\mathbf{d}_{(i)}$, $\bar{\mathbf{w}}_{(i)}$ be the corresponding matrices after removing $\mathbf{x}_i$. Express $\mathbf{C}_{(i)}$ in terms of $\mathbf{C}$ and $\mathbf{x}_i$. Express $\mathbf{d}_{(i)}$ in terms of $\mathbf{d}$ and $\mathbf{x}_i$.
2.3 (5 points)
Express $\mathbf{C}_{(i)}^{-1}$ in terms of $\mathbf{C}^{-1}$ and $\mathbf{x}_i$. Hint: use the Sherman-Morrison formula:
\[
(\mathbf{A} + \mathbf{u}\mathbf{v}^T)^{-1} = \mathbf{A}^{-1} - \frac{\mathbf{A}^{-1}\mathbf{u}\mathbf{v}^T\mathbf{A}^{-1}}{1 + \mathbf{v}^T\mathbf{A}^{-1}\mathbf{u}} \tag{4}
\]
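An optional numeric check of the Sherman-Morrison identity on random data:

rng(0); m = 5;
A = randn(m) + m * eye(m);        % a well-conditioned test matrix
u = randn(m, 1); v = randn(m, 1);
Ainv = inv(A);
lhs = inv(A + u * v');
rhs = Ainv - (Ainv * u) * (v' * Ainv) / (1 + v' * Ainv * u);
fprintf('max abs difference: %.2e\n', max(abs(lhs(:) - rhs(:))));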
2.4 (5 points)
Show that
\[
\bar{\mathbf{w}}_{(i)} = \bar{\mathbf{w}} + (\mathbf{C}^{-1}\mathbf{x}_i)\,\frac{-y_i + \mathbf{x}_i^T\bar{\mathbf{w}}}{1 - \mathbf{x}_i^T\mathbf{C}^{-1}\mathbf{x}_i} \tag{5}
\]
2.5 (5 points)
Show that the leave-one-out error for removing the $i$-th training data point is:
\[
\bar{\mathbf{w}}_{(i)}^T\mathbf{x}_i - y_i = \frac{\bar{\mathbf{w}}^T\mathbf{x}_i - y_i}{1 - \mathbf{x}_i^T\mathbf{C}^{-1}\mathbf{x}_i} \tag{6}
\]
2.6 (5 points)
The LOOCV error is defined as $\sum_{i=1}^{n}(\bar{\mathbf{w}}_{(i)}^T\mathbf{x}_i - y_i)^2$. What is the algorithmic complexity of computing the LOOCV error using the formula given in Question 2.5? How does it compare with the usual way of computing LOOCV? Note that the complexity of inverting a $k \times k$ matrix is $O(k^3)$.
3 Question 3 – Programming [40 points]
3.1 Implement LOOCV Ridge Regression
Implement a function to perform ridge regression with LOOCV as explained in Question 2. You can use Matlab or any other programming language for this assignment. If you use Matlab, your function should have the following signature:
[w, b, obj, cvErrs] = ridgeReg(X, y, λ)
Inputs:
• X: a k × n data matrix, for n data points, each with k dimensions.
• y: an n × 1 label vector (not necessarily binary).
• λ: the weight of the regularization term of ridge regression.
Outputs:
• w: k × 1 vector
• b: a scalar value for the bias term
• obj: the value of the objective function, i.e., $obj = \lambda\|\mathbf{w}\|^2 + \sum_{i=1}^{n}(\mathbf{w}^T\mathbf{x}_i + b - y_i)^2$
• cvErrs: an n × 1 vector of cross-validation errors; cvErrs(i) is the leave-one-out error for removing the $i$-th training data point.
Other programming languages should use a similar function signature. We recommend Matlab: it is a very useful language, and you should find that it achieves reasonable performance for this problem. Matlab has many toolboxes, but you cannot use any existing ridge function from the Statistics Toolbox (and similarly for other languages).
Before you get started, here are some hints that you may find helpful:
• The only matrix operations required are: multiplying a matrix and a vector, adding vectors, computing the dot product between two vectors, point-wise matrix operations (e.g., raising entries to the power of 2), and computing the sum of the columns or rows of a matrix. Try to use as much vector/matrix computation as possible.
• You need to implement the efficient method for computing the LOOCV errors, but start by implementing the ridge regression solution without LOOCV. Then implement the 'inefficient' way of computing the LOOCV errors by actually removing each training data point in turn and recomputing the solution. Use this inefficient method to validate the results of the efficient method.
Finally, here are some pointers to useful parts of Matlab:
• B = repmat(A, [m, n]): replicate the matrix A in an m-by-n tiling.
• Use sum(A, dim) to compute the sum along the rows or columns of a matrix A.
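A tiny demo of these two functions (a sketch, not required code):

A = [1 2; 3 4];
B = repmat(A, [2, 3]);            % a 4 x 6 matrix tiling A in a 2-by-3 grid
colSums = sum(A, 1);              % 1 x 2 row vector: sums down each column
rowSums = sum(A, 2);              % 2 x 1 column vector: sums across each row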
3.2 Predicting the points of a wine given its review [40 points + 10 bonus]
We'll now put ridge regression to work on some real data. For this homework, the task is to predict the points a wine should receive based on its review text alone.
For this task, you are free to use any of the features given; you can even generate your own features from the given data. For instance, you could square features, normalize some features, or combine features to improve your results.
The data was scraped from WineEnthusiast during the week of June 15th, 2017 and can be found on the analytics website Kaggle (https://www.kaggle.com/zynicide/wine-reviews). (As a side note, browsing other competitions on the site may also give you some ideas for class projects.) This dataset provides a variety of features: the points, description (review), variety, country, province, etc. For this homework we will be using the points and description features. Points are ratings in the range 1-100; however, the dataset only contains descriptions for wines rated in the range 80-100. We already provide the feature vectors for each review. You can download the data for this homework from
https://www.kaggle.com/t/342c03e816ee45d09cde62892c7091b2
The following files will be available for download on the project webpage:
trainData.csv: Training data matrix
trainLabels.csv: Training labels
valData.csv: Validation data
valLabels.csv: Validation labels
testData.csv: Test data
featureTypes.txt: List of features used
The data matrices are stored in CSV format. Each line is a wine instance containing various feature values. A description of each feature is provided in featureTypes.txt.
In Matlab, you can use the following code to load train data and labels:
D = csvread('trainData.csv', 1, 0);
trLb = csvread('trainLabels.csv', 1, 0);
Don't forget: the first column contains IDs, which you might want to ignore in all the computations that follow.
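For example, assuming the label file follows the same ID-first layout (the variable names here are illustrative):

ids = D(:, 1);                    % first column: instance IDs
Xtr = D(:, 2:end)';               % transpose so the data matrix is k x n
ytr = trLb(:, 2);                 % the actual labels (points)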
For this part of the problem, you have the following tasks:
1. (12 points) Solve ridge regression to predict the number of points a wine will receive. Run ridge regression on the training set with λ = 0.01, 0.1, 1, 10, 100, 1000, using previous solutions as initial conditions for each problem. For each solution, record the root-mean-squared error (RMSE) on the training data, the validation data, and under leave-one-out cross-validation.
Plot the training, validation, and leave-one-out cross-validation RMSE values together in a single plot against λ. Label each curve in the plot.
2. (7 points) Which λ achieves the best LOOCV performance? For the model using this λ, report the objective value, the sum of squared errors (on training data), and the value of the regularization term.
3. (7 points) Using the model computed with the λ that achieves the best LOOCV performance, list the 10 most important features and the 10 least important features. Comment on whether the weights make sense intuitively.
4. (14 points + 10 bonus points) Use your model to predict the points for the reviews in the test data. Save the predicted values in a CSV file predTestLabels.csv (see Section 4.2 for the format). The order of predicted values should correspond to the order of the reviews in testData.csv (counting starts from 0).
Submit this predTestLabels.csv file and enter our Kaggle competition for fame. You must use your Stony Brook email to register on Kaggle. We will maintain a leader board, and the top three entries on the private leader board at the end of the competition (due date) will receive 10 bonus points. The ranking is based on RMSE. The points for this question will be given in proportion to your rank on the public leader board on Kaggle. Report the RMSE.
To prevent exploiting test data, you are allowed to make a maximum of 3 submissions per 24 hours.
Your submission will be evaluated immediately and the leader board will be updated.
You are allowed to create new features from the existing set of features. You are allowed to perform
feature normalization on existing and new features. You can use both training and validation data to
train your classifier. You can also use Lasso regression instead of Ridge. If you use Lasso regression, you must implement it yourself. You cannot use Multilayer Perceptron or any Deep Learning
techniques for prediction.
4 What to submit?
4.1 Blackboard submission
You will need to submit both your code and your answers to the questions on Blackboard. Put the answer file and your code in a folder named SUBID FirstName LastName (e.g., 10947XXXX lionel messi). Zip this folder and submit the zip file on Blackboard. Your submission must be a zip file, i.e., SUBID FirstName LastName.zip. The answer file should be named answers.pdf. The first page of answers.pdf should be the filled-in cover page found at the end of this homework. The remainder of the answer file should contain:
1. Answers (and derivations) to Questions 1 and 2
2. Answers and plots to Question 3.2.1
3. Answers to Question 3.2.2, 3.2.3 and 3.2.4
You can use LaTeX if you wish, but it is not compulsory.
4.2 Kaggle submission
For Question 3.2.4, you must submit a CSV file to get the RMSE from the competition site https://
www.kaggle.com/t/342c03e816ee45d09cde62892c7091b2. A submission file should contain
two columns: Id and Prediction. The file should contain a header and have the following format.
Id,Prediction
0,82000
1,106000
...
A sample submission file is available from the competition site and our handout.
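A minimal sketch for writing such a file from Matlab, assuming your predictions are in a vector pred ordered to match testData.csv:

fid = fopen('predTestLabels.csv', 'w');
fprintf(fid, 'Id,Prediction\n');              % required header
for i = 1:numel(pred)
    fprintf(fid, '%d,%f\n', i - 1, pred(i));  % Ids start from 0
end
fclose(fid);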
5 Cheating warnings
Don't cheat. You must do the homework yourself; otherwise you won't learn. You cannot ask for or discuss solutions with students from previous years. You cannot look up solutions online.
Cover page for answers.pdf
CSE512 Fall 2018 - Machine Learning - Homework 2
Your Name:
Solar ID:
NetID email address:
Names of people whom you discussed the homework with: