CS 189 Introduction to Machine Learning
HW4
Deliverables:
1. Submit your predictions for the test sets to Kaggle as early as possible. Include your Kaggle
scores in your write-up (see below). The Kaggle competition for this assignment can be
found at:
• https://www.kaggle.com/c/cs189-hw4-wine
2. Submit a PDF of your homework, with an appendix listing all your code, to the Gradescope
assignment entitled “Homework 4 Write-Up”. In addition, please include, as your solutions
to each coding problem, the specific subset of code relevant to that part of the problem. You
may typeset your homework in LaTeX or Word (submit PDF format, not .doc/.docx format)
or submit neatly handwritten and scanned solutions. Please start each question on a new
page. If there are graphs, include those graphs in the correct sections. Do not put them in an
appendix. We need each solution to be self-contained on pages of its own.
• In your write-up, please state with whom you worked on the homework.
• In your write-up, please copy the following statement and sign your signature next to it.
(Mac Preview and FoxIt PDF Reader, among others, have tools to let you sign a PDF
file.) We want to make it extra clear so that no one inadvertently cheats. “I certify
that all solutions are entirely in my own words and that I have not looked at another
student’s solutions. I have given credit to all external sources I consulted.”
3. Submit all the code needed to reproduce your results to the Gradescope assignment entitled “Homework 4 Code”. Yes, you must submit your code twice: once in your PDF write-up,
following the directions described above, so the readers can easily read it, and once in
compilable/interpretable form so the readers can easily run it. Do NOT include any data files
we provided. Please include a short file named README listing your name, student ID, and
instructions on how to reproduce your results. Please take care that your code doesn’t take
up inordinate amounts of time or memory. If your code cannot be executed, your solution
cannot be verified.
1 Logistic Regression with Newton’s Method
Consider sample points $X_1, X_2, \ldots, X_n \in \mathbb{R}^d$ and associated values $y_1, y_2, \ldots, y_n \in \{0, 1\}$, an $n \times d$ design matrix $X = [X_1 \; \cdots \; X_n]^\top$, and an $n$-vector $y = [y_1 \; \cdots \; y_n]^\top$.
If we add $\ell_2$-regularization to logistic regression, the cost function is
$$J(w) = \lambda \, |w|^2 - \sum_{i=1}^{n} \bigl( y_i \ln s_i + (1 - y_i) \ln(1 - s_i) \bigr),$$
where $s_i = s(X_i \cdot w)$, $s(\gamma) = 1/(1 + e^{-\gamma})$, and $\lambda > 0$ is the regularization parameter. As in lecture, the vector $s = [s_1 \; \cdots \; s_n]^\top$ is a useful shorthand.
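For later reference, here is a minimal NumPy sketch of this cost function (our own illustrative code, not part of the assignment; the clipping guard is an implementation convenience to avoid evaluating $\ln 0$):

import numpy as np

def sigmoid(gamma):
    # Logistic function s(gamma) = 1 / (1 + e^(-gamma)).
    return 1.0 / (1.0 + np.exp(-gamma))

def cost(w, X, y, lam):
    # J(w) = lambda |w|^2 - sum_i [ y_i ln s_i + (1 - y_i) ln(1 - s_i) ].
    s = np.clip(sigmoid(X @ w), 1e-12, 1.0 - 1e-12)  # guard against log(0)
    return lam * (w @ w) - np.sum(y * np.log(s) + (1.0 - y) * np.log(1.0 - s))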
In this problem, you will use Newton’s method to minimize this cost function on the four-point,
two-dimensional training set
$$X = \begin{bmatrix} 0 & 3 \\ 1 & 3 \\ 0 & 1 \\ 1 & 1 \end{bmatrix}, \qquad y = \begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \end{bmatrix}.$$
You may want to draw these points on paper to see what they look like. The y-vector implies that
the first two sample points are in class 1, and the last two are in class 0.
These sample points cannot be separated by a linear decision boundary that passes through the origin. As described in lecture, append a 1 to each $X_i$ vector and use a weight vector $w \in \mathbb{R}^3$ whose last component is the bias term (the term we call $\alpha$ in lecture).
1. Derive the gradient of the cost function $J(w)$. Your final answer should be a simple matrix-vector expression. While you may derive the matrix-vector form by first deriving the components of the gradient vector, do NOT write your answer in terms of these individual components.
2. Derive the Hessian of $J(w)$. Again, your answer should be a simple matrix-vector expression.
3. State the update equation for one iteration of Newton’s method for this problem.
4. We are given a regularization parameter of $\lambda = 0.07$ and a starting point of $w^{(0)} = \begin{bmatrix} -2 & 1 & 0 \end{bmatrix}^\top$.
For the following four parts, you need only state the final solution. Thus you may derive the solution by hand or implement Newton's algorithm and report the final result. If you do the latter, you do not need to submit code for this part. (A minimal sketch of such an implementation appears after part (d) below.)
(a) State the value of $s^{(0)}$ (the value of $s$ before any iterations).
(b) State the value of $w^{(1)}$ (the value of $w$ after one iteration).
(c) State the value of $s^{(1)}$.
(d) State the value of $w^{(2)}$ (the value of $w$ after two iterations).
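The following is one possible Newton implementation for this part (our own illustrative code; it assumes the standard forms $\nabla J(w) = 2\lambda w - X^\top (y - s)$ and $\nabla^2 J(w) = 2\lambda I + X^\top \Omega X$ with $\Omega = \operatorname{diag}(s_i(1 - s_i))$, which parts 1 and 2 ask you to derive, so verify them against your own answers):

import numpy as np

# Training points with a 1 appended as the last (bias) component.
X = np.array([[0., 3., 1.],
              [1., 3., 1.],
              [0., 1., 1.],
              [1., 1., 1.]])
y = np.array([1., 1., 0., 0.])
lam = 0.07
w = np.array([-2., 1., 0.])                     # starting point w^(0)

def sigmoid(gamma):
    return 1.0 / (1.0 + np.exp(-gamma))

for t in range(2):                              # two iterations give w^(1), w^(2)
    s = sigmoid(X @ w)
    print(f"s^({t}) =", s)
    grad = 2.0 * lam * w - X.T @ (y - s)        # gradient of J at w
    hess = 2.0 * lam * np.eye(3) + X.T @ np.diag(s * (1.0 - s)) @ X  # Hessian
    w = w - np.linalg.solve(hess, grad)         # Newton step
    print(f"w^({t + 1}) =", w)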
2 $\ell_1$- and $\ell_2$-Regularization
Consider sample points $X_1, X_2, \ldots, X_n \in \mathbb{R}^d$ and associated values $y_1, y_2, \ldots, y_n \in \mathbb{R}$, an $n \times d$ design matrix $X = [X_1 \; \cdots \; X_n]^\top$, and an $n$-vector $y = [y_1 \; \cdots \; y_n]^\top$. For the sake of simplicity, assume that the sample data has been centered and whitened so that each feature has mean 0 and variance 1 and the features are uncorrelated; i.e., $X^\top X = nI$. For this question, we will not use a fictitious dimension nor a bias term; our linear regression function will output zero for $x = 0$.
Consider linear least-squares regression with regularization in the $\ell_1$-norm, also known as Lasso. The Lasso cost function is
$$J(w) = |Xw - y|^2 + \lambda \, \|w\|_1,$$
where $w \in \mathbb{R}^d$ and $\lambda > 0$ is the regularization parameter. Let $w^* = \arg\min_{w \in \mathbb{R}^d} J(w)$ denote the weights that minimize the cost function.
In the following steps, we will show that whitened training data decouples the features, so that $w^*_i$ is determined by the $i$th feature alone (i.e., column $i$ of the design matrix $X$), regardless of the other features. This is true for both Lasso and ridge regression.
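As a quick numerical sanity check of this whitening assumption (our own illustrative snippet, not part of the assignment): centering each feature and applying a Cholesky-based whitening transform makes $X^\top X = nI$ hold exactly.

import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
A = rng.normal(size=(n, d))
A -= A.mean(axis=0)                  # center each feature (mean 0)
L = np.linalg.cholesky(A.T @ A / n)  # factor the empirical second-moment matrix
X = A @ np.linalg.inv(L).T           # whiten: now X^T X = n I
print(np.allclose(X.T @ X, n * np.eye(d)))   # True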
1. We use the notation $X_{*i}$ to denote column $i$ of the design matrix $X$, which represents the $i$th feature. Write $J(w)$ in the following form for appropriate functions $g$ and $f$:
$$J(w) = g(y) + \sum_{i=1}^{d} f(X_{*i}, w_i, y, \lambda).$$
2. If $w^*_i > 0$, what is the value of $w^*_i$?
3. If $w^*_i < 0$, what is the value of $w^*_i$?
4. Considering parts 2 and 3, what is the condition for $w^*_i$ to be zero?
5. Now consider ridge regression, which uses the $\ell_2$ regularization term $\lambda \, |w|^2$. How does this change the function $f(\cdot)$ from part 1? What is the new condition in which $w^*_i = 0$? How does it differ from the condition you obtained in part 4?
3 Regression and Dual Solutions
1. For a vector $w$, derive $\nabla \, |w|^4$. Then derive $\nabla_w \, |Xw - y|^4$.
2. Consider sample points $X_1, X_2, \ldots, X_n \in \mathbb{R}^d$ and associated values $y_1, y_2, \ldots, y_n \in \mathbb{R}$, an $n \times d$ design matrix $X = [X_1 \; \cdots \; X_n]^\top$, an $n$-vector $y = [y_1 \; \cdots \; y_n]^\top$, and the regularized regression problem
$$w^* = \arg\min_{w \in \mathbb{R}^d} \; |Xw - y|^4 + \lambda \, |w|^2,$$
which is similar to ridge regression, but we take the fourth power of the error instead of the squared error. (It is not possible to write the optimal solution $w^*$ as the solution of a system of linear equations, but it can be found by gradient descent or Newton's method.)
Show that the optimum $w^*$ is unique. By setting the gradient of the objective function to zero, show that $w^*$ can be written as a linear combination $w^* = \sum_{i=1}^{n} a_i X_i$ for some scalars $a_1, \ldots, a_n$. Write the vector $a$ of dual coefficients in terms of $X$, $y$, $\lambda$, and the optimal solution $w^*$.
3. Consider the regularized regression problem
$$w^* = \arg\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} L(w^\top X_i, y_i) + \lambda \, |w|^2,$$
where the cost function $L$ is convex in its first argument. Prove that the optimal solution has the form $w^* = \sum_{i=1}^{n} a_i X_i$. If the cost function is not convex, does the optimal solution always have the form $w^* = \sum_{i=1}^{n} a_i X_i$? Justify your answer.
4 Wine Classification with Logistic Regression
The wine dataset given to you as part of the homework in data.mat consists of 6,497 data points,
each having 12 features. The description of these features is provided in data.mat. The dataset
includes a training set of 6,000 data points and a test set of 497 data points. Your classifier needs
to predict whether a wine is white (class label 0) or red (class label 1).
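A minimal loading sketch follows (our own convenience code; the key names below are guesses, so inspect the file to find the actual field names in data.mat):

import numpy as np
from scipy.io import loadmat

data = loadmat("data.mat")
print(data.keys())                         # check the actual field names first
# The key names below are assumptions, not given in the assignment.
X_train = data["training_data"]            # expected shape: (6000, 12)
y_train = data["training_labels"].ravel()  # 0 = white, 1 = red
X_test = data["test_data"]                 # expected shape: (497, 12)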
For this homework, you need to do the following.
1. Derive and write down the batch gradient descent update equation for logistic regression with $\ell_2$ regularization. (Not Newton's method, but the slower batch gradient descent.) A sketch of one possible implementation of parts 1–3 appears at the end of this problem.
Choose a reasonable regularization parameter value and a reasonable learning rate. Run your
algorithm and plot the cost function as a function of the number of iterations. (As this is
batch descent, one “iteration” should use every sample point once.)
2. Derive and write down the stochastic gradient descent update equation for logistic regression with $\ell_2$ regularization. Choose a suitable learning rate. Run your algorithm and plot the cost function as a function of the number of iterations, where now each “iteration” uses just one sample point.
Comment on the differences between the convergence of batch and stochastic gradient descent.
3. Instead of a constant learning rate $\epsilon$, repeat part 2 where the learning rate decreases as $\epsilon \propto 1/t$ for the $t$th iteration. Plot the cost function vs. the number of iterations. Is this strategy better than having a constant $\epsilon$?
4. Finally, train your classifier on the entire training set. Submit your results to Kaggle. Your classifier, when given the test points, should output a CSV file. (There is a sample one on Kaggle.) You'll upload this CSV file to Kaggle, where it'll be scored against both a public test set and a private test set. You will be able to see only your public score. You can submit only twice per day, so get started early! In your write-up for this problem, report your Kaggle username, the best score you achieved on Kaggle, and a short description of what you did to achieve that score.
IMPORTANT: Do NOT use any software package for logistic regression that you didn’t write
yourself!
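For reference, here is a sketch of one possible self-written implementation of parts 1–3, plus the CSV output for part 4 (our own illustrative code: the hyperparameter values, the $1/n$ scaling of the regularizer in the stochastic step, and the “Id,Category” CSV header are all assumptions; derive the update equations yourself and check the sample CSV on Kaggle for the expected format):

import numpy as np

def sigmoid(gamma):
    return 1.0 / (1.0 + np.exp(-gamma))

def cost(w, X, y, lam):
    # J(w) = lambda |w|^2 - sum_i [ y_i ln s_i + (1 - y_i) ln(1 - s_i) ]
    s = np.clip(sigmoid(X @ w), 1e-12, 1.0 - 1e-12)
    return lam * (w @ w) - np.sum(y * np.log(s) + (1.0 - y) * np.log(1.0 - s))

def batch_gd(X, y, lam=0.1, eta=1e-4, iters=2000):
    # Part 1: each iteration uses every sample point once (the full gradient).
    w = np.zeros(X.shape[1])
    history = []
    for _ in range(iters):
        s = sigmoid(X @ w)
        w -= eta * (2.0 * lam * w - X.T @ (y - s))   # full-batch gradient step
        history.append(cost(w, X, y, lam))
    return w, history

def sgd(X, y, lam=0.1, eta0=1e-2, iters=20000, decay=False):
    # Parts 2-3: each iteration uses one random sample point; decay=True gives
    # the learning rate proportional to 1/t schedule of part 3.
    n, d = X.shape
    w = np.zeros(d)
    history = []
    rng = np.random.default_rng(0)
    for t in range(1, iters + 1):
        eta = eta0 / t if decay else eta0
        i = rng.integers(n)
        s_i = sigmoid(X[i] @ w)
        # Single-sample gradient; regularizer scaled by 1/n (one convention).
        w -= eta * (2.0 * lam * w / n - (y[i] - s_i) * X[i])
        history.append(cost(w, X, y, lam))
    return w, history

def save_predictions(w, X_test, path="predictions.csv"):
    # Part 4: Kaggle-style CSV; verify the header against the sample file.
    labels = (sigmoid(X_test @ w) > 0.5).astype(int)
    with open(path, "w") as f:
        f.write("Id,Category\n")
        for i, label in enumerate(labels, start=1):
            f.write(f"{i},{label}\n")

The cost histories returned above can be plotted with matplotlib to produce the required cost-versus-iteration figures.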
5 Real World Spam Classification
Motivation: After taking CS 189 or CS 289A, students should be able to wrestle with “real-world”
data and problems. These issues might be deeply technical and require a theoretical background,
or might demand specific domain knowledge. Here is an example that a past TA encountered.
Daniel (a past CS 189 TA) interned as an anti-spam product manager for an email service provider. His company uses a linear SVM to predict whether an incoming message is spam or ham. He notices that the number of spam messages received tends to spike upward a few minutes before and after midnight. Eager to obtain a return offer, he adds the timestamp of the received message, stored as the number of milliseconds since the previous midnight, to each feature vector for the SVM to train on, in hopes that the ML model will identify the abnormal spike in spam volume at night. To his dismay, after testing with the new feature, Daniel discovers that the linear SVM's success rate barely improves.
Why can’t the linear SVM utilize the new feature well, and what can Daniel do to improve his
results? Daniel is unfortunately limited to only a quadratic kernel. This is an actual interview
question Daniel received for a machine learning engineering position!
Write a short explanation. This question is open-ended, and there can be many correct answers.