CSCE 633
Homework II
Submission Guidelines:
1. Put all the documents into one folder, name that folder firstnamelastname UIN, and compress it into a .zip file.
2. For coding problems, your submission should include a code file (either .py or .ipynb) and a pdf file (combined into one file with the other non-coding questions) reporting the results required in each question.
Problem 1: Least Absolute Deviation (15 points)
In class, we have assumed the following data generative model

y = f(x) + ϵ

where ϵ follows a standard Gaussian distribution, i.e., ϵ ∼ N(ϵ | 0, 1). Assume a linear model for f(x) = w⊤x. We now modify the data generative model by assuming that ϵ follows a Laplacian distribution whose probability density function is

p(ϵ) = (λ/2) exp(−λ|ϵ|)

where λ is a positive constant. For more about the Laplacian distribution, please check the following wiki page: http://en.wikipedia.org/wiki/Laplace_distribution.
Based on the above noise model for ϵ, derive the log-likelihood of the observed training data {(x1, y1), . . . , (xn, yn)} and the objective function for computing the solution w. Does the problem have a closed-form solution like Least Square Regression?
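The question itself only asks for the derivation, but if you want to check the resulting objective numerically, the sketch below (purely illustrative, on synthetic data, and not part of the required submission) minimizes the sum of absolute deviations with a generic optimizer and compares it to the least-squares fit.

import numpy as np
from scipy.optimize import minimize

# Synthetic data with Laplacian (double-exponential) noise -- illustrative only.
rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.laplace(scale=1.0, size=n)

# Least Absolute Deviation objective: sum_i |y_i - w^T x_i|
def lad_objective(w):
    return np.sum(np.abs(y - X @ w))

# No closed-form solution here, so use a generic numerical optimizer.
w_lad = minimize(lad_objective, x0=np.zeros(d), method="Nelder-Mead").x

# Least-squares solution (closed form) for comparison.
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print("LAD solution:", w_lad)
print("LS  solution:", w_ls)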
Problem 2: Regression with Ambiguous Data (30 points)
In the regression model we talked about in class, we assume that for each training data point xi, its output value yi is observed. However, in some situations we cannot measure the exact value of yi. Instead, we only have information about whether yi is larger or less than some value zi. More specifically, the training data is given as a triplet (xi, zi, bi), where
• xi is represented by a vector ϕ(xi) = (ϕ0(xi), . . . , ϕM−1(xi))⊤
• zi ∈ R is a scalar, and bi ∈ {0, 1} is a binary variable indicating whether the true output yi is larger than zi (bi = 1) or not (bi = 0)
Develop a regression model for the ambiguous training data (xi, zi, bi), i = 1, . . . , n.
Hint: Define a Gaussian noise model for y and derive a log-likelihood for the observed data.
You can derive the objective function using the error function given below (note that there is no
closed-form solution). The error function is defined as
erf(x) = (1/√π) ∫_{−x}^{x} e^{−t²} dt
It is known that

(1/√(2π)) ∫_{−∞}^{x} e^{−t²/2} dt = (1/2)(1 + erf(x/√2)),   and   (1/√(2π)) ∫_{x}^{∞} e^{−t²/2} dt = (1/2)(1 − erf(x/√2)).
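These Gaussian tail probabilities are available numerically through scipy.stats.norm, which can be handy for checking the objective you derive. The function below is a minimal sketch under the assumption of unit noise variance (σ = 1); the function and variable names are made up for illustration.

import numpy as np
from scipy.stats import norm

def ambiguous_log_likelihood(w, Phi, z, b, sigma=1.0):
    # Rows of Phi are phi(x_i)^T; y_i is modeled as N(w^T phi(x_i), sigma^2).
    mean = Phi @ w
    t = (z - mean) / sigma
    # b_i = 1: contribute log P(y_i > z_i) = norm.logsf(t_i)
    # b_i = 0: contribute log P(y_i <= z_i) = norm.logcdf(t_i)
    return np.sum(np.where(b == 1, norm.logsf(t), norm.logcdf(t)))

Maximizing this (for example, minimizing its negative with scipy.optimize.minimize) is consistent with the note above that there is no closed-form solution.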
Problem 3: Regularization Penalizes Large Magnitudes of Parameters (15 points)
In class, we have learned that when increasing the regularization parameter λ in the regularized
least square problem
min_w (1/2)∥Φw − y∥₂² + (λ/2)∥w∥₂²
where y = (y1, . . . , yn) ∈ Rⁿ and Φ⊤ = (ϕ(x1), . . . , ϕ(xn)) ∈ R^{M×n}, the magnitude of the optimal solution will decrease. Let the optimal solution w∗ be

w∗ = (λI + Φ⊤Φ)⁻¹ Φ⊤y
You are asked to show that the Euclidean norm of the optimal solution ∥w∗∥2 will decrease as λ
increases.
Hint: (1) Use the result from Problem 2 in Homework 1. (2) For any vector u ∈ Rᵈ, if V⊤V = I where V ∈ R^{d×d}, then ∥V u∥₂ = ∥u∥₂.
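Before proving the claim, you can sanity-check it numerically. The short sketch below (with a random design matrix, purely illustrative) evaluates ∥w∗∥₂ from the closed-form solution above at increasing values of λ.

import numpy as np

rng = np.random.default_rng(1)
n, M = 50, 10
Phi = rng.normal(size=(n, M))   # row i is phi(x_i)^T, so Phi^T = (phi(x_1), ..., phi(x_n))
y = rng.normal(size=n)

for lam in [0.0, 0.1, 1.0, 10.0, 100.0]:
    # w* = (lambda I + Phi^T Phi)^{-1} Phi^T y
    w_star = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ y)
    print(f"lambda = {lam:7.1f}   ||w*||_2 = {np.linalg.norm(w_star):.4f}")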
Problem 4: Ridge Regression and Lasso (40 points)
In this problem, you are asked to learn regression models using Ridge regression and Lasso. The dataset that we are going to use is E2006-tfidf [1].
The first column is the target output y, and the remaining columns are features in the form of (feature index:feature value). You can load the data with sklearn [2]. If we let x ∈ Rᵈ denote the feature vector, the prediction is given by w⊤x + w0, where w ∈ Rᵈ contains the coefficients for all features and w0 is an intercept term. Denoting X⊤ = (x1, ..., xn), the problem becomes
min_w (1/2)∥Xw + w0 − y∥₂² + (λ/2)∥w∥²,

which is the Lasso regression problem when the regularization term ∥w∥² = ∥w∥₁², and the Ridge regression problem when the regularization term ∥w∥² = ∥w∥₂².
You can use the Python sklearn library for Lasso [3] and Ridge regression [4].
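A minimal loading sketch follows; the file names E2006.train and E2006.test are assumptions (use the names of your downloaded copies). sklearn's load_svmlight_files reads the (feature index:feature value) format and keeps the feature dimension consistent across the two splits.

from sklearn.datasets import load_svmlight_files

# File names are assumptions; adjust to your local copies of the E2006-tfidf data.
X_train, y_train, X_test, y_test = load_svmlight_files(["E2006.train", "E2006.test"])
print(X_train.shape, X_test.shape)   # sparse CSR matrices of tf-idf features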
[1] https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html
[2] https://scikit-learn.org/stable/datasets/loading_other_datasets.html
[3] http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html
[4] https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html
(1) Solution of Ridge Regression and Lasso: Set the regularization parameter to λ = 0.1 and compute the optimal solution for Ridge regression and Lasso. Report the number of nonzero coefficients in the solution w for both Ridge regression and Lasso. You may observe that the solutions of Ridge regression and Lasso contain very different numbers of nonzero elements. What is the cause of that? What can this imply? Justify your observation and hypothesis. (Note: If you use the sklearn Lasso, the value of alpha should be set to λ/n, where n is the number of training examples; in the sklearn Ridge, set alpha to λ. The same applies to the following questions. A code sketch after part (3) illustrates this setup.)
(2) Training and testing error with different values of λ: (i) For each value of λ in [0, 1e-5, 1e-3, 1e-2, 0.1, 1, 10, 100, 1e3, 1e4, 1e5, 1e6], run Ridge regression and Lasso on the training data to obtain a model w, and then compute the root mean square error (RMSE [5]) of the obtained model on both the training and the testing data. (ii) Plot the error curves for the root mean square error on both the training data and the testing data vs. the different values of λ. You need to show the curves, discuss your observations of the error curves, and report the best value of λ and the corresponding testing error. (iii) Plot the curve of the number of nonzero elements in the solution w vs. the different values of λ. Discuss your observations. (iv) Plot the curve of ∥w∥₂² vs. the different values of λ. Discuss your observations. (One way to run this sweep is shown in the sketch after part (3).)
(3) Cross-validation: Use the given training data and follow the 5-fold cross-validation procedure to select the best value of λ for both Ridge regression and Lasso. Then train the model on the whole training data using the selected λ, and compute the root mean square error on the testing data. Report the best λ and the testing error for both Ridge regression and Lasso. (A cross-validation sketch appears at the end of this problem.)
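The sketch below is one possible way to set up parts (1) and (2); it assumes the X_train, y_train, X_test, y_test variables from the loading sketch earlier, and shows Ridge in the sweep (Lasso is analogous under the alpha = λ/n convention).

import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error

n_train = X_train.shape[0]

# Part (1): lambda = 0.1, with the alpha conventions from the note above.
lam = 0.1
lasso = Lasso(alpha=lam / n_train).fit(X_train, y_train)
ridge = Ridge(alpha=lam).fit(X_train, y_train)
print("nonzeros (Lasso):", np.count_nonzero(lasso.coef_))
print("nonzeros (Ridge):", np.count_nonzero(ridge.coef_))

# Part (2): sweep lambda and record RMSE, sparsity, and squared norm of w.
# (sklearn discourages Lasso with alpha = 0; you may need to treat lambda = 0
#  as plain least squares for Lasso.)
for lam in [0, 1e-5, 1e-3, 1e-2, 0.1, 1, 10, 100, 1e3, 1e4, 1e5, 1e6]:
    model = Ridge(alpha=lam).fit(X_train, y_train)   # repeat with Lasso(alpha=lam / n_train)
    rmse_train = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
    rmse_test = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    print(lam, rmse_train, rmse_test, np.count_nonzero(model.coef_), float(model.coef_ @ model.coef_))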
[5] For a set of examples (xi, yi), i = 1, . . . , n, the root mean square error of a prediction function f(·) is computed by RMSE = √( Σ_{i=1}^{n} (f(xi) − yi)² / n ).
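For part (3), a minimal 5-fold cross-validation sketch follows (again assuming the variables from the loading sketch above; shown for Ridge, with Lasso analogous under the alpha = λ/n convention, where n is the number of training examples in each fold). The candidate λ grid is an assumption.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

lambdas = [1e-5, 1e-3, 1e-2, 0.1, 1, 10, 100, 1e3, 1e4]   # candidate grid (an assumption)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

cv_rmse = []
for lam in lambdas:
    fold_rmse = []
    for tr_idx, va_idx in kf.split(X_train):
        model = Ridge(alpha=lam).fit(X_train[tr_idx], y_train[tr_idx])
        pred = model.predict(X_train[va_idx])
        fold_rmse.append(np.sqrt(mean_squared_error(y_train[va_idx], pred)))
    cv_rmse.append(np.mean(fold_rmse))

# Refit on the whole training set with the selected lambda and report the test RMSE.
best_lam = lambdas[int(np.argmin(cv_rmse))]
final = Ridge(alpha=best_lam).fit(X_train, y_train)
test_rmse = np.sqrt(mean_squared_error(y_test, final.predict(X_test)))
print("best lambda:", best_lam, "  test RMSE:", test_rmse)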