CS/ECE/ME 532
Homework 7: regularization
1. Alternative regularization formulas. This problem is about two alternative ways of solving the
L2-regularized least squares problem.
a) Prove that for any λ > 0, the following matrix identity holds:

(AᵀA + λI)⁻¹Aᵀ = Aᵀ(AAᵀ + λI)⁻¹
Hint: Start by considering the expression ATAAT + λAT and factor it in two different ways
(from the right or from the left).
b) The identity proved in part a) shows that there are actually two equivalent formulas for the
solution to the L2-regularized least squares problem. Generate random matrices A ∈ R^(8000×100)
and b ∈ R^8000, and find x that minimizes ‖Ax − b‖² + λ‖x‖² in two different ways. Use λ = 0.01.
Which formula computes more rapidly? Why? Note: you can use Matlab’s tic and toc
functions to measure execution time.
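As a sketch of the timing comparison, here is a NumPy version (Python rather than Matlab, and with a smaller matrix than the 8000×100 in the problem so it runs quickly; the dimensions are illustrative only):

```python
import time
import numpy as np

rng = np.random.default_rng(0)
m, n, lam = 2000, 100, 0.01          # smaller than 8000 x 100, just for illustration
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# Formula 1: solve the n x n system (A^T A + lam I) x = A^T b
t0 = time.perf_counter()
x1 = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)
t1 = time.perf_counter() - t0

# Formula 2: solve the m x m system (A A^T + lam I) y = b, then x = A^T y
t0 = time.perf_counter()
x2 = A.T @ np.linalg.solve(A @ A.T + lam * np.eye(m), b)
t2 = time.perf_counter() - t0

# Both formulas give the same minimizer (part a); the n x n solve is
# far cheaper here because n << m.
print(np.allclose(x1, x2), t1, t2)
```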
2. Getting a solver up and running. Your task for this problem is to get a solver up and running
and to solve a simple problem with it.
• If you’re using Matlab, install CVX: http://cvxr.com/cvx/
• If you’re using Python, install CVXPY: http://www.cvxpy.org/en/latest/
• If you’re using Julia, install Convex: https://github.com/JuliaOpt/Convex.jl
Once you are up and running, write code to solve the regularized least squares problem:
minimize over x, y:

‖ [1 0; 2 1; 1 3] [x; y] − [4; 5; 7] ‖² + 0.1 ‖ [x; y] ‖²
Also solve this problem analytically and verify that you get the same answer.
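For the analytic check, the minimizer of ‖Az − b‖² + 0.1‖z‖² is z = (AᵀA + 0.1I)⁻¹Aᵀb. A minimal NumPy sketch of that closed form (the CVX/CVXPY/Convex.jl solve itself is left to whichever solver you installed):

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [2.0, 1.0],
              [1.0, 3.0]])
b = np.array([4.0, 5.0, 7.0])
lam = 0.1

# Closed-form ridge solution: z = (A^T A + lam I)^{-1} A^T b
z = np.linalg.solve(A.T @ A + lam * np.eye(2), A.T @ b)

# Optimality check: the gradient 2 A^T (A z - b) + 2 lam z should vanish
grad = 2 * A.T @ (A @ z - b) + 2 * lam * z
print(z, np.linalg.norm(grad))
```

Comparing this z against the solver's output verifies both your model code and the analytic derivation.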
3. Sparse classification. For this problem, we’ll perform a more thorough analysis using the breast
cancer dataset shown in class, bcdata.csv. The goal is to solve the Lasso problem
minimize over x ∈ R^n:  ‖Ax − b‖² + λ‖x‖₁
Here x is the weight vector applied to the expression levels of n = 8141 genes and b is the vector of
labels (+1 and −1) for each of the 295 patients. In bcdata.csv, the first column corresponds
to b and the rest of the columns are the A matrix. In this problem we will vary λ to explore the
trade-off between data-fitting and sparsity, and we will compare the Lasso to Ridge Regression.
a) For each λ, find the optimal weights using only the first 100 patients (first 100 rows). Plot
a trade-off curve with the squared residual ‖Ax − b‖² on the vertical axis and ‖x‖₁ on the
horizontal axis. Your plot should contain the residual from the training data. Explain what you
see. Note: you'll have to play around with how you choose λ to see the whole trade-off curve.
When using all the columns of A, your code may take up to 30 seconds to run for each λ, so
make sure your code is working properly (test it on a smaller subset of the columns first!)
before you run it on the full dataset.
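If you prefer not to call a CVX-style solver once per λ, the Lasso can also be solved with proximal gradient descent (ISTA). The sketch below uses small synthetic data as a stand-in for bcdata.csv (which is not reproduced here), and sweeps λ to trace out the trade-off curve:

```python
import numpy as np

def soft_threshold(v, t):
    """Entrywise soft-thresholding: the prox operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(A, b, lam, n_iter=500):
    """Minimize ||Ax - b||^2 + lam * ||x||_1 by proximal gradient (ISTA)."""
    step = 1.0 / (2 * np.linalg.norm(A, 2) ** 2)  # 1/L for the gradient of ||Ax-b||^2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x - step * 2 * A.T @ (A @ x - b), step * lam)
    return x

# Synthetic stand-in for the gene-expression data
rng = np.random.default_rng(1)
A = rng.standard_normal((100, 50))
b = np.sign(rng.standard_normal(100))   # +/-1 labels

# Sweep lambda: larger lam shrinks ||x||_1 and grows the residual
lams = [0.1, 1.0, 10.0, 100.0]
residuals, l1_norms = [], []
for lam in lams:
    x = lasso_ista(A, b, lam)
    residuals.append(np.sum((A @ x - b) ** 2))
    l1_norms.append(np.sum(np.abs(x)))
    print(lam, residuals[-1], l1_norms[-1])
```

Plotting `residuals` against `l1_norms` gives the trade-off curve requested in part a).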
b) With your solutions from part a), produce a new trade-off curve. This time, plot the error rate
on the vertical axis versus the sparsity on the horizontal axis. Here, the error rate is the number
of incorrect predictions divided by the total number of predictions. The sparsity is the number
of nonzero entries in x. For this purpose, we'll say an entry xᵢ is nonzero if |xᵢ| > 10⁻⁵. Again,
the vertical axis should contain the error rate from the training data.
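The error-rate and sparsity computations for this part can be sketched as follows (the matrix, labels, and weight vector here are made-up toy values, not a Lasso solution):

```python
import numpy as np

def error_rate(A, b, x):
    """Fraction of sign predictions that disagree with the +/-1 labels."""
    predictions = np.sign(A @ x)
    return np.mean(predictions != b)

def sparsity(x, tol=1e-5):
    """Number of entries of x treated as nonzero (|x_i| > tol)."""
    return int(np.sum(np.abs(x) > tol))

# Toy illustration (not the bcdata.csv values)
A = np.array([[1.0, -1.0], [2.0, 0.5], [-1.0, 1.0]])
b = np.array([1.0, 1.0, -1.0])
x = np.array([0.5, 1e-8])   # second entry counts as zero under the tolerance

print(error_rate(A, b, x), sparsity(x))
```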
c) Repeat parts a) and b). This time, for the vertical axis use the residual (respectively the error
rate) from the validation data. For validation, use the remaining rows of the data matrix (the
ones not used for training). Again, explain what you see in both trade-off plots.
d) Compare the performance of the Lasso and Ridge Regression in this application. Do this via
the following steps.
(i) Split the 295 patients into 10 subsets of size 29-30. We will use these subsets for training,
tuning, and testing.
(ii) Pick 8 of the subsets (about 240 patients) for training. Compute the solution x̂ for both the
Lasso and Ridge Regression problems, each for a variety of λ values.
(iii) Out of the two remaining subsets, pick one for tuning. Evaluate your classifiers x̂ on the
tuning data, and select the classifier with the smallest prediction error (one for Lasso and
one for Ridge Regression).
(iv) Finally, evaluate your best Lasso and Ridge Regression classifiers on the remaining subset
of patients. How do they compare?
To get a better sense of the performance, you can repeat this entire process by taking different
subsets of 8, 1, and 1 for training, tuning, and testing and compute the average squared errors
and average number of misclassifications.
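The splitting in step (i) can be sketched as follows (the shuffling and the choice of which folds play which role are one reasonable convention, not prescribed by the problem):

```python
import numpy as np

n_patients, n_folds = 295, 10
rng = np.random.default_rng(0)
perm = rng.permutation(n_patients)        # shuffle patient indices
folds = np.array_split(perm, n_folds)     # 10 subsets of size 29-30

# One train/tune/test assignment; rotating which folds play which role
# gives the repeated runs suggested at the end of the problem.
train_idx = np.concatenate(folds[:8])     # 8 folds for training
tune_idx = folds[8]                       # 1 fold for tuning lambda
test_idx = folds[9]                       # 1 fold for final testing

print(len(train_idx), len(tune_idx), len(test_idx))
```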