CPSC 340 Assignment 4 (due Nov 11 at 11:55pm)
Instructions
Rubric: {mechanics:5}
IMPORTANT!!! Before proceeding, please carefully read the general homework instructions
at https://www.cs.ubc.ca/~fwood/CS340/homework/. The above 5 points are for following the submission
instructions perfectly.
We use blue to highlight the deliverables that you must answer/do/submit with the assignment.
1 Convex Functions
Rubric: {reasoning:5}
Recall that convex loss functions are typically easier to minimize than non-convex functions, so it’s important
to be able to identify whether a function is convex.
Show that the following functions are convex:
1. f(w) = αw² − βw + γ with w ∈ R, α ≥ 0, β ∈ R, γ ∈ R (1D quadratic).
2. f(w) = −log(αw) with α > 0 and w > 0 ("negative logarithm").
3. f(w) = ‖Xw − y‖₁ + (λ/2)‖w‖₁ with w ∈ Rᵈ, λ ≥ 0 (L1-regularized robust regression).
4. f(w) = ∑_{i=1}^n log(1 + exp(−y_i wᵀx_i)) with w ∈ Rᵈ (logistic regression).
5. f(w) = ∑_{i=1}^n max{0, |wᵀx_i − y_i| − ε} + (λ/2)‖w‖₂² with w ∈ Rᵈ, ε ≥ 0, λ ≥ 0 (support vector regression).
General hint: for the first two you can check that the second derivative is non-negative, since they are one-dimensional. For the last three you'll have to use some of the results regarding how combining convex functions can yield convex functions, which can be found in the lecture slides.
Hint for part 4 (logistic regression): this function may seem non-convex since it contains log(z) and log is
concave, but there is a flaw in that reasoning: for example log(exp(z)) = z is convex despite containing a
log. To show convexity, you can reduce the problem to showing that log(1 + exp(z)) is convex, which can
be done by computing the second derivative. It may simplify matters to note that exp(z)/(1 + exp(z)) = 1/(1 + exp(−z)).
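
As a quick numerical sanity check of this hint (not a substitute for the proof), one can approximate the second derivative of log(1 + exp(z)) on a grid with centered finite differences and confirm it never goes negative; the grid range below is an arbitrary choice:

    import numpy as np

    # Finite-difference check that log(1 + exp(z)) has a non-negative
    # second derivative on a grid (sanity check only, not a proof).
    z = np.linspace(-10, 10, 2001)
    h = z[1] - z[0]
    f = np.log1p(np.exp(z))
    second = (f[2:] - 2 * f[1:-1] + f[:-2]) / h**2  # central differences
    print(second.min())  # should be >= 0 up to rounding error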
2 Logistic Regression with Sparse Regularization
If you run python main.py -q 2, it will:
1. Load a binary classification dataset containing a training and a validation set.
2. 'Standardize' the columns of X and add a bias variable (in utils.load_dataset).
3. Apply the same transformation to Xvalidate (in utils.load_dataset).
4. Fit a logistic regression model.
5. Report the number of features selected by the model (number of non-zero regression weights).
6. Report the error on the validation set.
Logistic regression does reasonably well on this dataset, but it uses all the features (even though only the
prime-numbered features are relevant) and the validation error is above the minimum achievable for this
model (which is 1 percent, if you have enough data and know which features are relevant). In this question,
you will modify this demo to use different forms of regularization to improve on these aspects.
Note: your results may vary a bit depending on versions of Python and its libraries.
2.1 L2-Regularization
Rubric: {code:2}
Make a new class, logRegL2, that takes an input parameter λ and fits a logistic regression model with
L2-regularization. Specifically, while logReg computes w by minimizing
f(w) = ∑_{i=1}^n log(1 + exp(−y_i wᵀx_i)),
your new function logRegL2 should compute w by minimizing
f(w) = ∑_{i=1}^n log(1 + exp(−y_i wᵀx_i)) + (λ/2)‖w‖².
Hand in your updated code. Using this new code with λ = 1, report how the following quantities change:
the training error, the validation error, the number of features used, and the number of gradient descent
iterations.
Note: as you may have noticed, lambda is a special keyword in Python and therefore we can’t use it as a
variable name. As an alternative we humbly suggest lammy, which is what Mike’s niece calls her stuffed
animal toy lamb. However, you are free to deviate from this suggestion. In fact, as of Python 3 one can use actual Greek letters as variable names, like the λ symbol; but, depending on your text editor, it may be annoying to type.
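
For orientation, here is a minimal sketch of what the funObj for logRegL2 might look like, assuming the course code's convention that funObj(w, X, y) returns the function value and gradient and that labels are ±1; the class and attribute names are illustrative, and the fit method (which would call the provided gradient-descent routine) is omitted:

    import numpy as np

    class logRegL2:
        # Sketch only: assumes funObj(w, X, y) -> (f, g) as in the course code.
        def __init__(self, lammy=1.0, maxEvals=400):
            self.lammy = lammy
            self.maxEvals = maxEvals

        def funObj(self, w, X, y):
            yXw = y * (X @ w)
            # f(w) = sum_i log(1 + exp(-y_i w^T x_i)) + (lambda/2) ||w||^2
            f = np.sum(np.log(1.0 + np.exp(-yXw))) + 0.5 * self.lammy * (w @ w)
            # gradient of the logistic term, plus lambda * w for the L2 term
            g = X.T @ (-y / (1.0 + np.exp(yXw))) + self.lammy * w
            return f, g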
2.2 L1-Regularization
Rubric: {code:3}
Make a new class, logRegL1, that takes an input parameter λ and fits a logistic regression model with
L1-regularization,
f(w) = ∑_{i=1}^n log(1 + exp(−y_i wᵀx_i)) + λ‖w‖₁.
Hand in your updated code. Using this new code with λ = 1, report how the following quantities change:
the training error, the validation error, the number of features used, and the number of gradient descent
iterations.
You should use the function minimizers.findMinL1, which implements a proximal-gradient method to minimize the sum of a differentiable function g and λ‖w‖₁,
f(w) = g(w) + λ‖w‖₁.
This function has a similar interface to findMin, EXCEPT that (a) you only pass in the function/gradient
of the differentiable part, g, rather than the whole function f; and (b) you need to provide the value λ. To
reiterate, your funObj should not contain the L1 regularization term; rather it should only implement
the function value and gradient for the training error term. The reason is that the optimizer handles the
non-smooth L1 regularization term in a specialized way (beyond the scope of CPSC 340).
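
To make the division of labour concrete, a sketch of the two pieces might look as follows; the exact argument order of findMinL1 is an assumption here, so check minimizers.py for the real signature:

    import numpy as np
    import minimizers

    class logRegL1:
        def __init__(self, L1_lambda=1.0, maxEvals=400):
            self.L1_lambda = L1_lambda
            self.maxEvals = maxEvals

        def funObj(self, w, X, y):
            # Smooth part only: the L1 term is handled inside findMinL1.
            yXw = y * (X @ w)
            f = np.sum(np.log(1.0 + np.exp(-yXw)))
            g = X.T @ (-y / (1.0 + np.exp(yXw)))
            return f, g

        def fit(self, X, y):
            d = X.shape[1]
            # Assumed signature: findMinL1(funObj, w0, L1_lambda, maxEvals, *args)
            self.w, _ = minimizers.findMinL1(self.funObj, np.zeros(d),
                                             self.L1_lambda, self.maxEvals, X, y)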
2.3 L0-Regularization
Rubric: {code:4}
The class logRegL0 contains part of the code needed to implement the forward selection algorithm, which
approximates the solution with L0-regularization,
f(w) = ∑_{i=1}^n log(1 + exp(−y_i wᵀx_i)) + λ‖w‖₀.
The for loop in this function is missing the part where we fit the model using the subset selected_new, then compute the score and update the minLoss/bestFeature variables. Modify the for loop in this code so that it fits the model using only the features in selected_new, computes the score above using these features, and updates the minLoss/bestFeature variables. Hand in your updated code. Using this new code with λ = 1, report the
training error, validation error, and number of features selected.
Note that the code differs a bit from what we discussed in class, since we assume that the first feature is the
bias variable and assume that the bias variable is always included. Also, note that for this particular case
using the L0-norm with λ = 1 is equivalent to what is known as the Akaike Information Criterion (AIC) for
variable selection.
Also note that, for numerical reasons, your answers may vary depending on exactly what system and package
versions you are using. That is fine.
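
As a rough guide, the loop body might look like the sketch below; fit_logistic is a hypothetical helper (wrapping funObj/findMin on a column subset), and the surrounding class code will differ in your copy:

    import numpy as np

    # Sketch of the missing loop body in logRegL0.fit; 'selected' is the set
    # of feature indices chosen so far, as in the provided code.
    minLoss, bestFeature = np.inf, None
    for i in range(d):
        if i in selected:
            continue
        selected_new = selected | {i}  # tentatively add feature i
        cols = sorted(selected_new)

        # Refit the logistic weights on the selected columns only
        # (fit_logistic is hypothetical: returns the weights and training loss).
        w_sub, f_sub = fit_logistic(X[:, cols], y)

        # Score = training loss + lambda * number of selected features
        loss = f_sub + L0_lambda * len(selected_new)
        if loss < minLoss:
            minLoss, bestFeature = loss, i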
2.4 Discussion
Rubric: {reasoning:2}
In a short paragraph, briefly discuss your results from the above. How do the different forms of regularization
compare with each other? Can you provide some intuition for your results? No need to write a long essay,
please!
2.5 Comparison with scikit-learn
Rubric: {reasoning:1}
Compare your results (training error, validation error, number of nonzero weights) for L2 and L1 regularization with scikit-learn’s LogisticRegression. Use the penalty parameter to specify the type of regularization.
The parameter C corresponds to 1/λ, so if you had λ = 1 then use C=1 (which happens to be the default anyway). You should set fit_intercept to False since we've already added the column of ones to X and thus there's no need to explicitly fit an intercept parameter. After you've trained the model, you can access the weights with model.coef_.
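
For instance (assuming X, y, Xvalid, yvalid are already loaded as in the demo; the variable names are just for illustration):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # L2-regularized logistic regression; C = 1/lambda, so lambda = 1 -> C = 1.
    l2 = LogisticRegression(penalty='l2', C=1.0, fit_intercept=False)
    l2.fit(X, y)

    # L1 needs a solver that supports it, e.g. liblinear.
    l1 = LogisticRegression(penalty='l1', C=1.0, fit_intercept=False,
                            solver='liblinear')
    l1.fit(X, y)

    for name, model in [('L2', l2), ('L1', l1)]:
        print(name,
              np.mean(model.predict(X) != y),            # training error
              np.mean(model.predict(Xvalid) != yvalid),  # validation error
              np.sum(model.coef_ != 0))                  # nonzero weights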
2.6 L1/2 Regularization
Rubric: {reasoning:4}
Previously we've considered L2 and L1 regularization, which use the L2 and L1 norms respectively. Now consider least squares linear regression with "L1/2 regularization" (in quotation marks because the "L1/2 norm" is not a true norm):
f(w) = (1/2) ∑_{i=1}^n (wᵀx_i − y_i)² + λ ∑_{j=1}^d |w_j|^{1/2}.
Let’s consider the case of d = 1 and assume there is no intercept term being used, so the loss simplifies to
f(w) = (1/2) ∑_{i=1}^n (wx_i − y_i)² + λ√|w|.
Finally, let’s assume n = 2 where our 2 data points are (x1, y1) = (1, 2) and (x2, y2) = (0, 1).
1. Plug in the data set values and write the loss in its simplified form, without a summation.
2. If λ = 0, what is the solution, i.e., arg min_w f(w)?
3. If λ → ∞, what is the solution, i.e., arg min_w f(w)?
4. Plot f(w) when λ = 1. What is arg min_w f(w) when λ = 1? Answer to one decimal place if appropriate.
5. Plot f(w) when λ = 10. What is arg min_w f(w) when λ = 10? Answer to one decimal place if appropriate. (A plotting sketch follows this list.)
6. Does L1/2 regularization behave more like L1 regularization or L2 regularization when it comes to performing feature selection? Briefly justify your answer.
7. Is least squares with L1/2 regularization a convex optimization problem? Briefly justify your answer.
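
A minimal way to produce the plots for parts 4 and 5 is to evaluate f on a dense grid; the grid range below is an arbitrary choice:

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.array([1.0, 0.0])   # x1, x2
    y = np.array([2.0, 1.0])   # y1, y2
    w = np.linspace(-1.0, 3.0, 2001)

    for lam in (1.0, 10.0):
        # f(w) = (1/2) sum_i (w*x_i - y_i)^2 + lam * sqrt(|w|)
        f = 0.5 * np.sum((np.outer(w, x) - y) ** 2, axis=1) \
            + lam * np.sqrt(np.abs(w))
        plt.plot(w, f, label="lambda = %g" % lam)
        print(lam, round(w[np.argmin(f)], 1))  # grid approximation to arg min
    plt.xlabel("w")
    plt.ylabel("f(w)")
    plt.legend()
    plt.show()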
3 Multi-Class Logistic
If you run python main.py -q 3 the code loads a multi-class classification dataset with yi ∈ {0, 1, 2, 3, 4}
and fits a ‘one-vs-all’ classification model using least squares, then reports the validation error and shows
a plot of the data/classifier. The performance on the validation set is ok, but could be much better. For
example, this classifier never even predicts that examples will be in classes 0 or 4.
3.1 Softmax Classification, toy example
Rubric: {reasoning:2}
Linear classifiers make their decisions by finding the class label c maximizing the quantity w_cᵀx_i, so we want to train the model to make w_{y_i}ᵀx_i larger than w_{c′}ᵀx_i for all the classes c′ that are not y_i. Here c′ is a possible label and w_{c′} is row c′ of W. Similarly, y_i is the training label, w_{y_i} is row y_i of W, and in this setting we are assuming a discrete label y_i ∈ {1, 2, . . . , k}. Before we move on to implementing the softmax classifier to fix the issues raised in the introduction, let's work through a toy example:
Consider the dataset below, which has n = 10 training examples, d = 2 features, and k = 3 classes:
X =
[0 1]
[1 0]
[1 0]
[1 1]
[1 1]
[0 0]
[1 0]
[1 0]
[1 1]
[1 0]

y = [1 1 1 2 2 2 2 3 3 3]ᵀ.
Suppose that you want to classify the following test example:
x̂ = [1 1].
Suppose we fit a multi-class linear classifier using the softmax loss, and we obtain the following weight
matrix:
W =
[+2 −1]
[+2 −2]
[+3 −1]
Under this model, what class label would we assign to the test example? (Show your work.)
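
After working it out by hand, you can sanity-check the arithmetic with a couple of lines of numpy (the variable names are just for illustration):

    import numpy as np

    # Sanity check of the by-hand computation: score each class as w_c^T x_hat.
    W = np.array([[2, -1],
                  [2, -2],
                  [3, -1]])
    x_hat = np.array([1, 1])
    scores = W @ x_hat
    print(scores, np.argmax(scores) + 1)  # +1 because labels here are 1, 2, 3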
3.2 One-vs-all Logistic Regression
Rubric: {code:2}
Using the squared error on this problem hurts performance because it has ‘bad errors’ (the model gets
penalized if it classifies examples ‘too correctly’). Write a new class, logLinearClassifier, that replaces the
squared loss in the one-vs-all model with the logistic loss. Hand in the code and report the validation error.
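
A sketch of the fit method, assuming the provided least-squares one-vs-all code and the assignment's logReg class as building blocks (names and the maxEvals value are illustrative):

    import numpy as np

    class logLinearClassifier:
        def fit(self, X, y):
            n, d = X.shape
            k = np.unique(y).size
            self.W = np.zeros((k, d))
            for c in range(k):
                y_bin = np.where(y == c, 1, -1)  # +1 for class c, -1 otherwise
                model = logReg(maxEvals=500)     # the assignment's logistic class
                model.fit(X, y_bin)
                self.W[c] = model.w

        def predict(self, X):
            return np.argmax(X @ self.W.T, axis=1)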
3.3 Softmax Classifier Gradient
Rubric: {reasoning:5}
Using a one-vs-all classifier can hurt performance because the classifiers are fit independently, so there is
no attempt to calibrate the columns of the matrix W. As we discussed in lecture, an alternative to this
independent model is to use the softmax loss, which is given by
f(W) = ∑_{i=1}^n [ −w_{y_i}ᵀx_i + log( ∑_{c′=1}^k exp(w_{c′}ᵀx_i) ) ].
Show that the partial derivatives of this function, which make up its gradient, are given by the following
expression:
∂f/∂W_{cj} = ∑_{i=1}^n x_{ij} [ p(y_i = c | W, x_i) − I(y_i = c) ],
where...
• I(y_i = c) is the indicator function (it is 1 when y_i = c and 0 otherwise)
• p(y_i = c | W, x_i) is the predicted probability of example i being class c, defined as
p(y_i = c | W, x_i) = exp(w_cᵀx_i) / ∑_{c′=1}^k exp(w_{c′}ᵀx_i).
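
The claimed formula is easy to spot-check numerically before (or after) you derive it; the snippet below compares it against centered finite differences on random data (all sizes are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, k = 6, 3, 4
    X = rng.normal(size=(n, d))
    y = rng.integers(k, size=n)

    def f(W_flat):
        W = W_flat.reshape(k, d)
        S = X @ W.T  # n x k scores, S[i, c] = w_c^T x_i
        return np.sum(-S[np.arange(n), y] + np.log(np.sum(np.exp(S), axis=1)))

    def grad(W_flat):
        # The formula from this section: sum_i x_ij [p(y_i = c) - I(y_i = c)]
        W = W_flat.reshape(k, d)
        S = X @ W.T
        P = np.exp(S) / np.sum(np.exp(S), axis=1, keepdims=True)
        P[np.arange(n), y] -= 1.0
        return (P.T @ X).flatten()

    W0 = rng.normal(size=k * d)
    eps = 1e-6
    numeric = np.array([(f(W0 + eps * e) - f(W0 - eps * e)) / (2 * eps)
                        for e in np.eye(k * d)])
    print(np.max(np.abs(numeric - grad(W0))))  # should be tiny, ~1e-9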
3.4 Softmax Classifier Implementation
Rubric: {code:5}
Make a new class, softmaxClassifier, which fits W using the softmax loss from the previous section instead
of fitting k independent classifiers. Hand in the code and report the validation error.
Hint: you may want to use utils.check_gradient to check that your implementation of the gradient is
correct.
Hint: with softmax classification, our parameters live in a matrix W instead of a vector w. However, most
optimization routines like scipy.optimize.minimize, or the optimization code we provide to you, are set up
to optimize with respect to a vector of parameters. The standard approach is to “flatten” the matrix W into
a vector (of length kd, in this case) before passing it into the optimizer. On the other hand, it’s inconvenient
to work with the flattened form everywhere in the code; intuitively, we think of it as a matrix W and our
code will be more readable if the data structure reflects our thinking. Thus, the approach we recommend is
to reshape the parameters back and forth as needed. The funObj function is directly communicating with
the optimization code and thus will need to take in a vector. At the top of funObj you can immediately
reshape the incoming vector of parameters into a k × d matrix using np.reshape. You can then compute
the gradient using sane, readable code with the W matrix inside funObj. You’ll end up with a gradient
that’s also a matrix: one partial derivative per element of W. Right at the end of funObj, you can flatten
this gradient matrix into a vector using grad.flatten(). If you do this, the optimizer will be sending in a
vector of parameters to funObj, and receiving a gradient vector back out, which is the interface it wants –
and your funObj code will be much more readable, too. You may need to do a bit more reshaping elsewhere,
but this is the key piece.
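
Here is the reshape/flatten pattern in isolation, with a deliberately trivial placeholder objective (not the softmax loss) so that only the plumbing is shown:

    import numpy as np

    def funObj(w_flat, X, y, k):
        n, d = X.shape
        W = w_flat.reshape(k, d)  # unflatten: work with a k x d matrix inside
        f = 0.5 * np.sum(W ** 2)  # placeholder objective, NOT the softmax loss
        G = W                     # its gradient, also a k x d matrix
        return f, G.flatten()     # flatten the gradient for the optimizer

    # The optimizer then sees only flat vectors, e.g. with the course's findMin:
    # w_opt, f_opt = findMin(funObj, np.zeros(k * d), maxEvals, X, y, k)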
3.5 Comparison with scikit-learn, again
Rubric: {reasoning:1}
Compare your results (training error and validation error for both one-vs-all and softmax) with scikit-learn’s
LogisticRegression, which can also handle multi-class problems. One-vs-all is the default; for softmax, set multi_class='multinomial'. For the softmax case, you'll also need to change the solver. You can use solver='lbfgs'. Since your comparison code above isn't using regularization, set C very large to effectively
disable regularization. Again, set fit_intercept to False for the same reason as above (there is already a
column of 1’s added to the data set).
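
A possible comparison script, again assuming X, y, Xvalid, yvalid from the demo (names illustrative; multi_class is accepted by the scikit-learn versions this course has used, though newer releases deprecate it):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    big_C = 1e8  # effectively disables regularization

    ova = LogisticRegression(fit_intercept=False, C=big_C)
    smx = LogisticRegression(fit_intercept=False, C=big_C,
                             multi_class='multinomial', solver='lbfgs')

    for name, model in [('one-vs-all', ova), ('softmax', smx)]:
        model.fit(X, y)
        print(name,
              np.mean(model.predict(X) != y),            # training error
              np.mean(model.predict(Xvalid) != yvalid))  # validation error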
3.6 Cost of Multinomial Logistic Regression
Rubric: {reasoning:2}
Assume that we have
• n training examples.
• d features.
• k classes.
• t testing examples.
• T iterations of gradient descent for training.
Also assume that we take X and form new features Z using Gaussian RBFs as a non-linear feature transformation.
1. In O() notation, what is the cost of training the softmax classifier with gradient descent?
2. What is the cost of classifying the t test examples?
Hint: you'll need to take into account the cost of forming the basis at training (Z) and test (Z̃) time. It will be helpful to think of the dimensions of all the various matrices.
4 Very-Short Answer Questions
Rubric: {reasoning:12}
1. Suppose that a client wants you to identify the set of “relevant” factors that help prediction. Why
shouldn’t you promise them that you can do this?
2. Consider performing feature selection by measuring the "mutual information" between each column of X and the target label y, and selecting the features whose mutual information is above a certain threshold (meaning that the feature provides a sufficient number of "bits" that help in predicting the label values). Without delving into any details about mutual information, what is a potential problem with this approach?
3. What is a setting where you would use the L1-loss, and what is a setting where you would use L1-
regularization?
4. Among L0-regularization, L1-regularization, and L2-regularization: which yield convex objectives?
Which yield unique solutions? Which yield sparse solutions?
5. What is the effect of λ in L1-regularization on the sparsity level of the solution? What is the effect of
λ on the two parts of the fundamental trade-off?
6. Suppose you have a feature selection method that tends not to generate false positives but has many false negatives (it misses relevant variables). Describe an ensemble method for feature selection that could improve the performance of this method.
7. Suppose a binary classification dataset has 3 features. If this dataset is “linearly separable”, what does
this precisely mean in three-dimensional space?
8. When searching for a good w for a linear classifier, why do we use the logistic loss instead of just
minimizing the number of classification errors?
9. What are “support vectors” and what’s special about them?
10. What is a disadvantage of using the perceptron algorithm to fit a linear classifier?
11. Why would we use a multi-class SVM loss instead of using binary SVMs in a one-vs-all framework?
12. How does the hyper-parameter σ affect the shape of the Gaussian RBF bumps? How does it affect the fundamental trade-off?