CSE 512: Homework
Extra practice problems, ungraded
1. Gradients. Compute the gradients of the following functions. Give the exact dimension of the output.
(a) Linear regression. f(x) = (1/40)‖Ax − b‖₂², A ∈ R^{20×10}.
(b) Sigmoid. f(x) = σ(c^T x), c ∈ R^5, σ(s) = 1/(1 + exp(−s)). Hint: Start by showing that σ'(s) = σ(s)(1 − σ(s)).
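Not required, but here is a sketch of the calculation the hint points at, using just the chain rule:

    σ'(s) = d/ds (1 + e^{−s})^{−1} = e^{−s}/(1 + e^{−s})² = σ(s) · e^{−s}/(1 + e^{−s}) = σ(s)(1 − σ(s)),

since 1 − σ(s) = e^{−s}/(1 + e^{−s}).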
2. Convex or not convex. Are the following sets convex or not convex? Justify your answer.
(a) S = range(A) := {x : Az = x for some z}
(b) S = {x : x ≤ −1} ∪ {x : x ≥ 1} (Read: either x ≤ −1 or x ≥ 1.)
3. Am I positive semidefinite? A symmetric matrix X is positive semidefinite if u^T X u ≥ 0 for all u. For each of the following, either prove that the matrix is positive semidefinite, or find a counterexample.
(a) X = [1 1 1; 1 1 1; 1 1 1]
(b) X = [−1 −1 −1; −1 1 −1; −1 −1 1]
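Not part of the required proof, but a quick numerical check can help you hunt for counterexamples; a minimal sketch using numpy (the matrices are the ones from parts (a) and (b)):

    import numpy as np

    Xa = np.ones((3, 3))
    Xb = np.array([[-1., -1., -1.],
                   [-1.,  1., -1.],
                   [-1., -1.,  1.]])

    # A symmetric matrix is positive semidefinite exactly when all of its eigenvalues are nonnegative.
    for name, X in [("(a)", Xa), ("(b)", Xb)]:
        eigs = np.linalg.eigvalsh(X)
        print(name, eigs, "PSD" if eigs.min() >= -1e-12 else "not PSD")

    # If an eigenvalue is negative, its eigenvector u is an explicit witness that u^T X u < 0.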
4. Convex or not convex. (1pt, 0.125 points each.) From lecture, we know that there are three ways of checking
whether a function is convex or not.
For any function, we can check if it satisfies the definition of convexity:
f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y), ∀x, y, ∀0 ≤ θ ≤ 1.
For any differentiable function, we can check the first-order condition
    f(x) − f(y) ≥ ∇f(y)^T (x − y).
For any twice-differentiable function, we can check the second-order condition
    ∇²f(x) is positive semidefinite, i.e. u^T ∇²f(x) u ≥ 0 ∀u.
Use any of these ways to determine whether or not each function is convex. (You only need to use one of these
rules per function. Pick the one you think gets you to the answer the fastest!)
(a) f(x) = (1/2)(x[1]² − 2x[1]x[2] + x[2]²)
(b) f(x) = |x| Hint: Again, remember the triangle inequality.
(c) f(x) = log(exp(x[1]) + exp(x[2]))
(d) f(x) = c^T x
Main assignment, graded
1. Gradient properties. (1 pt, 0.5 pts each.) Prove the following two properties of gradients:
(a) Linearity. If h(x) = αf(x) + βg(x), then ∇h(x) = α∇f(x) + β∇g(x).
(b) Chain rule. Show that if g(v) = f(Av), then ∇g(v) = A^T ∇f(Av).
2. Gradients. (1 pt, 0.5 pts each.) Compute the gradients of the following functions. Give the exact dimension of
the output.
(a) Quadratic function. f(x) = (1/2)x^T Qx + p^T x + r, Q ∈ R^{12×12} and Q is symmetric (Q[i,j] = Q[j,i]).
(b) Softmax function. f(x) = (1/µ) log(Σ_{i=1}^{8} exp(µ x[i])), x ∈ R^8, where µ is a positive scalar.
3. Convex or not convex. (1pt) Are the following sets convex or not convex? Justify your answer.
(a) (0.3 pts) S = {x : Σ_i x[i] = 0}
(b) (0.2 pts) S = {(x, y) : x² + y² = 1}
(c) (0.2 pts) S = {x : |x| ≤ 1}. Hint: Remember the triangle inequality.
4. Am I positive semidefinite? (1 pt, 0.5 pts each) A symmetric matrix X is positive semidefinite if u^T X u ≥ 0 for all u. For each of the following, either prove that the matrix is positive semidefinite, or find a counterexample.
(a) X = [1 2 3; 2 1 4; 3 4 1]
(b) X = [1 0 0; 0 2 0; 0 0 3]
5. Convex or not convex. (1pt, 0.25 points each.)
Use any of the three ways (definition, first order condition, or second order condition) to determine whether or not
each function is convex. (You only need to use one of these rules per function. Pick the one you think gets you to
the answer the fastest!)
(a) f(x) = 1/x, for x > 0
(b) f(x) = ‖x‖∞
(c) f(x) = x[3]³ + x[2]² + x[1]
(d) f(x) = ‖Ax − b‖₂²
6. Polyfit via linear regression. (3 pts)
Download weatherDewTmp.mat. Plot the data. It should look like the following:
[Figure: the dew point readings plotted against week number.]
We want to form a polynomial regression of this data. That is, given w = weeks and d = dew readings, we want to find θ_1, ..., θ_p as the solution to

    minimize_{θ∈R^p}  (1/2) Σ_{i=1}^{m} (θ_1 + θ_2 w_i + θ_3 w_i² + · · · + θ_p w_i^{p−1} − d_i)².    (1)
Form X and y such that (1) is equivalent to the least squares problem

    minimize_{θ∈R^p}  (1/2)‖Xθ − y‖₂².    (2)
That is, for w the vector containing the week numbers, and y containing the dew data, form

    X = [ 1   w_1   w_1²   w_1³   · · ·   w_1^{p−1} ]
        [ ⋮    ⋮     ⋮      ⋮              ⋮        ]
        [ 1   w_m   w_m²   w_m³   · · ·   w_m^{p−1} ]
(a) Linear regression.
i. Write down the normal equations for problem (2).
ii. Fill in the code to solve the normal equations for θ, and use it to build a predictor; one possible setup is sketched after part (a). To verify your code is running correctly, the number after check number should be 1.341.
iii. Implement a polynomial fit of orders p = 1, 2, 3, 10, 100, for the weather data provided. Include a figure
that plots the original signal, overlaid with each polynomial fit. Comment on the “goodness of fit” for
each value of p.
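One possible setup for part (a), referenced in item ii above. This is only a sketch: the .mat variable names ("weeks", "dew") and the use of scipy.io to read the file are assumptions about the provided data, not something the problem specifies:

    import numpy as np
    from scipy.io import loadmat

    data = loadmat("weatherDewTmp.mat")
    w = data["weeks"].ravel()              # hypothetical key: week numbers
    d = data["dew"].ravel()                # hypothetical key: dew point readings

    p = 3
    X = np.vander(w, N=p, increasing=True) # columns 1, w, w^2, ..., w^(p-1), as in the problem
    y = d

    # Normal equations for (2): (X^T X) theta = X^T y
    theta = np.linalg.solve(X.T @ X, X.T @ y)
    pred = X @ theta                       # fitted values, for plotting the overlay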
(b) Ridge regression. Oftentimes, it is helpful to add a regularization term to (2), to improve stability. In other
words, we solve
    minimize_{θ∈R^p}  (1/2)‖Xθ − y‖₂² + (ρ/2)‖θ‖₂²    (3)

for some ρ > 0.
i. Again, write down the normal equations for (3). Your equation should be of form Aθ = b for some matrix
A and vector b that you specify.
ii. Write the code for solving the ridge regression problem and run it; a sketch appears at the end of part (b). To verify your code is running correctly, the number after check number should be 1.206.
iii. Using ρ = 1.0, plot the weather data with overlaying polynomial fits with ridge regression. Provide these
plots for p = 1, 2, 3, 10, 100. Comment on the “goodness of fit” and the stability of the fit, and also
compare with the plots generated without using the extra penalty term.
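A corresponding sketch for the ridge problem (3), referenced in item ii above; it assumes X and y have been built as in the previous sketch:

    rho = 1.0
    p = X.shape[1]

    # One common form of the regularized normal equations: (X^T X + rho I) theta = X^T y
    A = X.T @ X + rho * np.eye(p)
    b = X.T @ y
    theta_ridge = np.linalg.solve(A, b)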
(c) Conditioning.
i. An unconstrained quadratic problem is any problem that can be written as
    minimize_θ  (1/2)θ^T Qθ + c^T θ + r    (4)
for some symmetric positive semidefinite matrix Q, and some vector c and some scalar r. Show that the
ridge regression problem (3) is an unconstrained quadratic problem by writing down Q, c, and r in terms
of X and y such that (4) is equivalent to (3). Show that the Q you picked is positive semidefinite.
ii. In your code, write a function that takes in X and y, constructs Q as specified in the previous problem, and returns the condition number of Q. Report the condition number κ(Q) for varying values of p and ρ, by filling in the following table. Here, m = 742 is the total number of data samples. Report at least 2 significant digits. Comment on how much ridge regression is needed to affect conditioning. (A sketch of this function, together with the mean squared error used in part iii, appears after part (c).)
      p | ρ = 0 | ρ = m | ρ = 10m | ρ = 100m
      1 |       |       |         |
      2 |       |       |         |
      5 |       |       |         |
     10 |       |       |         |
iii. Under the same experimental parameters as the previous question, run ridge regression for each choice of
p and ρ, and fill in the table with the mean squared error of the fit:
    mean squared error = (1/m) Σ_{i=1}^{m} (x_i^T θ − y[i])²

where x_i is the ith row of X. Comment on the tradeoff between using larger ρ to improve conditioning vs. its effect on the final performance.
      p | ρ = 0 | ρ = m | ρ = 10m | ρ = 100m
      1 |       |       |         |
      2 |       |       |         |
      5 |       |       |         |
     10 |       |       |         |
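For parts (c) ii–iii, a minimal sketch of the two quantities being tabulated. The Q below assumes the usual ridge form; if your part (c) i derivation gives a different Q, use that instead:

    import numpy as np

    def condition_number(X, y, rho):
        # Assumed form of Q for the quadratic (4); check it against your own part (c) i answer.
        # y is unused here but kept to match the requested signature.
        Q = X.T @ X + rho * np.eye(X.shape[1])
        return np.linalg.cond(Q)

    def mean_squared_error(X, y, theta):
        # (1/m) * sum_i (x_i^T theta - y[i])^2
        r = X @ theta - y
        return np.mean(r ** 2)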
(d) Forecasting. Picking your favorite set of hyperparameters (p, ρ), forecast the next week’s dew point temperature.
Plot the forecasted data over the current observations. Do you believe your forecast? Why?
7. (2 pts) Logistic regression for Binary MNIST
(a) (0.5 pt) What is the gradient of the logistic loss function

    f(θ) = −(1/m) Σ_{i=1}^{m} log(σ(y_i x_i^T θ)),    σ(s) = 1/(1 + e^{−s}),

where y_i ∈ {−1, 1}? (Hint: Check out problem 1 again.)
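A standard way to sanity-check your answer to part (a) once you code the loss in part (b) is to compare the analytic gradient against finite differences. A minimal sketch, assuming you have written a function loss(theta, X, y) that returns the scalar logistic loss (the name is a placeholder, not the notebook's actual function name):

    import numpy as np

    def numerical_gradient(loss, theta, X, y, eps=1e-6):
        # Central differences, one coordinate at a time.
        g = np.zeros_like(theta)
        for j in range(theta.size):
            e = np.zeros_like(theta)
            e[j] = eps
            g[j] = (loss(theta + e, X, y) - loss(theta - e, X, y)) / (2 * eps)
        return g

    # Compare against your analytic gradient on a small random instance, e.g.
    # np.allclose(numerical_gradient(loss, theta, X, y), analytic_grad, atol=1e-4)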
(b) Coding gradient descent. Do not use scikit-learn or other built-in tools for this exercise. Please only use the packages that are already imported in the notebook. (This is to test your understanding of the basic machine learning concepts, like calculating accuracy and logistic loss; in the future you can use whatever tools you’d like.)
Open mnist logreg.ipynb. We will use logistic regression to differentiate 4’s from 9’s, a notoriously tricky problem. Run the first box to see what the data looks like.
In the second box I have set up the problem for you by pulling out the train and test set, selecting out
only the data samples related to 4’s and 9’s. I have not altered the data in any other way. While other
normalization tricks will help you improve the accuracy, for the purposes of this exercise we will forgo
them, so that it’s easy to compare everyone’s solutions.
Fill in the next box by providing code that will return the loss function value and the gradient. Make
sure that everything is normalized, e.g. don’t forget the 1/m term in the front of our loss function. Run
the script. If done correctly, you should see
45.192, 12343.177
Write a script that returns the classification accuracy given θ.
Use gradient descent to minimize the logistic loss for this classification problem. Use a step size of 10⁻⁶.
(1 pt) Run for 1500 iterations. In your report, give the plot of the train loss and train/test misclassification
rate, plotted as a function of iterations. Report the final train and test accuracy values.
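A sketch of the gradient descent loop for this part. It assumes helper functions loss_and_grad(theta, X, y) and accuracy(theta, X, y) like the ones the notebook asks you to fill in (these names, and X_train/y_train/X_test/y_test, are placeholders for whatever the notebook setup actually provides):

    step = 1e-6
    theta = np.zeros(X_train.shape[1])
    history = []

    for it in range(1500):
        loss, grad = loss_and_grad(theta, X_train, y_train)
        theta = theta - step * grad
        if it % 10 == 0:  # log occasionally; full-data evaluations are expensive
            history.append((it, loss,
                            accuracy(theta, X_train, y_train),
                            accuracy(theta, X_test, y_test)))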
(c) Coding stochastic gradient descent. Do not use scikit-learn or other built-in tools for this exercise. Please only use the packages that are already imported in the notebook. Now, fill in the next box with a function that takes in θ and a minibatch B as either a list of numbers or as an np.array, and returns the minibatch gradient

    ∇_B f(θ) = (1/|B|) Σ_{i∈B} ∇f_i(θ)

where f_i(θ) is the contribution to the loss from datapoint i:

    f_i(θ) = −log(σ(y_i x_i^T θ)).
Run the script. If done correctly, you should see the number 5803.5 printed out.
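A sketch of one way to write the minibatch gradient for this part, written in terms of a hypothetical per-example gradient helper grad_fi(theta, x_i, y_i) (plug in your part (a) answer there):

    def minibatch_gradient(theta, B, X, y, grad_fi):
        # B is a list or np.array of indices into the rows of X.
        g = np.zeros_like(theta)
        for i in np.asarray(B):
            g += grad_fi(theta, X[i], y[i])
        return g / len(B)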
(d) Write a script to run stochastic gradient descent over logistic regression. When coding up the minibatching,
make sure you cycle through an entire training set once before moving on to the next epoch. Additionally,
use time() to record the runtime, and compare the performance of gradient descent and stochastic gradient
descent, using a minibatch size of 50 data samples and running for 50000 iterations. Return a plot that
compares the objective loss, train accuracy, and test accuracy between the two optimization methods, as a
function of runtime. Comment on the pros and cons of the two methods.
Important: Remember that calculating the loss function and train/test accuracy requires making full passes
through the data. If you do this at each iteration, you will not see any runtime benefit between stochastic
gradient descent and gradient descent. Therefore I recommend you log these values every 10 iterations for
gradient descent, and every 100 iterations for stochastic gradient descent.
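A sketch of the epoch-based minibatching and timing. Shuffling once per epoch and slicing consecutive minibatches guarantees every training point is visited before the next epoch starts; all helper and variable names below are placeholders carried over from the earlier sketches:

    from time import time

    batch_size = 50
    n = X_train.shape[0]
    theta = np.zeros(X_train.shape[1])
    order, pos = np.random.permutation(n), 0
    t0, log = time(), []

    for it in range(50000):
        if pos + batch_size > n:          # epoch finished: reshuffle and start over
            order, pos = np.random.permutation(n), 0
        B = order[pos:pos + batch_size]
        pos += batch_size
        theta = theta - 1e-6 * minibatch_gradient(theta, B, X_train, y_train, grad_fi)
        if it % 100 == 0:                 # log sparsely so the timing comparison stays fair
            log.append((time() - t0,
                        loss_and_grad(theta, X_train, y_train)[0],
                        accuracy(theta, X_train, y_train),
                        accuracy(theta, X_test, y_test)))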
Challenge!
1. (1 pt.) Gradient descent for ridge regression. Consider the problem
    minimize_x  F(x) := (1/2)‖Ax − b‖₂² + (ρ/2)‖x‖₂²    (5)

where the first term is denoted f(x) := (1/2)‖Ax − b‖₂², A ∈ R^{m×n}, and n > m. Justify all answers.
(a) Define C = A^T A. Recall that an eigenvalue of a symmetric matrix C is λ where Cv = λv for some vector v. Show that if λ_max² is the largest eigenvalue of C and λ_min² the minimum eigenvalue of C, then for any vector u,

    λ_min²‖u‖₂ ≤ ‖Cu‖₂ ≤ λ_max²‖u‖₂.

Hint: The eigenvalue decomposition of a symmetric matrix can be written as C = V ΛV^T, where V = [v_1 v_2 · · · v_n] contains the eigenvectors v_i of C, and Λ = diag(λ_1, λ_2, · · · , λ_n) contains the corresponding eigenvalues λ_i. (That is, Cv_i = λ_i v_i.) Under certain conditions which we will just assume (namely, the eigenvalues must all have algebraic multiplicity equal to geometric multiplicity), V is an orthonormal matrix, i.e. V^T V = V V^T = I. Then, we can use this to form projections: pick any vector u; then V V^T u = V^T V u = u.
(b) Show that since n > m, if C = A^T A then λ_min(C) = 0. Hint: The nonzero eigenvalues of AA^T and of A^T A are the same.
(c) We say a function f : R^n → R is L-smooth if its gradient is L-Lipschitz, i.e. for some L > 0,

    ‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂, ∀x, y.

Is f(x) as defined in (5) L-smooth? What about F(x)?
(d) We say a function f : R^n → R is µ-strongly convex if it is tangent to a quadratic function that is strictly under it; that is, for some µ > 0,

    f(x) − f(y) ≥ ∇f(y)^T (x − y) + (µ/2)‖x − y‖₂², ∀x, y.

If this is only true for µ = 0, we say f is convex, but not strongly convex.
Is the function f(x) in (5) µ-strongly convex? What about F(x)? Hint: This problem requires less work if you first answer for F(x), and then take ρ = 0 for f(x).
(e) Linear convergence. We will now show that gradient descent on minimizing F(x) converges linearly. First,
recall that gradient descent iterates as
    x^{(t+1)} = x^{(t)} − α∇F(x^{(t)})

for some step size α > 0.
i. Show that for any point x* where ∇F(x*) = 0,

    ‖x^{(t+1)} − x*‖₂ ≤ c‖x^{(t)} − x*‖₂

for some c < 1. What is c?
ii. Use this to argue that gradient descent on (5) converges with linear complexity, i.e. the error f(x^{(t)}) − f(x*) = O(c^t). (It’s a weird convention, but we say O(c^t) is a linear rate because the loglog graph looks linear. I don’t make the rules, I just share them with you.)
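None of the challenge parts above require code, but a small numerical experiment can make the eigenvalue picture concrete. A sketch with a random A having n > m, as in the problem (the values of m, n, ρ are arbitrary choices for illustration):

    import numpy as np

    m, n, rho = 5, 8, 0.5
    A = np.random.randn(m, n)
    C = A.T @ A

    eigs = np.linalg.eigvalsh(C)
    print("smallest eigenvalue of A^T A:", eigs.min())   # numerically zero when n > m, as in part (b)
    print("largest eigenvalue of A^T A:", eigs.max())

    # The ridge term shifts every eigenvalue of A^T A up by rho;
    # compare this with what parts (c) and (d) ask about f versus F.
    eigs_ridge = np.linalg.eigvalsh(C + rho * np.eye(n))
    print("eigenvalue range with the ridge term:", eigs_ridge.min(), eigs_ridge.max())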
2. (1pt) Linear regression without strong convexity still gets linear convergence. Now consider linear regression with
    A = [1 1; 1 1],   b = [1; −1],

and ρ = 0 (no ridge). That is, we consider only

    f(x) = (1/2)‖Ax − b‖₂².
(a) Linear algebra. For a matrix A, the nullspace of A is the set
null(A) = {x : Ax = 0},
and the range of A is the set
range(A) = {Ax for any x}.
Show that for u = [1; −1] and v = [1; 1], u ∈ null(A) and v ∈ range(A).
(b) Linear decomposition theorem part 1. Show that for any vector u ∈ null(A) and v ∈ range(A^T), it must be that u^T v = 0.
(c) Linear decomposition theorem part 2. Argue also that for any vector x, there exist some u ∈ null(A) and v ∈ range(A^T) where x = u + v. Do this by providing two matrices P and Q where Px = u and Qx = v, using A = [1 1; 1 1]. (This matrix will be unique.)
(d) Linear regression doesn’t pick up nullspace components. Suppose x^{(0)} = [0; 0]. Now we run the gradient descent method

    x^{(t+1)} = x^{(t)} − α∇f(x^{(t)})    (6)

to get x^{(1)}, x^{(2)}, . . .
Show that using this initial point, Qx^{(t)} = 0 for all t, where Q is the matrix computed in the previous question.
(e) Linear regression doesn’t pick up nullspace components, another example. Now suppose that for some x^{(0)}, Qx^{(0)} = r ≠ 0. Again, we run the gradient descent method. Show that Qx^{(t)} = r for all t. (That is, r does not depend on t!)
(f) Now consider a reduced gradient descent problem, where we minimize over a scalar variable v:

    minimize_{v∈R}  g(v) = (1/2)‖ASv − Pb‖₂²,    S = (1/√2)[1; 1].

Argue that, using gradient descent

    v^{(t+1)} = v^{(t)} − α∇g(v^{(t)}),

the iterates x^{(t)} = Sv^{(t)} are exactly those outputted by (6), where x^{(0)} = Sv^{(0)}.
Hint: Start by showing that SS^T ∇f(x) = ∇f(x).
(g) Finally, show that g(v) is strongly convex in v.
(h) Show that for this specific choice of A, gradient descent converges linearly, i.e. f(x^{(t)}) − f(x*) = O(c^t), and carefully state what c is.
(i) Now consider any matrix A. Argue that this entire problem essentially shows that gradient descent applied to linear regression always converges at a linear rate, and describe what c is.
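Again optional, but the decomposition in parts (a)–(c) is easy to check numerically. A sketch; the projectors here are built from the pseudoinverse, which is just one construction (your P and Q from part (c) may be written differently):

    import numpy as np

    A = np.array([[1., 1.],
                  [1., 1.]])
    u = np.array([1., -1.])
    v = np.array([1., 1.])

    print(A @ u)                     # part (a): the zero vector, so u is in null(A)
    print(A @ np.array([0.5, 0.5]))  # part (a): equals v, so v is in range(A)
    print(u @ v)                     # part (b): 0, the two subspaces are orthogonal

    R = np.linalg.pinv(A) @ A        # orthogonal projector onto range(A^T)
    N = np.eye(2) - R                # orthogonal projector onto null(A)
    x = np.array([3., -5.])
    print(np.allclose(x, N @ x + R @ x))   # part (c): every x splits into the two components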