CS 541 Artificial Intelligence: Homework 3
Gradient Calculation
Suppose $x \in \mathbb{R}^d$ and $y \in \mathbb{R}$ are known. Calculate the gradient of the following functions:
• Sigmoid function $F(w) = \frac{1}{1 + e^{-x \cdot w}}$;
• Logistic loss $F(w) = \log\left(1 + e^{-y x \cdot w}\right)$.
Note: we take both $x$ and $w$ as column vectors.
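
Once you have a closed form, it is easy to sanity-check it numerically. Below is a minimal Python sketch (my own addition, not part of the assignment) that compares an analytic gradient of the sigmoid function against central finite differences; the identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ used in grad_F is the standard one, so substitute whatever expression you derive.

import numpy as np

# Sanity-check sketch: analytic gradient of F(w) = 1 / (1 + exp(-x.w))
# vs. central finite differences. Not part of the assignment.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def F(w, x):
    return sigmoid(x @ w)

def grad_F(w, x):
    # Standard chain-rule identity: sigma'(z) = sigma(z) * (1 - sigma(z)).
    s = sigmoid(x @ w)
    return s * (1.0 - s) * x

rng = np.random.default_rng(0)
d = 5
x, w = rng.standard_normal(d), rng.standard_normal(d)

eps = 1e-6
num_grad = np.array([
    (F(w + eps * e, x) - F(w - eps * e, x)) / (2 * eps)
    for e in np.eye(d)          # rows of the identity are the unit vectors
])
print(np.max(np.abs(num_grad - grad_F(w, x))))  # should be near zero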
Linear Regression
Suppose we are given a data set $\{(x_i, y_i)\}_{i=1}^n \subset \mathbb{R}^d \times \mathbb{R}$, where each $x_i$ is a row vector. We hope to learn a mapping $f$ such that each $y_i$ is approximated by $f(x_i)$. A popular approach is to fit the data with linear regression -- it assumes there exists $w \in \mathbb{R}^d$ such that $y_i \approx w \cdot x_i$. Learning $w$ from the data then typically boils down to solving the following least-squares program:
$$\min_{w \in \mathbb{R}^d} F(w) := \frac{1}{2} \|y - Xw\|_2^2, \qquad (1)$$
where $X$ is the data matrix with the $i$th row being $x_i$, and $y = (y_1, y_2, \ldots, y_n)^\top$.
1. Compute the gradient and the Hessian matrix of $F(w)$, and show that (1) is a convex program.
2. Note that (1) is equivalent to the following:
$$\min_{w \in \mathbb{R}^d} \|y - Xw\|_2^{100},$$
in the sense that any minimizer of (1) is also an optimum of the above, and vice versa. State why we stick with the least-squares formulation.
3. State a possible condition on $X$ such that $F(w)$ is strongly convex. Under which condition on $X$ is the objective function not strongly convex?
4. Consider $n = 100$ and $d = 40$. Generate the data matrix $X \in \mathbb{R}^{n \times d}$ and the response $y \in \mathbb{R}^n$, for example, using the python API numpy.random.randn. Then calculate the exact solution $w^* = (X^\top X)^{-1} X^\top y$ of (1). Use a python API to calculate the minimum and maximum eigenvalues of the Hessian matrix, and derive the upper bound on the learning rate $\eta$ in gradient descent. Let us denote this theoretical bound by $\eta_0$. Run GD on the data set with 6 choices of learning rate: $\eta \in \{0.01\eta_0,\ 0.1\eta_0,\ \eta_0,\ 2\eta_0,\ 20\eta_0,\ 100\eta_0\}$. Plot the curve of "$\|w_t - w^*\|_2$ vs. $t$" for $1 \le t \le 100$ and summarize your observations. Note that you can start GD with $w_0 = 0$. (A starter sketch is given after this problem list.)
5. Consider $n = 100$ and $d = 200$, and generate $X$ and $y$. What happens when you try to calculate the closed-form solution $w^*$? In this case, can we still apply GD? If yes, derive the theoretical bound $\eta_0$ and run GD with the same 6 choices of $\eta$ as before. Plot the curve of "$F(w_t)$ vs. $t$" for $1 \le t \le 100$ and summarize your observations. (See the second sketch after the list.)
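
For part 4, a minimal starter sketch follows. It is my own, not prescribed by the assignment: in particular it takes $\eta_0 = 2/\lambda_{\max}(X^\top X)$, one common reading of "the upper bound on the learning rate" (the classical stability bound for GD on a quadratic); substitute whatever bound you derive. It also uses the standard facts that for (1) the gradient is $X^\top(Xw - y)$ and the Hessian is $X^\top X$.

import numpy as np
import matplotlib.pyplot as plt

# Starter sketch for part 4. Assumption: eta0 = 2 / lambda_max(X^T X);
# replace with your own derived bound.
rng = np.random.default_rng(0)
n, d = 100, 40
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

H = X.T @ X                                  # Hessian of F(w) = 0.5 * ||y - Xw||^2
w_star = np.linalg.solve(H, X.T @ y)         # exact solution (X^T X)^{-1} X^T y
eigs = np.linalg.eigvalsh(H)                 # eigenvalues of the symmetric Hessian
lam_min, lam_max = eigs[0], eigs[-1]
eta0 = 2.0 / lam_max                         # theoretical stability bound (assumed)

T = 100
for c in [0.01, 0.1, 1.0, 2.0, 20.0, 100.0]:
    eta = c * eta0
    w = np.zeros(d)                          # start GD at w_0 = 0
    errs = []
    for t in range(1, T + 1):
        w = w - eta * (H @ w - X.T @ y)      # gradient step: grad F = X^T(Xw - y)
        errs.append(np.linalg.norm(w - w_star))
    # large eta diverges (expect overflow warnings -- that is the observation)
    plt.semilogy(range(1, T + 1), errs, label=f"eta = {c} * eta0")

plt.xlabel("t")
plt.ylabel("||w_t - w*||_2")
plt.legend()
plt.show()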
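
And for part 5, a sketch of the same ingredients with $d > n$, under the same $\eta_0$ convention (here run at $0.1\eta_0$ for illustration). It relies on the standard observation that $X^\top X$ has rank at most $n < d$, so it is singular and cannot be inverted, while GD remains applicable since $F$ is still convex and smooth.

import numpy as np

# Sketch for part 5. Same assumed eta0 = 2 / lambda_max convention as above.
rng = np.random.default_rng(1)
n, d = 100, 200
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

H = X.T @ X
print(np.linalg.matrix_rank(H))          # at most n = 100 < d = 200: H is singular,
                                         # so (X^T X)^{-1} does not exist
eta = 0.1 * (2.0 / np.linalg.eigvalsh(H)[-1])

w = np.zeros(d)
F_vals = []
for t in range(1, 101):
    w = w - eta * (H @ w - X.T @ y)      # GD still applies: F is convex and smooth
    F_vals.append(0.5 * np.linalg.norm(y - X @ w) ** 2)
print(F_vals[0], F_vals[-1])             # F(w_t) decreases even though w* has no closed form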