HW 4 Logistic Regression as ERM


1. Logistic Regression as ERM (5 points)
This problem asks you to solve an exercise that was stated in the logistic regression notes.
Show that if $L(y, t) = \log(1 + \exp(-yt))$, then
$$\frac{1}{n} \sum_{i=1}^{n} L(y_i, w^T x_i + b)$$
is proportional to the negative log likelihood for logistic regression. Therefore ERM with the
logistic loss is equivalent to logistic regression, as was stated in class.
Note: In the above expression, $y$ is assumed to be $-1$ or $1$. In the logistic regression notes
we had $y = 0$ or $1$.
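For reference, one common parameterization of the logistic regression model and its negative log likelihood for labels $y_i \in \{0, 1\}$ (the notation in the course notes may differ slightly) is
$$\sigma(t) = \frac{1}{1 + e^{-t}}, \qquad P(Y = 1 \mid x) = \sigma(w^T x + b),$$
$$-\ell(w, b) = -\sum_{i=1}^{n} \Big[\, y_i \log \sigma(w^T x_i + b) + (1 - y_i) \log\big(1 - \sigma(w^T x_i + b)\big) \Big].$$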
2. Subgradient methods for the optimal soft margin hyperplane (25 points)
In this problem you will implement the subgradient and stochastic subgradient methods for
minimizing the convex but nondifferentiable function
$$J(w, b) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, w^T x_i + b) + \frac{\lambda}{2} \|w\|^2,$$
where $L(y, t) = \max\{0, 1 - yt\}$ is the hinge loss. As we saw in class, this corresponds to the
optimal soft margin hyperplane.
(a) (5 points) Determine $J_i(w, b)$ such that
$$J(w, b) = \sum_{i=1}^{n} J_i(w, b).$$
Determine a subgradient $u_i$ of each $J_i$ with respect to the variable $\theta = [b \;\; w^T]^T$. A
subgradient of $J$ is then $\sum_i u_i$.
Note: Recall that if $f(z) = g(h(z))$ where $g : \mathbb{R} \to \mathbb{R}$ and $h : \mathbb{R}^n \to \mathbb{R}$, and both $g$ and $h$
are differentiable, then
$$\nabla f(z) = \nabla h(z) \cdot g'(h(z)).$$
If $g$ is convex and $h$ is differentiable, the same formula gives a subgradient of $f$ at $z$,
where $g'(h(z))$ is replaced by a subgradient of $g$ at $h(z)$.
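As an illustration of this rule (a generic example, not the one needed for $J_i$), take $g(t) = |t|$ and $h(z) = a^T z$ for a fixed vector $a$. Then $\nabla h(z) = a$, and $\operatorname{sign}(s)$ is a subgradient of $g$ at $s$ (any value in $[-1, 1]$ works at $s = 0$), so
$$u = a \cdot \operatorname{sign}(a^T z)$$
is a subgradient of $f(z) = |a^T z|$ at every $z$, taking, say, $\operatorname{sign}(0) = 0$.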
Download the file nuclear.mat from Canvas. The variables x and y contain training data for
a binary classification problem. The variables correspond to the total energy and tail energy
of waveforms produced by a nuclear particle detector. The classes correspond to neutrons
and gamma rays. Neutrons have a slightly larger tail energy for the same total energy relative
to gamma rays, which allows the two particle types to be distinguished. This is a somewhat
large data set ($n = 20{,}000$), and subgradient methods are appropriate given their scalability.
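If you work in Python, one way to load the data is scipy.io.loadmat. The sketch below is a minimal example; the array orientation (whether x is stored as $n \times d$ or $d \times n$) is an assumption you should check against the actual file.
import numpy as np
from scipy.io import loadmat

# Load the training data; "nuclear.mat" is assumed to be in the working directory.
data = loadmat("nuclear.mat")
x = np.asarray(data["x"], dtype=float)
y = np.asarray(data["y"], dtype=float).ravel()   # labels assumed to be in {-1, +1}

# Force an (n, d) layout for the sketches below; adjust if the file differs.
X = x.T if x.shape[0] < x.shape[1] else x
n, d = X.shape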
(b) (6 points) Implement the subgradient method for minimizing J and apply it to the nuclear
data. Submit two figures: One showing the data and the learned line, the other showing
J as a function of iteration number. Also report the estimated hyperplane parameters
and the minimum achieved value of the objective function.
Comments:
• Please execute the line
rng(0); % in Matlab
or
random.seed(0) # in Python
at the beginning of your code to seed the random number generator.
• Use λ = 0.001. Since this is a linear problem in a low dimension, we don’t need
much regularization.
• Use a step size of $\alpha_j = 100/j$, where $j$ is the iteration number.
• To compute the subgradient of $J$, write a subroutine to find the subgradient of $J_i$,
and then sum those results (a minimal Python sketch of the resulting update loop follows
these comments).
• Since the objective will not be monotone decreasing, determining a good stopping
rule can be tricky. Just look at the graph of the objective function and “eyeball it”
to decide when the algorithm has converged.
• Debugging goes faster if you just look at a subsample of the data. To debug
in Matlab, the command keyboard is very helpful. Just type help keyboard
and also look up related commands. For something analogous in Python, try
http://stackoverflow.com/questions/13432717/enter-interactive-mode-in-python.
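The following is a minimal Python sketch of the subgradient method described above, continuing from the data-loading sketch in problem 2 (X, y, n, d assumed defined there). The subgradient expressions are one valid choice and should be checked against your answer to part (a); the fixed iteration cap is a placeholder for the "eyeball" stopping rule.
import numpy as np

np.random.seed(0)          # seeding numpy's generator here; the assignment's
                           # random.seed(0) seeds Python's standard-library RNG
lam = 0.001                # regularization parameter lambda
w, b = np.zeros(d), 0.0    # initialize the hyperplane parameters
J_history = []

for j in range(1, 201):    # iteration cap is a placeholder; "eyeball" convergence
    margins = y * (X @ w + b)
    J_history.append(np.maximum(0.0, 1.0 - margins).mean() + 0.5 * lam * (w @ w))
    active = margins < 1   # points with nonzero hinge loss
    # One valid subgradient of J at (w, b); compare with your part (a) answer.
    grad_w = lam * w - (y[active][:, None] * X[active]).sum(axis=0) / n
    grad_b = -y[active].sum() / n
    alpha = 100.0 / j      # step-size rule from the comments above
    w -= alpha * grad_w
    b -= alpha * grad_b
After the loop, J_history can be plotted against the iteration index, and the learned line $w^T x + b = 0$ can be drawn over the data to produce the two required figures.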
(c) (6 points) Now implement the stochastic subgradient method, which is like the subgradient
method, except that your step direction is a subgradient of a random $J_i$, not of $J$ (a sketch
of the epoch structure follows the comments below).
Be sure to cycle through all data points before starting a new loop through the data.
Report/hand in the same items as for part (b).
More comments:
• Use the same λ, stopping strategy, and αj as in part (b). Here j indexes the number
of times you have cycled (randomly) through the data.
• Your plot of J versus iteration number will have roughly n times as many points as
in part (b) since you have n updates for every one update of the full subgradient
method.
• To generate a random permutation use
randperm
in Matlab, or
import numpy as np
np.random.permutation
in Python.
• Please reseed the random number generator as described previously.
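The following is a minimal Python sketch of the epoch structure, reusing X, y, n, d, and lam from the earlier sketches. The step direction used here is a subgradient of $L(y_i, w^T x_i + b) + \frac{\lambda}{2}\|w\|^2$, i.e. $n$ times a subgradient of $J_i$, which is one common scaling convention; adjust it to match your part (a) answer.
import numpy as np

np.random.seed(0)                      # reseed (numpy's RNG, for permutation below)
w, b = np.zeros(d), 0.0                # re-initialize before the stochastic run

for j in range(1, 51):                 # j indexes full passes ("epochs") through the data
    alpha = 100.0 / j                  # same step-size rule as part (b)
    for i in np.random.permutation(n): # visit every point once, in random order
        margin = y[i] * (X[i] @ w + b)
        hinge_active = margin < 1
        grad_w = lam * w - (y[i] * X[i] if hinge_active else 0.0)
        grad_b = -y[i] if hinge_active else 0.0
        w -= alpha * grad_w
        b -= alpha * grad_b
        # record J(w, b) here for the required plot (e.g., periodically, since
        # evaluating the full objective at every update costs O(n))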
(d) (3 points) Comment on the (empirical) rate of convergence of the stochastic subgradient
method relative to the subgradient method. Explain your finding.
(e) (5 points) Submit your code.
3. Kernels (10 points)
(a) (3 points) To what feature map Φ does the kernel
$$k(u, v) = (\langle u, v \rangle + 1)^3$$
correspond? Assume the inputs have an arbitrary dimension d and the inner product is
the dot product.
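As a sanity check for the special case $d = 1$ (not the general answer), expanding the cube gives
$$(uv + 1)^3 = u^3 v^3 + 3u^2 v^2 + 3uv + 1 = \langle \Phi(u), \Phi(v) \rangle, \qquad \Phi(u) = \big(u^3,\ \sqrt{3}\,u^2,\ \sqrt{3}\,u,\ 1\big)^T.$$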
(b) (3 points) Show that if $k_1$ and $k_2$ are inner product kernels and $a_1, a_2 \ge 0$, then $a_1 k_1 + a_2 k_2$
is an inner product kernel.
(c) (4 points) Prove one direction of an equivalence stated in the notes, namely, that if k is
an inner product kernel, then k is a symmetric, positive definite kernel.
SUPPLEMENTAL EXERCISES
Not to be turned in.
1. The Gaussian Kernel is a Kernel
In this problem you are asked to show that the Gaussian kernel
$$k(x, x') = C \exp\left(-\frac{1}{2\sigma^2} \|x - x'\|^2\right), \qquad C > 0,\ \sigma > 0,$$
is an inner product kernel (equivalently, symmetric and positive definite kernel). You will be
asked to verify various properties that let you construct kernels from other kernels, and you
will see that both characterizations of kernels (IP/SPD) are useful in this regard.
(a) Consider a sequence of symmetric, positive definite (SPD) kernels $\{k_n\}$ that converges
pointwise to a function $k$, i.e., for all $x, x'$, $\lim_{n\to\infty} k_n(x, x') = k(x, x')$. Show that $k$ is
also an SPD kernel.
(b) Prove that if $k_1(x, x')$ and $k_2(x, x')$ are SPD kernels, then so is $k(x, x') = k_1(x, x')\,k_2(x, x')$.
Hint: Consider zero mean, independent random vectors $V_1$ and $V_2$ whose covariances are
kernel matrices $K_1$ and $K_2$. Determine another random vector whose covariance is the
element-wise product of $K_1$ and $K_2$.
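One way to use the hint (a sketch of the key computation): let $V = V_1 \odot V_2$ denote the element-wise product. Then, by independence and the zero-mean assumption,
$$\mathbb{E}[V_i V_j] = \mathbb{E}[V_{1,i} V_{2,i} V_{1,j} V_{2,j}] = \mathbb{E}[V_{1,i} V_{1,j}]\, \mathbb{E}[V_{2,i} V_{2,j}] = (K_1)_{ij} (K_2)_{ij},$$
so the covariance of $V$ is the element-wise product $K_1 \odot K_2$, which is therefore positive semidefinite.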
(c) Let $f(t) = \sum_{n=0}^{\infty} a_n t^n$ be a power series that converges for all $t \in \mathbb{R}$, and for which all
$a_n \ge 0$. Argue that if $k$ is an SPD kernel, then $f(k(x, x')) = \sum_{n=0}^{\infty} a_n (k(x, x'))^n$ is also a kernel.
(d) Deduce that the exponential kernel $k(x, x') = \exp(\gamma \langle x, x' \rangle)$ is an inner product kernel,
where $\gamma > 0$ and $\langle \cdot, \cdot \rangle$ denotes the dot product on $\mathbb{R}^d$.
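The relevant power series, whose coefficients are all nonnegative when $\gamma > 0$, is
$$\exp(\gamma t) = \sum_{n=0}^{\infty} \frac{\gamma^n}{n!}\, t^n.$$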
(e) Show that if k is an inner product kernel, then so is the normalized kernel
$$\tilde{k}(x, x') = \begin{cases} 0, & \text{if } k(x, x) = 0 \text{ or } k(x', x') = 0, \\[4pt] \dfrac{k(x, x')}{\sqrt{k(x, x)\, k(x', x')}}, & \text{otherwise.} \end{cases}$$
(f) Deduce that the Gaussian kernel is an inner product kernel.
