COMS 4771 HW2
A printed copy of the homework is due at 5:30pm in class. You must show your work to receive
full credit.
1 [Classification with ‘unsure’ option] Often it is beneficial to construct classifiers which predict a label only when they are confident, and have an option to output ‘unsure’ when they are less
confident, in exchange for incurring a penalty. If the penalty for outputting ‘unsure’ is not too
high, it may be a desirable action. Consider the penalty function
ℓ(ŷ; Y = y) =
    0     if ŷ = y
    λu    if ŷ = ‘unsure’
    λs    otherwise
where λu is the penalty incurred for outputting ‘unsure’ and λs is the penalty incurred for
mispredicting. Consider a joint distribution over the data X and labels Y .
(i) For a given test example x, what is the minimum possible penalty a classifier can yield?
(Hint: there are multiple cases depending on the value of P(Y | X = x).)
(ii) What happens if λu = 0? What happens if λu ≥ λs?
2 [Constrained optimization and duality]
(i) Show that the distance from the hyperplane g(x) = w · x + w0 = 0 to a point xa is
|g(xa)| / ‖w‖ by minimizing the squared distance ‖x − xa‖² subject to the constraint
g(x) = 0.
Consider the optimization problem

    min_{x∈Rd} ‖x‖   such that   Σ_{i=1}^d xi ≥ 5
(ii) Is this optimization problem a convex optimization problem? Why or why not?
(iii) What is the Lagrange dual of this problem?
(iv) Does strong duality hold? Why or why not?
3 [Making data linearly separable by feature space mapping] Consider the infinite dimensional feature space mapping
Φσ : R → R∞,   x ↦ ( max{ 0, 1 − |α − x| / σ } )_{α∈R} .
(It may be helpful to sketch the function f(α) := max{0, 1 − |α|} for understanding the
mapping and answering the questions below.)
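A quick way to build that intuition is to actually plot the hint function and one coordinate slice of the mapping. The snippet below is a small illustrative sketch (numpy/matplotlib; the values x = 1 and σ = 0.5 are arbitrary choices, not part of the problem):

    import numpy as np
    import matplotlib.pyplot as plt

    # f(alpha) = max{0, 1 - |alpha|} and one coordinate function of
    # Phi_sigma(x): alpha -> max{0, 1 - |alpha - x| / sigma}
    alpha = np.linspace(-3, 3, 601)
    f = np.maximum(0.0, 1.0 - np.abs(alpha))
    x0, sigma = 1.0, 0.5                      # illustrative values only
    phi = np.maximum(0.0, 1.0 - np.abs(alpha - x0) / sigma)

    plt.plot(alpha, f, label="f(alpha) = max{0, 1 - |alpha|}")
    plt.plot(alpha, phi, label="max{0, 1 - |alpha - x|/sigma}, x=1, sigma=0.5")
    plt.xlabel("alpha")
    plt.legend()
    plt.show()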
(i) Show that for any n distinct points x1, . . . , xn, there exists a σ > 0 such that the mapping
Φσ can linearly separate any binary labeling of the n points.
(ii) Show that one can efficiently compute the dot products in this feature space, by giving
an analytical formula for Φσ(x) · Φσ(x′) for arbitrary points x and x′.
4 [Perceptron case study] We shall study the relative performance of different variations of the
Perceptron algorithm on the handwritten digits dataset from HW1.
Consider a sequence of training data (x1, y1), . . . ,(xn, yn) in an arbitrary but fixed order (the
labels yi are assumed to be binary in {−1, +1}).
Perceptron V0
learning:
- Initialize w0 := 0
- for t = 1, . . . , T
  - pick example (xi, yi), where i = (t mod n + 1)
  - if yi(wt−1 · xi) < 0
    - wt := wt−1 + yi xi
  - else
    - wt := wt−1
classification:
f(x) := sign(wT · x)
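For reference, the V0 listing maps almost line-for-line onto code. The sketch below is one possible NumPy rendering for the binary case (the names X, an n×d array, and y, a vector in {−1, +1}^n, are assumptions); it tests yi(w · xi) ≤ 0 so that updates can fire from the all-zeros start, whereas the listing writes < 0.

    import numpy as np

    def perceptron_v0(X, y, T):
        # Perceptron V0: cycle through the data in a fixed order for T steps.
        n, d = X.shape
        w = np.zeros(d)                        # w0 := 0
        for t in range(T):
            i = t % n                          # cycle through examples (0-based)
            if y[i] * np.dot(w, X[i]) <= 0:    # mistake: update
                w = w + y[i] * X[i]
        return w

    # classification: f(x) = sign(wT . x), e.g.  np.sign(X_test @ w)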
Perceptron V1
learning:
- Initialize w0 := 0
- for t = 1, . . . , T
  - pick example (xi, yi), such that i := arg min_j yj(wt−1 · xj)
  - if yi(wt−1 · xi) < 0
    - wt := wt−1 + yi xi
  - else
    - wT := wt−1; terminate
classification:
f(x) := sign(wT · x)
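V1 differs from V0 only in how the example is picked (the most-violated one at each step) and in the early termination. A corresponding sketch, under the same assumptions as above:

    import numpy as np

    def perceptron_v1(X, y, T):
        # Perceptron V1: at each step pick the example with the smallest margin.
        n, d = X.shape
        w = np.zeros(d)
        for t in range(T):
            margins = y * (X @ w)              # yj (w . xj) for every j
            i = int(np.argmin(margins))        # most-violated example
            if margins[i] <= 0:
                w = w + y[i] * X[i]
            else:
                break                          # every example has positive margin
        return w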
Perceptron V2
learning:
- Initialize w1 := 0, c0 := 0, k := 1
- for t = 1, . . . , T
  - pick example (xi, yi), where i = (t mod n + 1)
  - if yi(wk · xi) < 0
    - wk+1 := wk + yi xi
    - ck+1 := 1
    - k := k + 1
  - else
    - ck := ck + 1
classification:
f(x) := sign( Σ_{i=1}^k ci sign(wi · x) )
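V2 is the voted Perceptron: every intermediate weight vector is kept together with a count of how many steps it survived, and prediction is a weighted vote. A sketch under the same assumptions as above:

    import numpy as np

    def perceptron_v2(X, y, T):
        # Keep every weight vector w_k and its survival count c_k.
        n, d = X.shape
        ws = [np.zeros(d)]                     # w1 := 0
        cs = [0]
        for t in range(T):
            i = t % n
            if y[i] * np.dot(ws[-1], X[i]) <= 0:
                ws.append(ws[-1] + y[i] * X[i])
                cs.append(1)
            else:
                cs[-1] += 1                    # current vector survives one more step
        return ws, cs

    def predict_v2(ws, cs, x):
        # f(x) = sign( sum_k c_k sign(w_k . x) )
        vote = sum(c * np.sign(np.dot(w, x)) for w, c in zip(ws, cs))
        return np.sign(vote)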
(i) Implement the three variations of the Perceptron algorithm for the 10-way digit classification problem.
You must submit your code to the TA to receive full credit.
(ii) Which Perceptron version is better for classification? You must justify your answer with
appropriate performance graphs demonstrating the superiority of one classifier over the
other. Example things to consider: you should evaluate how the classifier behaves on a
holdout test sample for various splits of the data, and how the training sample size and
the number of passes affect the classification performance.
(iii) Implement the Kernel Perceptron as described in lecture with a high-degree (say, 5 to
10) polynomial kernel. How does it affect the classification on test data?
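As a starting point for (iii), the kernel trick only requires dot products in feature space. The sketch below uses one common polynomial kernel, k(x, x′) = (1 + x · x′)^p, which may or may not match the exact form given in lecture, and stores per-example mistake counts αi so that the implicit weight vector is Σ_i αi yi Φ(xi); as before, X, y, and the ≤ 0 test are assumptions for the binary case.

    import numpy as np

    def poly_kernel(X1, X2, degree=5):
        # One common polynomial kernel; the lecture's exact form may differ.
        return (1.0 + X1 @ X2.T) ** degree

    def kernel_perceptron(X, y, T, degree=5):
        # alpha[i] counts mistakes on example i; the implicit weight vector
        # is sum_i alpha[i] * y[i] * Phi(x_i).
        n = X.shape[0]
        K = poly_kernel(X, X, degree)          # n x n Gram matrix
        alpha = np.zeros(n)
        for t in range(T):
            i = t % n
            score = np.sum(alpha * y * K[:, i])
            if y[i] * score <= 0:
                alpha[i] += 1
        return alpha

    # prediction on a new point x: sign( sum_i alpha[i] y[i] k(x_i, x) )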