$30
COMS 4771 HW2
You are allowed to work in groups of (at max) three students. These group members don’t necessarily have to be the same from previous homeworks. Only one submission per group is required
by the due date on Gradescope. Name and UNI of all group members must be clearly specified on
the homework. You must cite all the references you used to do this homework. You must show your
work to receive full credit.
1 [Multi-class classification] In class, we have primarily been focusing on binary classifiers.
In this problem, we will study how one can use binary classification to do multiclass classification. Suppose we want to construct a k-class classifier f that maps data from some input
space X to the label space {1, 2, ..., k}. There are two popular methods to construct f from
just combining results from multiple binary classifiers.
– one-vs-rest (OvR) technique – for each class i ∈ [k], one can view the classification
problem as computing a function fi
: R
d → {class i, not class i} (i.e., assigning examples from class i the label 1 and all other classes the label 0). One can combine the
results for each fi
to construct a multiclass classifier f.
– one-vs-one (OvO) technique – For each pair of classes i, j ∈ [k] (i, j distinct), one can
view the classification problem as computing a function fij : R
d → {class i, class j}
(taking only training points with labels i or j). One can combine the results for each fij
to construct a multiclass classifier f.
(i) Describe how one can design a multiclass classifier f from OvR and OvO techniques.
Make sure the precisely (mathematically) define f for each technique. For a new test
example, how many calls to binary classification are made in each technique?
(ii) Assuming that your base binary classifiers can only be linear, show a training dataset
for each of the following cases. (Your example training dataset for each case must have
the following properties - (i) number of classes in the dataset k 2, (ii) dataset contains equal number of datapoints per class, and (iii) each class contains at least two
datapoints.)
– OvR gives better accuracy over OvO
– OvO gives better accuracy over OvR
– For any ? 0, both OvO and OvR give accuracy of at most ?. (your example
training set can depend on ?)
– For any ? 0, both OvO and OvR give accuracy of at least 1 − ?. (your example
training set can depend on ?)
(iii) From part (i) we observed that OvO/OvR techniques made certain number of calls to
binary classification during test time. Suppose our goal is to minimize the number of
1
calls made to binary classification during test time (let’s call this quantity c). Propose a
technique to construct a k-class classifier f from binary classifiers that minimizes c. For
your proposed technique, what is c? (i.e., express it in terms of parameters of the data,
such as, number of classes k, number of datapoints n, dimensionality of your dataset d,
etc). Prove that your technique is indeed minimizes c, that is, there is no other technique
that makes fewer binary classification calls than your technique during test time and still
achieve comparable accuracy.
2 [Constrained optimization] Show that the distance from the hyperplane g(x) = w·x+w0 =
0 to a point xa is |g(xa)|/kwk by minimizing the squared distance kx − xak
2
subject to the
constraint g(x) = 0.
3 [A better output Perceptron algorithm guarantee] In class, we saw that when the training
sample S is linearly separable with a maximum margin γ 0, then the Perceptron algorithm
run cyclically over S is guaranteed to converge after T ≤
R/γ?2
updates, where R is the
radius of the sphere containing the sample points. This does not guarantee however that the
hyperplane solution returned by Perceptron, i.e. wT achieves a margin close to γ.
(i) Show an example training dataset S in R
2
that has margin γ, and an order of updates
made by the Perceptron algorithm where the hyperplane solution returned has arbitrarily
bad margin on S.
(ii) Consider the following modification to the perceptron algorithm:
Modified Perceptron Algorithm
Input: training dataset S = (xi
, yi)i=1,...,n
Output: learned vector w
- Initialize w0 := 0, t := 0
- while there exists an example (x, y) ∈ S, such that 2y(wt · x) ≤ γkwtk
- set wt+1 := wt + yx
- set t := t + 1
- return wt.
(a) If the Modified Perceptron Algorithm (MPA) terminates after T rounds, what margin guarantee is achieved by the hyperplane wT returned by MPA? Justify your
answer.
(b) We will now prove step-by-step the mistake bound for the Modified Perceptron
Algorithm (MPA) algorithm.
i. Show that after T rounds T γ ≤ kwT k, and observe that if kwT k < 4R2/γ,
then T < 4R2/γ2
.
In what follows, we will assume that kwT k ≥ 4R2/γ.
ii. Show that for any iteration t when mistake was made, the following holds:
kwtk
2 ≤ (kwt−1k + γ/2)2 + R
2
.
iii. Infer from that that for any iteration t, we have
kwtk ≤ kwt−1k + γ/2 +
R2
kwt−1k + kwtk + γ/2
.
iv. Using the previous question, show that for any iteration tsuch that either kwt−1k ≥
4R
2
γ
or kwtk ≥ 4R
2
γ
, we have
kwtk ≤ kwt−1k +
3
4
γ.
v. Show that kw0k ≤ R ≤ 4R2/γ. Since by assumption we have kwT k ≥ 4R
2
γ
,
conclude that there must exist some largest iteration t0 such that kwt0−1k ≤
4R
2
γ
and kwt0
k ≥ 4R
2
γ
.
vi. Show that kwT k ≤ kwt0−1k +
3
4
T γ, and finally deduce the mistake bound.
4 [Making data linearly separable by feature space mapping] Consider the infinite dimensional feature space mapping
Φσ : R → R
∞
x 7→
?
1
|α − x| < σ
· exp
− 1/(1 − (|α − x|/σ)
2
)
?
?
α∈R
.
(It may be helpful to sketch the function f(α) := 1
|α| < 1
· exp
− 1/(1 − α
2
)
?
for
understanding the mapping and answering the questions below)
(i) Show that for any n distinct points x1, . . . , xn, there exists a σ 0 such that the mapping
Φσ can linearly separate any binary labeling of the n points.
(ii) Show that one can efficiently compute the dot products in this feature space, by giving
an analytical formula for Φσ(x) · Φσ(x
0
) for arbitrary points x and x
0
.
(iii) Given an input space X and a feature space mapping φ that maps elements from X to a
(possibly infinite dimensional) inner product space V . Let K : X × X → R be a kernel
function that can efficiently compute the inner products in V , that is, for any x, x0 ∈ X,
K(x, x0
) = φ(x) · φ(x
0
).
Consider a binary classification algorithm that predicts the label of an unseen instance
according to the class with the closest average. Formally, given a training set S =
(x1, y1), . . . ,(xm, ym), for each y ∈ {±1} define
cy :=
1
my
X
i:yi=y
φ(xi),
where my = |{i : yi = y}|. Assume that m+ and m− are nonzero. Then, the algorithm
outputs the following decision rule:
h(x) := (
1 kφ(x) − c+k ≤ kφ(x) − c−k
0 otherwise.
(a) Let w := c+ − c− and let b =
1
2
(kc−k
2 − kc+k
2
). Show that
h(x) = sign(hw, φ(x)i + b).
(b) Show that h(x) can be expressed via K(·, ·), without accessing individual entries
of φ(x) or w, thus showing that h(x) is efficiently computable.
5 [Predicting Restaurant-reviews via Perceptrons]
Download the review dataset hw2data 1.zip from the course website. This data set is
comprised of reviews of restaurants in Pittsburgh; the label indicates whether or not the
reviewer-assigned rating is at least four (on a five-point scale). The data are in CSV format (where the first line is the header); the first column is the label (label; 0 or 1), and
the second column is the review text (text). The text has been processed to remove nonalphanumeric symbols and capitalization. The data has already been separated into training
data reviews tr.csv and test data and reviews te.csv.
The first challenge in dealing with text data is to come up with a reasonable “vectorial” representation of the data so that one can use standard machine learning classifiers with it.
Data-representations
In this problem, you will experiment with the following different data representations.
1. Unigram representation.
In this representation, there is a feature for every word t, and the feature value associated
with a word t in a document d is
tf(t; d) := number of times word t appears in document d.
(tf is short for term frequency.)
2. Term frequency-inverse document frequency (tf-idf) weighting.
This is like the unigram representation, except the feature associated with a word t in a
document d from a collection of documents D (e.g., training data) is
tf(t; d) × log10(idf(t; D)),
where tf(t; d) is as defined above, and
idf(t; D) := |D|
number of documents in D that contain word t
.
This representation puts more emphasis on rare words and less emphasis on common
words. (There are many variants of tf-idf that are unfortunately all referred to by the
same name.)
Note: When you apply this representation to a new document (e.g., a document in the
test set), you should still use the idf defined with respect to D. This, however, becomes
problematic if a word t appears in a new document but did not appear in any document
in D: in this case, idf(t; D) = |D|/0 = ∞. It is not obvious what should be done in
these cases. For this homework assignment, simply ignore words t that do not appear in
any document in D.
4
3. Bigram representation.
In addition to the unigram features, there is a feature for every pair of words (t1, t2)
(called a bigram), and the feature value associated with a bigram (t1, t2) in a given
document d is
tf((t1, t2); d) := number of times bigram (t1, t2) appears consecutively in document d.
In the sequence of words “a rose is a rose”, the bigrams that appear are: (a, rose), which
appears twice; (rose, is); and (is, a).
For all the of these representations, ensure to do data “lifting”.
Online Perceptron with online-to-batch conversion
Once can implement the Online Perceptron algorithm with the following online-to-batch conversion process.
– Run Online Perceptron to make two passes through the training data. Before each pass,
randomly shuffle the order of the training examples. Note that a total of 2n + 1 linear
classifiers wˆ1, . . . , wˆ2n+1 are created during the run of the algorithm, where n is the
number of training examples.
– Return the linear classifier wˆfinal given by the simple average of the final n + 1 linear
classifiers:
wˆfinal :=
1
n + 1
2
Xn+1
i=n+1
wˆi
.
Recall that as Online Perceptron is making its pass through the training examples, it only
updates its weight vector in rounds in which it makes a prediction error. So, for example, if
wˆ1 does not make a mistake in classifying the first training example, then wˆ2 = ˆw1. You can
use this fact to keep the memory usage of your code relatively modest.
(i) What are some potential issued with Unigram representation of textual data?
(ii) Implement the online Perceptron algorithm with online-to-batch conversion. You must
submit your code via Courseworks to receive full credit.
(iii) Which of the three representations give better predictive performance? As always, you
should justify your answer with appropriate performance graphs demonstrating the superiority of one representation over the other. Example things to consider: you should
evaluate how the classifier behaves on a holdout ‘test’ sample for various splits of the
data; how does the training sample size affects the classification performance; etc.
(iv) For the classifier based on the unigram representation, determine the 10 words that have
the highest (i.e., most positive) weights; also determine the 10 words that have the lowest
(i.e., most negative) weights.
5
6 [Neural networks as universal function approximators] Here we will experimentally verify
that (feed-forward) neural networks can approximate any smooth function. Recall that a feedforward neural network is simply a combination of multiple ‘neurons’ such that the output of
one neuron is fed as the input to another neuron. More precisely, a neuron νi
is a computational unit that takes in an input vector ~x and returns a wighted combination of the inputs (plus
the bias associated with neuron νi) passed through an activation function σ : R → R, that is:
νi(~x; ~wi
, bi) := σ( ~wi
· ~x + bi). With this notation, we can define a layer ` of a neural network
N `
that takes in an I-dimensional input and returns a O-dimensional output as
N
`
(x) :=
ν
`
1
(~x), ν`
2
(~x), · · · , ν`
O(~x)
?
x ∈ R
I
= σ(WT
` ~x +~b`) W` ∈ R
dI×dO defined as [ ~w
`
1
, . . . , ~w`
O],
and ~b` ∈ R
dO defined as [b
`
1
, . . . , b`
O]
T
.
Here ~w
`
i
and b
`
i
refers to the weight and the bias associated with neuron ν
`
i
in layer `, and the
activation function σ is applied pointwise. An L-layer (feed-forward) neural network FL-layer
is then defined as a network consisting of network layers N 1
, . . . , N L, where the input to
layer i is the output of layer i − 1. By convention, input to the first layer (layer 1) is the actual
input data.
(i) Consider a nonlinear activation function σ : R → R defined as x 7→ 1/(1 + e
−x
). Show
that ∂σ
∂x = σ(x)(1 − σ(x)).
(ii) Consider a single layer feed forward neural network that takes a dI -dimensional input and returns a dO-dimensional output, defined as F1-layer(x) := σ(WT x + b) for
some dI × dO weight matrix W and a dO × 1 vector b. Given a training dataset
(x1, y1), . . . ,(xn, yn), we can define the average error (with respect to the network
parameters) of predicting yi from input example xi as:
E(W, b) := 1
2n
Xn
i=1
kF1-layer(xi) − yik
2
.
What is ∂E
∂W , and ∂E
∂b ?
(note: we can use this gradient in a descent-type procedure to minimize this error and
learn a good setting of the weight matrix that can predict yi from xi
.)
(iii) Although reasonably expressive, single layer neural networks are not flexible enough
to approximate arbitrary smooth functions. Here we will focus on approximating onedimensional functions f : R → [0, 1] (the range of the function is taken in the interval
[0, 1] only for convenience, one can transform any bounded function to this interval
by simple scaling and translating) with a multi-layer neural network. Consider a Llayer neural network FL-layer as described above. Given a sample of input-output pairs
(x1, y1), . . . ,(xn, yn) generated from an unknown fixed function f, one can approximate f from FL-layer by learning the appropriate parameters by minimizing the following
error function.
E(W1, b1, . . . , WL, bL) := 1
2n
Xn
i=1
kFL-layer(xi) − yik
2
=
1
2n
Xn
i=1
kN L
◦ · · · ◦ N 1
(xi) − yik
2
.
Assuming that our input data is one dimensional, that is each xi ∈ R and yi ∈ [0, 1],
implement a gradient descent procedure to learn the parameters of a two-layer (ie, L =
2) feed forward neural network. Note that since each yi
is 1-dimensional, layer N 2
only contains one neuron. Layer N 1
can contain an arbitrary number, say k, neurons.
Graphically the network you are implementing looks as follows.
. . .
input (x) network
output (ŷ)
Here is the pseudocode of your implementation.
Learn 2-layer neural network for 1-d functions
input: data (x1, y1), . . . ,(xn, yn),
. size of the intermediate layer k
- Initialize weight parameters (W1, b1),(W2, b2) randomly
- Repeat until convergence:
- for each training example (xi
, yi)
- compute the network output yˆi on xi
- compute gradients ∂E
∂W2
,
∂E
∂b2
,
∂E
∂W1
,
∂E
∂b1
- update weight parameters:
- Wnew
2
:= W2 − η
∂E
∂W2
, b
new
2
:= b2 − η
∂E
∂b2
- Wnew
1
:= W1 − η
∂E
∂W1
, b
new
1
:= b1 − η
∂E
∂b1
7
You must submit your code on Courseworks to receive full credit.
(iv) Download hw2data 2.mat from the course website. It contains two vector valued
variables X and Y . Each row in Y is a noisy realization of applying an unknown smooth
1-dimensional function to the corresponding row in X. You can visualize the relationship between these variables by plotting X on the x-axis and Y on the y-axis.
Use your implementation from part (iii) to learn the unknown mapping function which
can yield Y from X. Once the network parameters are learned, plot the network output
on the y-axis (in red) along with the given Y values (in blue) for each input value X on
the x-axis.
8