1 The "warm-up"
1. Assume that there exists a square $S_{large}$ with side of length 5, with corners located at $(0, 0), (0, 5), (5, 0), (5, 5)$. Consider instances $\{x_i = (a_i, b_i) \in S_{large}\}$ drawn from a uniform distribution $D$ over $S_{large}$.
Write down the feature vectors for the given instances.
$$x = \begin{pmatrix} 3 & 1 \\ 4 & 1 \\ 3 & 4 \\ 4 & 4 \\ 2.5 & 2.5 \end{pmatrix}$$
2. The true labeling function $f : X \to Y$, where $Y = \{0, 1\}$, assigns label 1 to instances within the smaller square (blue) and 0 to other instances (red). The smaller square $S_{small}$ is of side length 1 and is located at the center of $S_{large}$. Fig. 1 shows $m = 5$ data points used as training data: $S = \{x_i, i = 0, \ldots, m\}$. Formally, write down the training data $S$.
$S = \left( ((3, 1), 0), ((3, 4), 0), ((4, 1), 0), ((4, 4), 0), ((2.5, 2.5), 1) \right)$
3. The hypothesis for prediction, $h_l$, is defined over $S$ as
$$h_l(x) = \begin{cases} 1 & \text{if } \exists i \in [0, m] \text{ such that } x_i = x \text{ and, for } x_i = (a_i, b_i),\ a_i > l \\ 0 & \text{otherwise} \end{cases}$$
(a) What is the training error and true error?
        training error    true error
h_2     4/5 (0.8)         1/25
h_3     3/5 (0.6)         1/25
h_5     1/5 (0.2)         1/25
(b) Using empirical risk minimization, choose the best hypothesis. Formally show that
this choice is, indeed, the best. Do any of the above hypotheses suffer from overfitting?
According to the principle of empirical risk minimization, the best hypothesis is the one with the lowest training error (empirical risk). Here $L_S(h_5) = 1/5 < L_S(h_3) = 3/5 < L_S(h_2) = 4/5$, so $h_5$ is the ERM choice.
However, the true errors are the same ($1/25$) for all three hypotheses, even though the training data seems to indicate that $h_5$ is the best.
None of the hypotheses suffers from overfitting: none of them fits the sample set closely, and none achieves a training error lower than its true error.
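As a quick check of the table and of the ERM choice, a minimal Python sketch (the sample $S$ and the rule "predict 1 iff $x$ is a training point with $a_i > l$" are taken from the problem above; the helper names are illustrative):

```python
# Recompute the training errors of h_2, h_3, h_5 over S and the ERM choice.
# The sample S and the rule "predict 1 iff x is a training point with a_i > l"
# follow the problem above; names are illustrative.
S = [((3, 1), 0), ((3, 4), 0), ((4, 1), 0), ((4, 4), 0), ((2.5, 2.5), 1)]

def h(l, x):
    """h_l: label 1 iff x is a training point whose first coordinate exceeds l."""
    return 1 if any(xi == x and xi[0] > l for xi, _ in S) else 0

def training_error(l):
    return sum(h(l, xi) != yi for xi, yi in S) / len(S)

errors = {l: training_error(l) for l in (2, 3, 5)}
print(errors)                                          # {2: 0.8, 3: 0.6, 5: 0.2}
print("ERM picks h_%d" % min(errors, key=errors.get))  # ERM picks h_5
```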
4. Consider instances of $X$ drawn from the uniform distribution $D$ on $[-1, 1]$. Let $f$ denote the actual labelling function mapping each instance to its label $y \in \{0, 1\}$ with probabilities
$$P(Y = 1 \mid x \ge 0) = 0.9, \qquad P(Y = 0 \mid x \ge 0) = 0.1,$$
$$P(Y = 1 \mid x < 0) = 0.1, \qquad P(Y = 0 \mid x < 0) = 0.9.$$
The hypothesis $h$ predicts the label for each instance as defined below:
$$h(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ 0 & \text{otherwise} \end{cases}$$
Measure the success of this predictor by calculating the training error for $h$.
The training error is $L_S(h) = \frac{|\{i \in [m] : h(x_i) \ne y_i\}|}{m}$. In expectation over the distribution,
$$L_S = P(h(x) \ne Y \wedge x < 0) + P(h(x) \ne Y \wedge x \ge 0)$$
$$= P(h(x) \ne Y \mid x < 0) \cdot P(x < 0) + P(h(x) \ne Y \mid x \ge 0) \cdot P(x \ge 0)$$
$$= 0.1 \cdot 0.5 + 0.1 \cdot 0.5 = 0.1$$
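As a numerical cross-check (not part of the derivation above), a Monte Carlo sketch that samples $x$ uniformly from $[-1, 1]$ and labels it according to the stated conditional probabilities:

```python
# Monte Carlo estimate of the error of h(x) = 1[x >= 0] under the noisy labels:
# x ~ Uniform[-1, 1], P(Y=1 | x >= 0) = 0.9, P(Y=1 | x < 0) = 0.1.
import random

random.seed(0)
m = 200_000
mistakes = 0
for _ in range(m):
    x = random.uniform(-1, 1)
    p1 = 0.9 if x >= 0 else 0.1          # P(Y = 1 | x)
    y = 1 if random.random() < p1 else 0
    prediction = 1 if x >= 0 else 0      # h(x)
    mistakes += (prediction != y)
print(mistakes / m)                      # close to 0.1
```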
2 Learning Models
1. Let $H$ be a hypothesis class for a binary classification task. Let $m_H(\epsilon, \delta)$ denote its sample complexity (depending on the two parameters $\epsilon$ and $\delta$). Show that $m_H$ is monotonically non-increasing in each parameter.
The sample complexity function satisfies $m_H(\epsilon, \delta) \le \left\lceil \frac{\log(|H|/\delta)}{\epsilon} \right\rceil$.
If we consider the function in terms of only $\epsilon$, $m_H = \frac{i}{\epsilon}$, where $i = \log\left(\frac{|H|}{\delta}\right)$.
If we consider the function in terms of only $\delta$, $m_H = j \cdot \log\left(\frac{k}{\delta}\right)$, where $j = \frac{1}{\epsilon}$ and $k = |H|$.
The sample complexity bound is inversely related to both parameters: holding the other fixed, increasing either $\epsilon$ or $\delta$ can only decrease $\left\lceil \log(|H|/\delta)/\epsilon \right\rceil$, so $m_H$ is monotonically non-increasing in each parameter. This can be rationalized: as the need for a more confident or a more precise hypothesis decreases, fewer training samples are needed.
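To see the monotonicity numerically, a small sketch evaluating the bound $\lceil \log(|H|/\delta)/\epsilon \rceil$ on a grid; the value $|H| = 1000$ is an arbitrary illustrative choice:

```python
# Evaluate the bound ceil(log(|H|/delta) / eps) on a grid to illustrate that it
# is non-increasing in both eps and delta. |H| = 1000 is an arbitrary choice.
from math import ceil, log

H_size = 1000

def bound(eps, delta):
    return ceil(log(H_size / delta) / eps)

for eps in (0.01, 0.05, 0.1, 0.5):
    row = [bound(eps, delta) for delta in (0.01, 0.05, 0.1, 0.5)]
    print(f"eps={eps}: {row}")
# Within each row (delta increasing) and each column (eps increasing),
# the value never increases.
```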
2. Let $X = \mathbb{R}^2$, $Y = \{0, 1\}$, and let $H = \{h_r : r \in \mathbb{R}_+\}$, where $h_r$ is the indicator function $h_r(x) = \mathbb{1}_{\|x\| \le r}$. Prove that under the realizability assumption, $H$ is PAC learnable. Also show that its sample complexity is bounded above as follows:
$$m_H(\epsilon, \delta) \le \left\lceil \frac{\log(1/\delta)}{\epsilon} \right\rceil$$
According to the definition of PAC learnability, a hypothesis class is PAC learnable if there exists a sample complexity function $m_H : (0,1)^2 \to \mathbb{N}$ and an algorithm such that, when run on a data set of size $m \ge m_H(\epsilon, \delta)$, the algorithm returns a hypothesis $h$ with $L_{(D,f)}(h) \le \epsilon$ with probability greater than $1 - \delta$. Here $\epsilon$ represents the accuracy of the hypothesis while $\delta$ represents the confidence in the hypothesis' accuracy.
The described hypothesis class consists of hypotheses that, for some positive real $r$, return 1 when the $\ell_2$ norm of the input is less than or equal to $r$: essentially the class of concentric circles in the plane with radius $r$ centered at the origin.
Since the domain for this problem is infinite, for any hypothesis $h_r$ with arbitrary $r \in \mathbb{R}_+$, the size of the "error region" (the area difference between the concentric circles represented by the chosen hypothesis and the labelling function), where misclassifications occur, is finite and the true error approximates 0: $L_D = 0$. So for any $\delta, \epsilon \in (0, 1)$, we can choose any hypothesis and it will always be the case that $L_{(D,f)}(h_r) \le \epsilon$. Hence, the hypothesis class is PAC learnable under the realizability assumption.
Since any hypothesis with $r \in \mathbb{R}_+$ will satisfy the requirement that $L_{(D,f)}(h_r) \le \epsilon$, no training samples are required to pick a suitable hypothesis. As a result we set $m_H = 0$. To complete this proof, we simply need to show that $0 \le \left\lceil \frac{\log(1/\delta)}{\epsilon} \right\rceil$:
$$\left\lceil \frac{\log(1/\delta)}{\epsilon} \right\rceil \ge \frac{\log(1/\delta)}{\epsilon}.$$
Since $\delta, \epsilon \in (0, 1)$,
$$\frac{\log(1/\delta)}{\epsilon} \ge \frac{0}{\epsilon} = 0.$$
So
$$0 = m_H \le \left\lceil \frac{\log(1/\delta)}{\epsilon} \right\rceil.$$
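As a side illustration of the class (not part of the argument above), a minimal sketch of the hypotheses $h_r(x) = \mathbb{1}_{\|x\| \le r}$ together with one natural ERM rule, which returns the smallest radius containing all positively labelled training points; the toy data and function names are made up:

```python
# Illustration of the class H = {h_r : r >= 0} with h_r(x) = 1[||x|| <= r],
# and one natural ERM rule: the smallest disk containing all positive examples.
# The toy sample below is made up for illustration only.
import math

def h(r, x):
    """h_r: indicator of the origin-centered disk of radius r."""
    return 1 if math.hypot(*x) <= r else 0

def erm_radius(sample):
    """Smallest r consistent with the positives (0 if there are none)."""
    positives = [x for x, y in sample if y == 1]
    return max((math.hypot(*x) for x in positives), default=0.0)

toy_sample = [((0.5, 0.5), 1), ((1.0, 0.0), 1), ((2.0, 2.0), 0), ((0.0, 3.0), 0)]
r_hat = erm_radius(toy_sample)
print(r_hat)                                   # 1.0
print([h(r_hat, x) for x, _ in toy_sample])    # [1, 1, 0, 0]
```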
3. Let $X$ be a domain and let $D_1, D_2, \ldots, D_m$ be a sequence of distributions over $X$, and let $f \in H$. Suppose we are getting a sample $S$ of $m$ examples such that the instances are independent but not identically distributed, the $i$-th instance is sampled from $D_i$, and then $y_i$ is set to be $f(x_i)$. Let $\bar{D}_m = (D_1 + \cdots + D_m)/m$.
Fix an accuracy parameter $\epsilon \in (0, 1)$. Show that
$$P\left[\exists h \in H \text{ s.t. } L_{(\bar{D}_m, f)}(h) > \epsilon \text{ and } L_{(S,f)}(h) = 0\right] \le |H| e^{-\epsilon m}$$
Since $H$ is a finite hypothesis class containing the labelling function, it is PAC learnable. In other words, there exists an algorithm that can find a hypothesis with a true error of less than $\epsilon$ with probability $1 - \delta$ when run on $m \ge m_H$ samples, where
$$m_H(\epsilon, \delta) \le \left\lceil \frac{\log(|H|/\delta)}{\epsilon} \right\rceil$$
We will first define the subset of hypotheses whose true error with respect to $\bar{D}_m$ is greater than $\epsilon$: $H_B = \{h \in H : L_{(\bar{D}_m, f)}(h) > \epsilon\}$.
We also define the set of samples on which some bad hypothesis has an empirical error of 0: $M = \{S|_x : \exists h \in H_B,\ L_S(h) = 0\}$.
The set of samples for which the learner returns a bad hypothesis is a subset of $M$: $\{S|_x : L_{(\bar{D}_m, f)}(h_S) > \epsilon\} \subseteq M$, where $h_S$ denotes the hypothesis returned on the sample $S$.
Writing $D^{(m)} = D_1 \times \cdots \times D_m$ for the distribution of the size-$m$ sample, the probability of returning a bad hypothesis is therefore at most the probability that some hypothesis in $H_B$ attains empirical error 0:
$$D^{(m)}\left(\{S|_x : L_{(\bar{D}_m, f)}(h_S) > \epsilon\}\right) \le D^{(m)}\left(\bigcup_{h \in H_B} \{S|_x : L_S(h) = 0\}\right).$$
Since $P(a \cup b) \le P(a) + P(b)$, we can say
$$D^{(m)}\left(\{S|_x : L_{(\bar{D}_m, f)}(h_S) > \epsilon\}\right) \le \sum_{h \in H_B} D^{(m)}\left(\{S|_x : L_S(h) = 0\}\right).$$
We will refer to this as equation (A) later.
Since the data points are sampled independently,
$$D^{(m)}\left(\{S|_x : L_S(h) = 0\}\right) = \prod_{i=1}^m D_i\left(\{x_i : h(x_i) = y_i\}\right) = \prod_{i=1}^m \left(1 - L_{(D_i, f)}(h)\right).$$
Using $1 - x \le e^{-x}$, together with the fact that for $h \in H_B$ the average loss satisfies $\frac{1}{m}\sum_{i=1}^m L_{(D_i, f)}(h) = L_{(\bar{D}_m, f)}(h) > \epsilon$,
$$\prod_{i=1}^m \left(1 - L_{(D_i, f)}(h)\right) \le \prod_{i=1}^m e^{-L_{(D_i, f)}(h)} = e^{-\sum_{i=1}^m L_{(D_i, f)}(h)} = e^{-m L_{(\bar{D}_m, f)}(h)} \le e^{-\epsilon m}.$$
Plugging this into equation (A),
$$D^{(m)}\left(\{S|_x : L_{(\bar{D}_m, f)}(h_S) > \epsilon\}\right) \le |H_B| e^{-\epsilon m} \le |H| e^{-\epsilon m}.$$
The union bound applied to $M = \bigcup_{h \in H_B}\{S|_x : L_S(h) = 0\}$ gives the same bound for the event that some hypothesis is simultaneously bad and consistent with the sample:
$$D^{(m)}\left(\{S|_x : \exists h \in H_B,\ L_{(S,f)}(h) = 0\}\right) \le |H| e^{-\epsilon m}.$$
This bounds, for a sample of size $m$, the probability that some hypothesis has true error greater than $\epsilon$ while its empirical error is 0:
$$P\left[\exists h \in H \text{ s.t. } L_{(\bar{D}_m, f)}(h) > \epsilon \text{ and } L_{(S,f)}(h) = 0\right] \le |H| e^{-\epsilon m}.$$
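The key analytic step above is $\prod_i (1 - \epsilon_i) \le e^{-\sum_i \epsilon_i}$, which follows from $1 - x \le e^{-x}$; a quick numerical check with arbitrary illustrative loss values:

```python
# Numerical check of the inequality used above:
# prod_i (1 - eps_i) <= exp(-sum_i eps_i), a consequence of 1 - x <= e^(-x).
# The eps_i values are arbitrary illustrative per-distribution losses in (0, 1).
import math
import random

random.seed(1)
for _ in range(5):
    losses = [random.uniform(0.0, 1.0) for _ in range(20)]   # stand-ins for L_{(D_i,f)}(h)
    lhs = math.prod(1 - e for e in losses)                   # P[h is consistent with S]
    rhs = math.exp(-sum(losses))                             # e^{-m * average loss}
    print(f"{lhs:.3e} <= {rhs:.3e}: {lhs <= rhs}")
```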
4. Let H be a hypothesis class of binary classifiers. Show that if H is agnostic PAC
learnable, then H is PAC learnable as well. Furthermore, if A is a successful agnostic
PAC learner for H, then A is also a successful PAC learner for H.
According to the definition of PAC learnability, a hypothesis class $H$ is PAC learnable if there exists a sample complexity function $m_H : (0,1)^2 \to \mathbb{N}$ and an algorithm such that for every $\epsilon, \delta \in (0, 1)$, distribution $D$ over $X$, and labelling function $f : X \to \{0, 1\}$, if the realizability assumption holds with respect to $H, D, f$, then when running the learning algorithm on $m \ge m_H(\epsilon, \delta)$ examples drawn independently and identically from $D$ and labelled by $f$, the algorithm returns a hypothesis $h$ such that, with probability greater than $1 - \delta$, $L_{(D,f)}(h) \le \epsilon$.
According to the definition of agnostic PAC learnability, a hypothesis class $H$ is agnostic PAC learnable if there exists a sample complexity function $m_H : (0,1)^2 \to \mathbb{N}$ and an algorithm such that for every $\epsilon, \delta \in (0, 1)$ and distribution $D$ over $X \times Y$, when running the learning algorithm on $m \ge m_H(\epsilon, \delta)$ examples drawn independently and identically from $D$, the algorithm returns a hypothesis $h$ such that, with probability greater than $1 - \delta$, $L_D(h) \le \min_{h' \in H} L_D(h') + \epsilon$.
If we know that $H$ is agnostic PAC learnable, there already exists a sample complexity function $m_H^{agnostic}(\epsilon, \delta)$ for it. PAC learning assumes a labelling function and a distribution over $X$, while agnostic PAC learning uses a distribution over $Z = X \times Y$; we can therefore view any PAC problem, given by $D$ over $X$ and $f : X \to \{0, 1\}$, as an agnostic problem over $Z$ in which each example $z_i = (x_i, y_i)$ has $x_i$ drawn from $D$ and $y_i = f(x_i)$. When the realizability assumption holds with respect to $H, D, f$, we have $\min_{h' \in H} L_D(h') = 0$, so the agnostic guarantee $L_D(h) \le \min_{h' \in H} L_D(h') + \epsilon$ reduces to $L_{(D,f)}(h) \le \epsilon$, which is exactly the guarantee required for PAC learning, achieved with the same sample complexity function. If the realizability assumption is not met, the definition of PAC learnability imposes no requirement, so it is still satisfied. Thus, since $H$ is agnostic PAC learnable, it is also PAC learnable.
If $A$ is a successful agnostic PAC learner for $H$, then when run on a sample of size $m \ge m_H^{agnostic}(\epsilon, \delta)$, it returns a hypothesis $h$ such that, with probability greater than $1 - \delta$, $L_D(h) \le \min_{h' \in H} L_D(h') + \epsilon$. Using the same correspondence between a PAC problem $(D, f)$ and a distribution over $Z$, if the realizability assumption holds, then with probability greater than $1 - \delta$, $L_{(D,f)}(h) \le 0 + \epsilon = \epsilon$. If it does not hold, $A$ is still, by definition, a successful PAC learner. Hence $A$ is a successful PAC learner for $H$.
5. Show that for every probability distribution $D$, the Bayes optimal predictor $f_D$ is optimal, in the sense that for every classifier $g$ from $X$ to $\{0, 1\}$, $L_D(f_D) \le L_D(g)$.
To simplify this, we can break it down case by case, conditioning on the value of $x$.
• When $P(y = 1 \mid x) \ge 0.5$: $f_D(x) = 1$, so $P[f_D(x) = y \mid x] = P(y = 1 \mid x)$; for any classifier $g$, $P[g(x) = y \mid x] \le \max\{P(y = 1 \mid x), P(y = 0 \mid x)\} = P(y = 1 \mid x)$.
• When $P(y = 0 \mid x) > 0.5$: $f_D(x) = 0$, so $P[f_D(x) = y \mid x] = P(y = 0 \mid x)$, and again $P[g(x) = y \mid x] \le P(y = 0 \mid x)$.
In every case, the probability of success for the Bayes optimal classifier is greater than or equal to the probability of success for any other classifier, $P(f_D(x) = y \mid x) \ge P(g(x) = y \mid x)$, and taking the expectation over $x$ gives $P(f_D(x) = y) \ge P(g(x) = y)$.
Since the true error is $L_D(h) = 1 - P(h(x) = y)$,
$$P(f_D(x) = y) \ge P(g(x) = y) \;\Rightarrow\; 1 - P(f_D(x) = y) \le 1 - P(g(x) = y) \;\Rightarrow\; L_D(f_D) \le L_D(g).$$
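As a small numerical illustration (the marginal and conditional probabilities below are made up), comparing the Bayes predictor against every other deterministic classifier on a three-point domain:

```python
# On a tiny domain, the Bayes predictor (predict the more probable label at each x)
# has true error no larger than any other deterministic classifier.
# The marginal P(x) and the conditionals P(y=1|x) below are made-up values.
from itertools import product

xs = [0, 1, 2]
p_x = {0: 0.5, 1: 0.3, 2: 0.2}     # P(x)
p_y1 = {0: 0.9, 1: 0.4, 2: 0.5}    # P(y = 1 | x)

def error(classifier):
    """L_D(h) = sum_x P(x) * P(y != h(x) | x)."""
    return sum(p_x[x] * ((1 - p_y1[x]) if classifier[x] == 1 else p_y1[x]) for x in xs)

bayes = {x: 1 if p_y1[x] >= 0.5 else 0 for x in xs}
best_other = min(error(dict(zip(xs, labels))) for labels in product([0, 1], repeat=3))
print(error(bayes), best_other)    # Bayes error equals the minimum over all classifiers
```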
3 Learning Algorithms
1. Explain briefly (no more than 3-4 sentences) how the k-nearest neighbors algorithm
can be used for both classification and regression.
The k-nearest neighbors algorithm uses a metric function to measure the degree of difference between any two data points based on their domain values. It uses the metric function to identify the $k$ points in the training set most similar to the input and assigns an output based on the labels of these "neighboring" points. In a classification system, the label may be chosen by a majority vote among the neighboring points, while regression systems may output an average or central value of the neighbors' labels.
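A minimal sketch of both uses, assuming Euclidean distance as the metric (plain Python; the toy data and function names are illustrative):

```python
# Minimal k-NN sketch: majority vote for classification, mean of neighbor labels
# for regression. Euclidean distance is assumed as the metric.
import math
from collections import Counter

def knn_predict(train, query, k, regression=False):
    """train: list of (point, label) pairs; query: a point; k: number of neighbors."""
    neighbors = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    labels = [label for _, label in neighbors]
    if regression:
        return sum(labels) / k                      # average of neighbor values
    return Counter(labels).most_common(1)[0][0]     # majority label

train_cls = [((0, 0), 0), ((1, 0), 0), ((5, 5), 1), ((6, 5), 1)]
train_reg = [((0, 0), 1.0), ((1, 0), 2.0), ((5, 5), 10.0), ((6, 5), 12.0)]
print(knn_predict(train_cls, (0.5, 0.5), k=3))                    # 0
print(knn_predict(train_reg, (5.5, 5.0), k=2, regression=True))   # 11.0
```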
2. Show that the ERM problem of linear regression with respect to the absolute value loss function $\ell(h, (x, y)) = |h(x) - y|$ can be solved as a linear program. Show this by explicitly casting the ERM problem as a linear program.
The ERM problem associated with this loss is
$$\operatorname*{argmin}_{w} \frac{1}{m} \sum_{i=1}^m \left| \langle w, x_i \rangle - y_i \right|.$$
The absolute value makes this objective non-differentiable, which makes it difficult to minimize directly. We therefore introduce a vector $u = (u_1, \ldots, u_m)$ of slack variables, one per training example, and solve
$$\operatorname*{argmin}_{w, u} \frac{1}{m} \sum_{i=1}^m u_i \quad \text{subject to} \quad -u_i \le \langle w, x_i \rangle - y_i \le u_i, \quad i = 1, \ldots, m.$$
The objective and all constraints are linear in $(w, u)$, so this is a linear program; at the optimum $u_i = |\langle w, x_i \rangle - y_i|$, so its solution coincides with that of the original ERM problem.
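A minimal sketch of this casting using NumPy and scipy.optimize.linprog, with the variables stacked as $z = (w, u)$; the random data is for illustration only:

```python
# Cast ERM with absolute-value loss as a linear program: variables z = (w, u),
# objective (1/m) * sum(u_i), constraints  <w, x_i> - y_i <= u_i  and
# -(<w, x_i> - y_i) <= u_i.  The data here is random, for illustration only.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, d = 50, 3
X = rng.normal(size=(m, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=m)

c = np.concatenate([np.zeros(d), np.ones(m) / m])     # objective coefficients
A_ub = np.block([[X, -np.eye(m)],                     #  Xw - u <= y
                 [-X, -np.eye(m)]])                   # -Xw - u <= -y
b_ub = np.concatenate([y, -y])
bounds = [(None, None)] * d + [(0, None)] * m         # w free, u >= 0

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
w_hat = res.x[:d]
print(w_hat)   # close to w_true
```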
3. Let us modify the perceptron learning algorithm as follows: in the update step, instead of setting $w^{(t+1)} = w^{(t)} + y_i x_i$ every time there is a misclassification, we set $w^{(t+1)} = w^{(t)} + \eta y_i x_i$ for some fixed $\eta > 0$. Show that this modified perceptron will
• perform the same number of iterations as the original, and
• converge to a vector that points in the same direction as the output of the original perceptron.
If the data is separable, the perceptron stops when it has separated the examples. Let $w^*$ be a vector such that $y_i \langle w^*, x_i \rangle \ge 1$ for all $i \in [m]$.
After an iteration $t$ in which a mistake was made on example $(x_i, y_i)$,
$$\langle w^*, w^{(t+1)} \rangle - \langle w^*, w^{(t)} \rangle = \langle w^*, w^{(t+1)} - w^{(t)} \rangle = \langle w^*, (w^{(t)} + \eta y_i x_i) - w^{(t)} \rangle = \langle w^*, \eta y_i x_i \rangle = \eta y_i \langle w^*, x_i \rangle \ge \eta.$$
After $T$ iterations, summing the telescoping series (and using $w^{(1)} = 0$),
$$\langle w^*, w^{(T+1)} \rangle = \sum_{t=1}^{T} \left( \langle w^*, w^{(t+1)} \rangle - \langle w^*, w^{(t)} \rangle \right) \ge \eta T.$$
We will later refer to the above as equation (A).
We then consider the norms:
$$\|w^{(t+1)}\|^2 = \|w^{(t)} + \eta y_i x_i\|^2 = \|w^{(t)}\|^2 + 2\eta y_i \langle w^{(t)}, x_i \rangle + \eta^2 y_i^2 \|x_i\|^2.$$
If we let $R = \max_i \|x_i\|$ and consider the fact that the perceptron only updates when $y_i \langle w^{(t)}, x_i \rangle \le 0$ and $\eta$ is positive,
$$\|w^{(t+1)}\|^2 = \|w^{(t)}\|^2 + 2\eta y_i \langle w^{(t)}, x_i \rangle + \eta^2 y_i^2 \|x_i\|^2 \le \|w^{(t)}\|^2 + \eta^2 R^2.$$
We apply this after $T$ iterations (again using $w^{(1)} = 0$):
$$\|w^{(T+1)}\|^2 \le T \eta^2 R^2, \qquad \|w^{(T+1)}\| \le \sqrt{T}\, \eta R.$$
Combining this with equation (A) and letting $B = \|w^*\|$, just as in the proof of the upper bound for this algorithm with update step $w^{(t+1)} = w^{(t)} + y_i x_i$,
$$\frac{\langle w^{(T+1)}, w^* \rangle}{\|w^*\|\,\|w^{(T+1)}\|} \ge \frac{\eta T}{B \eta \sqrt{T} R} = \frac{\sqrt{T}}{BR}.$$
The Cauchy-Schwarz inequality states that the left-hand side is at most 1, so
$$1 \ge \frac{\langle w^{(T+1)}, w^* \rangle}{\|w^*\|\,\|w^{(T+1)}\|} \ge \frac{\sqrt{T}}{BR},$$
which gives
$$1 \ge \frac{\sqrt{T}}{BR},$$
or equivalently
$$T \le (RB)^2.$$
We have shown that even for values of $\eta \ne 1$, the number of iterations $T$ using the perceptron update step $w^{(t+1)} = w^{(t)} + \eta y_i x_i$ is still bounded by the same value, $T \le (RB)^2$, where $R = \max_i \|x_i\|$, $B = \min\{\|w\| : \forall i \in [m],\ y_i \langle w, x_i \rangle \ge 1\}$, and $\eta > 0$.
Furthermore, since the update condition $y_i \langle w^{(t)}, x_i \rangle \le 0$ is unchanged when $w^{(t)}$ is scaled by the positive constant $\eta$, the modified perceptron makes its updates on exactly the same sequence of examples as the original one. Summing the updates over the $T$ iterations (starting from $w^{(1)} = 0$), the final vector of the modified perceptron is
$$w_\eta^{(T+1)} = \sum_{t=1}^{T} \eta\, y_{i_t} x_{i_t} = \eta \sum_{t=1}^{T} y_{i_t} x_{i_t} = \eta\, w^{(T+1)},$$
where $w^{(T+1)}$ is the output of the original perceptron and $(x_{i_t}, y_{i_t})$ is the example misclassified at iteration $t$. Since $\eta > 0$, this is a positive scalar multiple of the final vector of the original perceptron, and therefore points in the same direction.
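A quick empirical sketch of both claims (the toy separable data set and the function name are illustrative):

```python
# Compare the standard perceptron (eta = 1) with the modified update w += eta*y*x
# on a small linearly separable data set: both make the same number of updates and
# their final weight vectors are positive scalar multiples of each other.
# The toy data below is made up for illustration; separability is assumed.
import numpy as np

def perceptron(X, y, eta):
    w = np.zeros(X.shape[1])
    updates = 0
    while True:
        mistake = False
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:      # misclassified (or on the boundary)
                w += eta * yi * xi
                updates += 1
                mistake = True
        if not mistake:
            return w, updates

X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -2.0], [-2.0, 1.0]])
y = np.array([1, 1, -1, -1])

w1, t1 = perceptron(X, y, eta=1.0)
w2, t2 = perceptron(X, y, eta=0.3)
print(t1, t2)       # same number of updates
print(w2 / w1)      # constant ratio (= 0.3): same direction
```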