I: Logistic Regression and Naïve Bayes 15 points
Suppose in a binary classification problem, the input variable x is n-dimensional and the output is
a binary class label y ∈ Y = {0, 1}. In this situation, there is an interesting connection between
two learners: logistic regression and the naïve Bayes classifier.
(a) Write down the expressions for the class conditional probability for each class, i.e., P(y = 1|x)
and P(y = 0|x), for logistic regression.
P(y = 1|x) = 1/(1 + e^(−⟨w,x⟩))
P(y = 0|x) = e^(−⟨w,x⟩)/(1 + e^(−⟨w,x⟩))
(b) Using Bayes’ rule, derive the posterior probabilities for each class, i.e., P(y = 1|x) and P(y =
0|x), for naïve Bayes.
P(y = 1|x) = P(x|y = 1)P(y = 1)/P(x)
P(y = 0|x) = P(x|y = 0)P(y = 0)/P(x)
(c) Assuming a Gaussian likelihood function in each of the n dimensions, write down the full
likelihood function f(x|d) for naïve Bayes.
f(x|d) = ∏_{i=1}^{n} (1/√(2πσ_i²)) e^(−(x_i − µ_{d,i})²/(2σ_i²))
where µ_{d,i} is the mean for class d in dimension i, σ_i² is the variance in dimension i, and
the product over the n dimensions follows from the naïve (conditional independence) assumption.
(d) Assuming a uniform prior on the two classes and using the results from parts (b) and (c)
above, derive a full expression for P(y = 1|x) for naïve Bayes.
P(y = 1|x) = P(x|y = 1)P(y = 1)/P(x) = f(x|1)/(f(x|1) + f(x|0))
since the uniform prior P(y = 1) = P(y = 0) = 1/2 cancels out of P(x) = (1/2)(f(x|1) + f(x|0)).
Substituting the likelihood from part (c), where the normalization constants also cancel:
P(y = 1|x) = ∏_{i=1}^{n} e^(−(x_i − µ_{1,i})²/(2σ_i²)) / (∏_{i=1}^{n} e^(−(x_i − µ_{1,i})²/(2σ_i²)) + ∏_{i=1}^{n} e^(−(x_i − µ_{0,i})²/(2σ_i²)))
(e) Show that with appropriate manipulation and parameterization, P(y = 1|x) in naïve Bayes
from part (d) is equivalent to P(y = 1|x) for logistic regression in part (a).
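Part (e) can be carried out as follows; this sketch assumes the class-conditional Gaussians share the variance σ_i² in each dimension (only the means µ_{1,i}, µ_{0,i} differ between classes), which is what makes the log-likelihood ratio linear in x:

```latex
\begin{align*}
P(y=1\mid x)
  &= \frac{f(x\mid 1)}{f(x\mid 1)+f(x\mid 0)}
   = \frac{1}{1+\exp\big(\ln f(x\mid 0)-\ln f(x\mid 1)\big)},\\
\ln f(x\mid 0)-\ln f(x\mid 1)
  &= \sum_{i=1}^{n}\frac{(x_i-\mu_{1,i})^2-(x_i-\mu_{0,i})^2}{2\sigma_i^2}
   = \sum_{i=1}^{n}\left(\frac{\mu_{0,i}-\mu_{1,i}}{\sigma_i^2}\,x_i
     +\frac{\mu_{1,i}^2-\mu_{0,i}^2}{2\sigma_i^2}\right).
\end{align*}
% The quadratic terms x_i^2 cancel, so the exponent is affine in x.
% Setting w_i = (\mu_{1,i}-\mu_{0,i})/\sigma_i^2 for i \in [n], with the constant
% term absorbed into a bias coordinate of w, recovers
%   P(y=1\mid x) = 1/(1+e^{-\langle w,x\rangle}),
% the logistic regression form from part (a).
```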
II: Boosting Confidence 10 points
Let A be an algorithm for which there exists a constant δ0 ∈ (0, 1) and a function mH : (0, 1) → ℕ
such that for every ε ∈ (0, 1), if m ≥ mH(ε), then for every distribution D it holds that with
probability at least 1 − δ0, LD(A(S)) ≤ min_{h∈H} LD(h) + ε.
Now, consider the following steps:
(i) Divide the data into k + 1 chunks, where each of the first k chunks consists of mH(ε) examples.
(ii) Train A on each of the first k chunks.
(iii) Use the last chunk to choose from the k hypotheses that A generated from the k chunks.
Show that this 3-step procedure of using A for agnostic PAC-learning of H has a sample complexity:
mH(ε, δ) ≤ k · mH(ε) + ⌈2 log(4k/δ)/ε²⌉
where k = ⌈log(δ)/log(δ0)⌉
Hint: Use Corollary 4.6 from the textbook.
If we divide the data into k + 1 chunks of size mH(ε), training A on each of the first k chunks,
we require k · mH(ε) samples to ensure we obtain k independent hypotheses that are not based on
the same sample set.
With |H| hypotheses, we can apply Corollary 4.6 from the textbook, which states that for a
finite hypothesis class H, domain Z, and a loss function l : H × Z → [0, 1], the class is PAC
learnable with sample complexity mH(ε, δ) ≤ ⌈2 log(2|H|/δ)/ε²⌉.
If we apply this to the size-k set of hypotheses generated by our k chunks, using confidence
parameter δ/2 for the selection step, we determine:
mH(ε, δ/2) ≤ ⌈2 log(2k/(δ/2))/ε²⌉ = ⌈2 log(4k/δ)/ε²⌉
In order to obtain a fair hypothesis independent of the training samples for any of the chunks,
we can upper bound the sample complexity of this system by the samples used for training plus
the samples used for validation:
mH(ε, δ) ≤ k · mH(ε) + ⌈2 log(4k/δ)/ε²⌉
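The three-step procedure can be sketched in code. This is an illustration only; the function name `chunk_train_validate` and the toy base learner are my own, not from the exercise.

```python
import random

def chunk_train_validate(samples, k, base_learner, loss):
    """Sketch of the confidence-boosting procedure: split the data into
    k training chunks plus one validation chunk, run the base learner on
    each training chunk, and return the hypothesis with the lowest
    validation loss."""
    chunk = len(samples) // (k + 1)
    hypotheses = [base_learner(samples[i*chunk:(i+1)*chunk]) for i in range(k)]
    validation = samples[k*chunk:]
    def val_loss(h):
        return sum(loss(h, z) for z in validation) / len(validation)
    return min(hypotheses, key=val_loss)

# Toy instance: learn a constant predictor for noisy labels in {0, 1}.
random.seed(0)
data = [(None, 1 if random.random() < 0.8 else 0) for _ in range(600)]
learner = lambda S: round(sum(y for _, y in S) / len(S))   # majority vote
zero_one = lambda h, z: int(h != z[1])
best = chunk_train_validate(data, k=5, base_learner=learner, loss=zero_one)
print(best)  # 1
```

The validation chunk is disjoint from the training chunks, which is exactly what lets Corollary 4.6 be applied to the k candidate hypotheses as a fixed finite class.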
III: AdaBoost Weights 10 points
In the lectures, we discussed (informally) how the weighting mechanism of AdaBoost helps the
learner to focus on the problematic examples in the next iteration. Here, you have to formally
argue for it:
Show that the error of h_t w.r.t. D^(t+1) is exactly 0.5. That is, show that
∀ t ∈ [T]:  Σ_{i∈[m]} D_i^(t+1) 1[y_i ≠ h_t(x_i)] = 0.5
The update step for adaptive boosting applies the following:
D_i^(t+1) = D_i^(t) e^(−w_t y_i h_t(x_i)) / Σ_{j=1}^{m} D_j^(t) e^(−w_t y_j h_t(x_j))
where w_t = (1/2) log(1/ε_t − 1) and ε_t = Σ_{i∈[m]} D_i^(t) 1[y_i ≠ h_t(x_i)],
so we must show:
Σ_{i∈[m]} 1[y_i ≠ h_t(x_i)] · D_i^(t) e^(−w_t y_i h_t(x_i)) / (Σ_{j=1}^{m} D_j^(t) e^(−w_t y_j h_t(x_j))) = 0.5
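The remaining computation is a short calculation (a sketch, writing ε_t for the weighted error of h_t):

```latex
\begin{align*}
\sum_{i:\,y_i\neq h_t(x_i)} D^{(t)}_i e^{-w_t y_i h_t(x_i)}
  &= \epsilon_t e^{w_t},
&\sum_{j=1}^{m} D^{(t)}_j e^{-w_t y_j h_t(x_j)}
  &= \epsilon_t e^{w_t} + (1-\epsilon_t)e^{-w_t},
\end{align*}
% since y_i h_t(x_i) = -1 exactly on the misclassified points and +1 otherwise.
% From w_t = \tfrac{1}{2}\log(1/\epsilon_t - 1) we get
% e^{w_t} = \sqrt{(1-\epsilon_t)/\epsilon_t}, hence
\begin{align*}
\epsilon_t e^{w_t} = \sqrt{\epsilon_t(1-\epsilon_t)},
\qquad
(1-\epsilon_t)e^{-w_t} = \sqrt{\epsilon_t(1-\epsilon_t)},
\end{align*}
% and the ratio is
% \sqrt{\epsilon_t(1-\epsilon_t)} \big/ \big(2\sqrt{\epsilon_t(1-\epsilon_t)}\big) = 0.5.
```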
IV: Failure of Cross-Validation 10 points
Cross-validation works well in practice, but there are some pathological cases where it might fail.
Suppose that the label is chosen at random according to P[y = 1] = P[y = 0] = 1/2. Consider a
learning algorithm that outputs the constant predictor h(x) = 1 if the number of 1s in the training
set labels is odd, and h(x) = 0 otherwise. Prove that the difference between the leave-one-out
estimate and the true error in such a case is always 1/2.
The true error is 1/2 because the algorithm always outputs a constant predictor, h(x) = 1 or
h(x) = 0, while the labelling function is P(1) = P(0) = 1/2; any constant hypothesis therefore
errs on exactly half of the distribution. For the leave-one-out estimate, let s be the number
of 1s among all m training labels. When example i is held out, the predictor outputs 1 iff the
remaining labels contain an odd number of 1s. If s is even: for y_i = 1 the remaining s − 1
ones are odd in number, so h predicts 1, correctly; for y_i = 0 the remaining s ones are even
in number, so h predicts 0, correctly. Every fold is then correct and the leave-one-out
estimate is 0. If s is odd, the same case analysis flips: every held-out example is
misclassified, and the estimate is 1. In either case the difference between the leave-one-out
estimate and the true error is |0 − 1/2| = |1 − 1/2| = 1/2. Note also why cross-validation
fails here: simply adding or removing a single training sample results in a different
hypothesis.
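The parity argument can be checked exhaustively by brute force; the helper name `loo_minus_true` is my own.

```python
import itertools

def loo_minus_true(labels):
    """Leave-one-out error of the parity-based constant predictor, minus
    the true error. The predictor outputs 1 iff the training labels
    contain an odd number of 1s; its true error is always 1/2 because
    P[y = 1] = P[y = 0] = 1/2 and the predictor is constant."""
    m = len(labels)
    errors = 0
    for i in range(m):
        train = labels[:i] + labels[i+1:]          # leave example i out
        h = 1 if sum(train) % 2 == 1 else 0        # constant hypothesis
        errors += int(h != labels[i])
    return errors / m - 0.5

# The gap is exactly 1/2 in absolute value for every possible label vector.
for labels in itertools.product([0, 1], repeat=5):
    assert abs(loo_minus_true(list(labels))) == 0.5
print("gap is 0.5 for all 2^5 label vectors")
```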
V: Local Minima of 0-1 Loss 10 points
Here, you are being asked to construct an example where 0-1 loss suffers from local minima.
Construct a training sample S ∈ (R^d × {±1})^m for which there exists a vector w and some positive
scalar ε such that:
(a) ∀ w′ such that ||w − w′|| ≤ ε we have LS(w) ≤ LS(w′) (with 0-1 loss). This means that w is
a local minimum of LS.
(b) ∃ w∗ such that LS(w∗) < LS(w). This means that w is not a global minimum of LS.
A function can have non-global local minima only if it is non-convex. For example, consider the
hypothesis class of 1-dimensional halfspaces (threshold predictors) trying to classify points
whose labels follow:
f(x) = 1 if a < x < b
f(x) = 0 otherwise
A more specific example is a uniform distribution of points on [0, 1] with labelling
function f(x) = 1 if 1/6 < x < 3/6, else 0. Given a sufficiently large representative sample
drawn from this distribution, no single threshold classifies all points correctly; the
empirical 0-1 loss, as a function of the threshold, is piecewise constant with several plateaus,
so it has local minima that are not global.
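The 1-d threshold example can be made concrete. The particular six points below are my own choice, but they are consistent with the labelling f(x) = 1 iff 1/6 < x < 3/6 described above; because the 0-1 loss is piecewise constant in the threshold, any point inside a constant plateau is a local minimum.

```python
def zero_one_loss(w, sample):
    """Empirical 0-1 loss of the threshold (1-d halfspace) predictor
    h_w(x) = 1 iff x > w, on a labeled sample."""
    return sum(int((x > w) != bool(y)) for x, y in sample) / len(sample)

# Points consistent with f(x) = 1 iff 1/6 < x < 3/6.
S = [(0.10, 0), (0.20, 1), (0.30, 1), (0.40, 1), (0.60, 0), (0.70, 0)]

w_local, w_star, eps = 0.90, 0.15, 0.05
# Every w' within eps of w_local lies in the same constant plateau of the
# loss, so w_local is a local minimum of L_S ...
assert all(zero_one_loss(w_local + d, S) >= zero_one_loss(w_local, S)
           for d in [-eps, -eps/2, 0.0, eps/2, eps])
# ... but it is not a global minimum:
print(zero_one_loss(w_local, S), zero_one_loss(w_star, S))  # 0.5 0.333...
```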
VI: Bounded Logistic Regression is Lipschitz and Smooth 20 points
Show that logistic regression with the bounded hypothesis class H = X = {x ∈ R^d : ||x|| ≤ B}
(where B is a positive scalar), the label set Y = {±1}, and the loss function
l(w, (x, y)) = log(1 + e^(−y⟨w,x⟩)), is convex, Lipschitz, and smooth. For Lipschitzness and
smoothness, specify the parameters as well.
By Claim 12.4 in the textbook, for a function f : R^d → R that can be written as
f(w) = g(⟨w, x⟩ + y) for some x ∈ R^d, y ∈ R, and g : R → R: if g is convex, then f is convex.
In this case, g(a) = log(1 + e^a) and f(w) = g(−y⟨w, x⟩). Since g″(a) = e^a/(1 + e^a)² ≥ 0,
g is convex, so the loss is convex.
In order to show that the function is Lipschitz, it suffices to bound the norm of the gradient:
if ||∇f(w)|| ≤ ρ for all w, then ∀ w1, w2, |f(w1) − f(w2)| ≤ ρ||w1 − w2||. Here
∇_w l(w, (x, y)) = −y x · e^(−y⟨w,x⟩)/(1 + e^(−y⟨w,x⟩))
Since e^(−y⟨w,x⟩)/(1 + e^(−y⟨w,x⟩)) ≤ 1 and ||x|| ≤ B, we get ||∇_w l(w, (x, y))|| ≤ ||x|| ≤ B,
so the loss is B-Lipschitz.
For smoothness, note that g(a) = log(1 + e^a) satisfies g″(a) = e^a/(1 + e^a)² ≤ 1/4, so g is
(1/4)-smooth. Composing a β-smooth scalar function with the linear map w ↦ −y⟨w, x⟩ yields a
β||x||²-smooth function of w, so the loss is (||x||²/4)-smooth, and hence (B²/4)-smooth.
VII: Unsupervised Learning without Uniform Convergence 25 points
Suppose our hypothesis class H is B, the unit ball in R^d, the domain is Z = B × {0, 1}^d, and
the loss function l : Z × H → R is defined as l(w, (x, a)) = Σ_{i∈[d]} a_i(x_i − w_i)². This problem
corresponds to an unsupervised learning task, meaning that we do not try to predict the label
of x. Instead, we try to find the center of mass of the distribution over B. The vector a plays an
interesting role: in each pair (x, a), it indicates which features of x are turned on (value is 1) and
which ones are turned off (value is 0). A hypothesis is a vector w representing the center of mass
of the distribution, and the loss function is the squared Euclidean distance between x and w, but
only with respect to the elements of x that are turned on.
(i) Show that this problem is learnable using the RLM rule with a sample complexity that does
not depend on d.
Applying the RLM rule means choosing argmin_w (R(w) + LS(w)), or in this case, with Tikhonov
regularization R(w) = λ||w||²:
argmin_w (λ||w||² + (1/m) Σ_{j∈[m]} Σ_{i∈[d]} a_i^(j)(x_i^(j) − w_i)²)
The loss is convex in w, and over the unit ball its gradient satisfies
||∇_w l(w, (x, a))|| = 2 (Σ_{i∈[d]} a_i(x_i − w_i)²)^(1/2) ≤ 2||x − w|| ≤ 4,
so the problem is convex, Lipschitz, and bounded with constants that do not depend on d.
Hence the RLM rule learns it with sample complexity depending only on the Lipschitz constant
and the radius of the hypothesis class, regardless of the number of dimensions d.
(ii) Consider a distribution D over Z as follows: x is fixed to be some x0, and each element of a is
sampled to be either 1 or 0 with equal probability.
(a) Show that if d ≥ 2^m, then there is a high probability of sampling a set of examples such
that there exists some j ∈ [d] for which aj = 0 for all the examples in the training set.
For a fixed coordinate j, the probability that aj = 0 in all m examples is (0.5)^m.
Since the d coordinates are sampled independently, the probability that some j ∈ [d]
is 0 in every example is 1 − (1 − (0.5)^m)^d ≥ 1 − e^(−d·(0.5)^m). When d ≥ 2^m this
is at least 1 − 1/e, and it tends to 1 as d grows further while m stays fixed, so such
a coordinate exists with high probability.
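The probability 1 − (1 − (0.5)^m)^d can be evaluated directly and compared with a Monte Carlo estimate; the helper name below is mine.

```python
import random

def p_some_coord_all_zero(m, d):
    """Exact probability that some coordinate j in [d] has a_j = 0 in all
    m examples, when each a_j is an independent fair coin flip."""
    return 1 - (1 - 0.5 ** m) ** d

# With d = 2^m the probability is already about 1 - 1/e ~ 0.63, and it
# tends to 1 as d grows further while m stays fixed.
m = 8
print(round(p_some_coord_all_zero(m, 2 ** m), 3))       # ~0.63
print(round(p_some_coord_all_zero(m, 10 * 2 ** m), 3))  # ~1.0

# Monte Carlo check of the exact formula.
random.seed(0)
d, trials = 2 ** m, 2000
hits = sum(
    any(all(random.random() < 0.5 for _ in range(m)) for _ in range(d))
    for _ in range(trials)
)
assert abs(hits / trials - p_some_coord_all_zero(m, d)) < 0.05
```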
(b) Show that such a sample is not ε-representative.
A training set is ε-representative with respect to a domain, hypothesis class, loss
function, and distribution if ∀h ∈ H, |LS(h) − LD(h)| ≤ ε. Suppose coordinate j satisfies
aj = 0 in every training example, and for concreteness take x0 = 0. Consider the hypothesis
w = e_j, the j-th standard basis vector, which lies in the unit ball. Its empirical loss is
LS(w) = 0, since the only nonzero coordinate of w is never turned on in the sample, while
its true loss is LD(w) = E[aj] · (0 − 1)² = 1/2. Hence |LS(w) − LD(w)| = 1/2, so the sample
is not ε-representative for any ε < 1/2.
(c) Conclude that the sample complexity of uniform convergence must grow with d.
Combining (a) and (b): whenever d ≥ 2^m, a sample of size m fails, with high probability,
to be ε-representative for any ε < 1/2. Uniform convergence requires that a sample be
ε-representative with high probability for every distribution, so it can only hold when
2^m > d, i.e. when m grows at least like log₂(d). The sample complexity of uniform
convergence must therefore grow with d.
(iii) If d is taken to infinity, show that we obtain a problem that is learnable but for which the
uniform convergence property does not hold.
If we take d to infinity, part (ii) shows that uniform convergence fails for every finite
sample size m: there is always a distribution of the above form with d ≥ 2^m for which the
sample is, with high probability, not ε-representative. Nevertheless, part (i) shows that
the RLM rule learns the problem with sample complexity independent of d. Hence in the limit
we obtain a problem that is learnable, but for which the uniform convergence property does
not hold.