CS 446 / ECE 449 — Homework 3

Version 1.0
Instructions.

• Everyone must submit individually on Gradescope under hw3 and hw3code.
• The “written” submission at hw3 must be typed and submitted in any format Gradescope accepts
(to be safe, submit a PDF). You may use LaTeX, Markdown, Google Docs, MS Word, whatever you like;
but it must be typed!
• When submitting at hw3, Gradescope will ask you to mark out boxes around each of your answers;
please do this precisely!

• When submitting to hw3code, only upload hw3.py and hw3_utils.py. Additional files will be ignored.
Version History.
1. Initial version.
1. Ensemble Methods
In this question, you will implement several ensemble methods including Bagging and AdaBoost on a
simple dataset. The methods will learn a binary classification of 2D datapoints in $[-1, 1]^2$.
We have provided a few helper functions in hw3_utils.py.
• visualization() visualizes a dataset and the ensemble’s predictions in 2D space.
• get_dataset_fixed() returns a simple dataset with pre-defined datapoints. Please use this dataset
for the plot.
• get_dataset_random() returns a simple dataset by random construction. You may play with it and
test your algorithm.
You will need to implement the functions and classes defined in hw3.py. When uploading to Gradescope,
please pack the two files hw3_utils.py and hw3.py (without a containing folder) together into one zip file.
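For orientation, here is a minimal usage sketch of these helpers. The exact signatures are defined in hw3_utils.py, so the argument lists below (in particular, passing the data and a prediction function to visualization()) are assumptions for illustration, not the required interface.

```python
# Hedged usage sketch -- argument lists are assumptions, not the actual
# hw3_utils.py interface; consult the template for the real signatures.
import numpy as np
from hw3_utils import get_dataset_fixed, get_dataset_random, visualization

X, y = get_dataset_fixed()             # assumed: 2D points in [-1, 1]^2 with +/-1 labels
X_rand, y_rand = get_dataset_random()  # assumed: randomly constructed dataset for testing

def constant_predictor(points):
    """Placeholder predictor that labels every point +1."""
    return np.ones(len(points))

# Assumed call pattern: visualize the dataset together with a predictor's output.
visualization(X, y, constant_predictor)
```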
(a) Weak learner
To begin with, you will implement a weak learner to do the binary classification.
A decision stump is a one-level decision tree. It looks at a single feature, and then makes a classification
by thresholding on this feature. Given a dataset with positive weights assigned to each datapoint, we
can find a stump that minimizes the weighted error:
\[
L = \sum_{i=1}^{n} w^{(i)} \cdot \mathbf{1}\big(y^{(i)} \neq \hat{y}^{(i)}\big)
\]
Here $w^{(i)}$ is the weight of the $i$-th datapoint, and the prediction $\hat{y}^{(i)}$ is given by thresholding on the
$k$-th feature of datapoint $x^{(i)}$:

\[
\hat{y}^{(i)} =
\begin{cases}
s, & \text{if } x^{(i)}_k \geq t \\
-s, & \text{otherwise}
\end{cases}
\]
For the 2D dataset we have, the parameters of this stump are the sign $s \in \{+1, -1\}$, the feature
dimension $k \in \{1, 2\}$, and the threshold $t \in [-1, 1]$. In this question, your task is to find the best
stump given the dataset and weights.
Learning a decision stump requires learning a threshold in each dimension and then picking the best
one. To learn a threshold in a dimension, you may simply sort the data in the chosen dimension, and
calculate the loss on each candidate threshold. Candidates are midpoints between one point and the
next, as well as the boundaries of our range of inputs.
Please implement the Stump class in hw3.py. You may define your own functions inside the class,
but do not change the interfaces of __init__() and predict(). Please read the template file for further
instructions.
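As a reference, here is a minimal sketch of the thresholding procedure described above. It is not the required implementation: the actual __init__() and predict() interfaces are fixed in the hw3.py template, so the constructor arguments used here (X, y, w) are assumptions for illustration only.

```python
import numpy as np

class StumpSketch:
    """Illustrative decision stump; the graded Stump class in hw3.py has a fixed interface."""

    def __init__(self, X, y, w):
        # X: (n, 2) datapoints in [-1, 1]^2, y: (n,) labels in {+1, -1}, w: (n,) positive weights.
        n, d = X.shape
        best = (np.inf, 1, 0, 0.0)  # (weighted error, sign s, dimension k, threshold t)
        for k in range(d):
            xs = np.sort(X[:, k])
            # Candidate thresholds: midpoints between consecutive points plus the range boundaries.
            candidates = np.concatenate(([-1.0], (xs[:-1] + xs[1:]) / 2.0, [1.0]))
            for t in candidates:
                for s in (+1, -1):
                    pred = np.where(X[:, k] >= t, s, -s)
                    err = np.sum(w * (pred != y))
                    if err < best[0]:
                        best = (err, s, k, t)
        _, self.s, self.k, self.t = best

    def predict(self, X):
        # Threshold on the chosen feature, returning s above the threshold and -s below it.
        return np.where(X[:, self.k] >= self.t, self.s, -self.s)
```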
(b) Weak learner’s predictions
Now test your implementation of Stump on the dataset given by get_dataset_fixed(). Suppose all
the datapoints are equally weighted. Please answer the following questions in your written submission:
• What is your decision function?
• How many datapoints are mis-classified?
• Using the helper function visualization(), include a visualization of your stump’s predictions.
(c) Bagging
As we have learned in class, we can utilize ensemble methods to create a strong learner from the
weak learner we implemented in part (a). Please complete bagging() in hw3.py. This function should take
the whole dataset as input, and sample a subset from it in each step, to build a list of different weak
learners.
Please do not change the random sampling of sample indices, and use the default random seed=0,
so that your code can behave consistently in the autograder.
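As a rough guide, here is a minimal sketch of such a bagging loop. It is only an illustration under assumptions: the template fixes the sampling code and the real signature of bagging(), so the parameter names below (learner_cls, n_learners, n_samples, random_seed) are hypothetical.

```python
import numpy as np

def bagging_sketch(X, y, learner_cls, n_learners=20, n_samples=15, random_seed=0):
    """Illustrative bagging loop; the graded bagging() in hw3.py fixes its own sampling code."""
    rng = np.random.RandomState(random_seed)
    n = len(X)
    learners = []
    for _ in range(n_learners):
        # Draw a random subset of indices for this round (with replacement here).
        idx = rng.choice(n, size=n_samples, replace=True)
        w = np.ones(n_samples) / n_samples                # equal weights within the sampled subset
        learners.append(learner_cls(X[idx], y[idx], w))   # e.g. a stump fit on the subset
    return learners
```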
(d) AdaBoost
Now please implement the AdaBoost algorithm. As we have learned in class, in each step of AdaBoost, it
• Finds the optimal weak learner according to current data weights
• Acquires the weak learner’s predictions
• Calculates the weight for this weak learner
• Updates the weights for datapoints
Complete adaboost() in hw3.py. It should return a series of weak learners and their weights.
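For concreteness, the sketch below implements the four steps above for ±1 labels, using the standard AdaBoost learner weight $\alpha = \frac{1}{2}\ln\frac{1-\epsilon}{\epsilon}$ and the multiplicative data-weight update. The graded adaboost() in hw3.py fixes the real signature, so the parameter names here (learner_cls, n_learners) are assumptions.

```python
import numpy as np

def adaboost_sketch(X, y, learner_cls, n_learners=20):
    """Illustrative AdaBoost loop for labels in {+1, -1}; hw3.py fixes the graded interface."""
    n = len(X)
    w = np.ones(n) / n                                     # start from uniform data weights
    learners, alphas = [], []
    for _ in range(n_learners):
        h = learner_cls(X, y, w)                           # 1. optimal weak learner under current weights
        pred = h.predict(X)                                # 2. the weak learner's predictions
        eps = np.sum(w * (pred != y))                      #    its weighted error
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))  # 3. weight of this weak learner
        w = w * np.exp(-alpha * y * pred)                  # 4. reweight the datapoints
        w = w / np.sum(w)                                  #    and renormalize
        learners.append(h)
        alphas.append(alpha)
    return learners, alphas
```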
(e) Visualization
Run your Bagging and AdaBoost algorithms on the fixed dataset given by get_dataset_fixed().
Set the number of weak classifiers to 20, and for Bagging, set the number of samples to 15 for learning
each classifier. Please answer the following questions in your written submission (a sketch of one possible prediction aggregation follows the list):
• Are they performing better than the individual weak learner in (b)?
• Include visualizations for both algorithms in your written submission.
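As referenced above, here is one hedged way the ensembles' predictions could be aggregated for the visualization: an unweighted majority vote for Bagging and an $\alpha$-weighted vote for AdaBoost. The aggregation actually expected by the autograder is specified in the template; treat this only as an illustration.

```python
import numpy as np

def bagging_predict(learners, X):
    """Unweighted majority vote over the bagged weak learners (labels in {+1, -1})."""
    votes = np.sum([h.predict(X) for h in learners], axis=0)
    return np.sign(votes)

def adaboost_predict(learners, alphas, X):
    """Weighted vote: sign of the alpha-weighted sum of weak-learner predictions."""
    votes = np.sum([a * h.predict(X) for h, a in zip(learners, alphas)], axis=0)
    return np.sign(votes)
```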
Solution.
2. Learning Theory.
(a) VC Dimensions. In this problem, we’ll explore VC dimensions! First, a few definitions that we
will use in this problem. For a feature space $\mathcal{X}$, let $\mathcal{F}$ be a set of binary classifiers of the form
$f : \mathcal{X} \to \{0, 1\}$. $\mathcal{F}$ is said to shatter a set of $k$ distinct points $\{x^{(i)}\}_{i=1}^{k} \subset \mathcal{X}$ if for each set of label
assignments $(y^{(i)})_{i=1}^{k} \in \{0, 1\}^{k}$ to these points, there is an $f \in \mathcal{F}$ which makes no mistakes when
classifying these $k$ points.
The VC Dimension of $\mathcal{F}$ is the largest non-negative integer $k$ such that there is a set of $k$ points
that $\mathcal{F}$ can shatter. Even more formally, let $VC(\mathcal{F})$ denote the VC Dimension of $\mathcal{F}$. It can be defined
as:
\[
VC(\mathcal{F}) = \max_{k} \; k \quad \text{s.t.} \quad \exists \{x^{(i)}\}_{i=1}^{k} \subset \mathcal{X},\; \forall (y^{(i)})_{i=1}^{k} \in \{0, 1\}^{k},\; \exists f \in \mathcal{F},\; \forall i : f(x^{(i)}) = y^{(i)}
\]
The intuition here is that VC dimension captures some kind of complexity or capacity of a set of
functions F.
Note: The straightforward proof strategy to show that the VC dimension of a set of classifiers is $k$
is to first exhibit a set of $k$ points that is shattered by the set of classifiers. Then, show that
no set of $k + 1$ points can be shattered. You can do that by finding an assignment of labels
which cannot be correctly classified using $\mathcal{F}$.
Notation: We denote $\mathbf{1}_{\text{condition}}(\cdot) : \mathcal{X} \to \{0, 1\}$ to be the indicator function, i.e., $\mathbf{1}_{\text{condition}}(x) = 1$ if
the condition is true for $x$ and $\mathbf{1}_{\text{condition}}(x) = 0$ otherwise.
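As a quick illustration of the definition (not part of the graded questions): shattering $k = 2$ distinct points $\{x^{(1)}, x^{(2)}\}$ means every one of the $2^2 = 4$ label assignments must be realized by some classifier in $\mathcal{F}$:
\[
\forall (y^{(1)}, y^{(2)}) \in \{(0,0),\,(0,1),\,(1,0),\,(1,1)\} \;\; \exists f \in \mathcal{F} : \; f(x^{(1)}) = y^{(1)},\; f(x^{(2)}) = y^{(2)}.
\]
If even one of the four patterns cannot be produced, that particular pair of points is not shattered.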
We will now find the VC dimension of some basic classifiers.
i. 1D Affine Classifier
Let’s start with a fairly simple problem. Consider $\mathcal{X} = \mathbb{R}$ and $\mathcal{Y} = \{0, 1\}$. Affine classifiers are of
the form:
\[
\mathcal{F}_{\text{affine}} = \{\mathbf{1}_{wx + b \geq 0}(\cdot) : \mathcal{X} \to \{0, 1\} \mid w, b \in \mathbb{R}\}.
\]
Determine $VC(\mathcal{F}_{\text{affine}})$ and prove your result.
Hint: Try less than a handful of points.
ii. General Affine Classifier
We will now go one step further. Consider $\mathcal{X} = \mathbb{R}^{k}$ for some dimensionality $k \geq 1$, and $\mathcal{Y} = \{0, 1\}$.
Affine classifiers in $k$ dimensions are of the form
\[
\mathcal{F}^{k}_{\text{affine}} = \{\mathbf{1}_{w^\top x + b \geq 0}(\cdot) : \mathcal{X} \to \{0, 1\} \mid w \in \mathbb{R}^{k},\, b \in \mathbb{R}\}
\]
Determine $VC(\mathcal{F}^{k}_{\text{affine}})$ and prove your result.
Hint: Note that $w^\top x + b$ can be written as $[x^\top \;\; 1] \begin{bmatrix} w \\ b \end{bmatrix}$. Moreover, consider putting all data
points into a matrix, e.g.,
\[
X = \begin{bmatrix}
(x^{(1)})^\top & 1 \\
(x^{(2)})^\top & 1 \\
\vdots & \vdots
\end{bmatrix}.
\]
iii. Decision Trees
Consider $\mathcal{X} = \mathbb{R}^{k}$ for some dimensionality $k \geq 1$, and $\mathcal{Y} = \{0, 1\}$. Show that the VC dimension
of the axis-aligned (coordinate-splits) decision trees is infinite.
Hint: Consider an arbitrary dataset, and show that a decision tree can be constructed to exactly
fit that dataset.
(b) Rademacher Complexity. Recall from class that the generalization error bound scales with the
complexity of the function class F, which, in turn, can be measured via Rademacher complexity. In
this question we will compute the Rademacher complexity of linear functions step by step. Let’s
consider a dataset $\{x^{(i)}\}_{i=1}^{n} \subset \mathbb{R}^{k}$ with the norm bounded by $\|x^{(i)}\|_2 \leq R$ and the set of linear
classifiers $\mathcal{F} = \{x \mapsto w^\top x \mid w \in \mathbb{R}^{k},\, \|w\|_2 \leq W\}$.
i. For a fixed sign vector $\epsilon = (\epsilon_1, \ldots, \epsilon_n) \in \{\pm 1\}^{n}$ show that:
\[
\max_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \epsilon_i f(x^{(i)}) \leq W \|\bar{x}\|_2
\]
where $\bar{x}$ is defined as $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x^{(i)} \epsilon_i$.
Hint: Cauchy-Schwarz inequality.
ii. Assume $\epsilon_i$ is distributed i.i.d. according to $\Pr[\epsilon_i = +1] = \Pr[\epsilon_i = -1] = 1/2$. Show that
\[
\mathbb{E}\left[\|\bar{x}\|_2^2\right] \leq \frac{R^2}{n}
\]
iii. Assume $\epsilon_i$ follows the same distribution as in the previous problem. Recall the definition of Rademacher
complexity:
\[
\mathrm{Rad}(\mathcal{F}) = \mathbb{E}\left[\max_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \epsilon_i f(x^{(i)})\right]
\]
Show that the Rademacher complexity of the set of linear classifiers is:
\[
\mathrm{Rad}(\mathcal{F}) \leq \frac{RW}{\sqrt{n}}
\]
Hint: Jensen’s inequality.
Solution.
