CSC2515
Homework 2
Submission: You need to submit three files through MarkUs (https://markus.teach.cs.toronto.edu/csc2515-2019-09):
• Your answers to Questions 1, 2, and 3, as a PDF file titled hw2_writeup.pdf. You can
produce the file however you like (e.g. LaTeX, Microsoft Word, scanner), as long as it is
readable.
• Your code for Question 1, as the Python file q1.py.
• Your completed code for Question 2, as the Python file q2.py.
If you wish to write the code in a Jupyter notebook instead, then please submit a PDF printout of
the notebook, rather than the notebook itself.
Late Submission: 10% of the marks will be deducted for each day late, up to a maximum of 3
days. After that, no submissions will be accepted.
Computing: To install Python and required libraries, see the instructions on the course web page.
Collaboration: Homeworks are individual work. See the course web page for detailed policies.
1. [3pts] Robust Regression. One problem with linear regression using squared error loss
is that it can be sensitive to outliers. Another loss function we could use is the Huber loss,
parameterized by a hyperparameter δ:
$$L_\delta(y, t) = H_\delta(y - t)$$
$$H_\delta(a) = \begin{cases} \frac{1}{2}a^2 & \text{if } |a| \le \delta \\ \delta\left(|a| - \frac{1}{2}\delta\right) & \text{if } |a| > \delta \end{cases}$$
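For concreteness, here is a minimal NumPy sketch of evaluating $H_\delta$ elementwise; the function name and signature are illustrative only, not a required interface:

import numpy as np

def huber(a, delta):
    # Quadratic for |a| <= delta, linear in the tails.
    return np.where(np.abs(a) <= delta,
                    0.5 * a ** 2,
                    delta * (np.abs(a) - 0.5 * delta))
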
(a) [1pt] Sketch the Huber loss $L_\delta(y, t)$ and squared error loss $L_{SE}(y, t) = \frac{1}{2}(y - t)^2$ for $t = 0$, either by hand or using a plotting library. Based on your sketch, why would you expect the Huber loss to be more robust to outliers?
(b) [1pt] Just as with linear regression, assume a linear model:
$$y = \mathbf{w}^\top \mathbf{x} + b.$$
Give formulas for the partial derivatives $\partial L_\delta / \partial \mathbf{w}$ and $\partial L_\delta / \partial b$. (We recommend you find a formula for the derivative $H'_\delta(a)$, and then give your answers in terms of $H'_\delta(y - t)$.)
(c) [1pt] Write Python code to perform (full batch mode) gradient descent on this model.
Assume the training dataset is given as a design matrix X and target vector y. Initialize
w and b to all zeros. Your code should be vectorized, i.e. you should not have a for
loop over training examples or input dimensions. You may find the function np.where
helpful.
Submit your code as q1.py.
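As a rough illustration of the kind of vectorized, full-batch update being asked for, here is a minimal sketch; the helper name, the learning rate, and the fixed iteration count are assumptions for illustration, not the required q1.py interface:

import numpy as np

def gradient_descent_huber(X, y, delta, lr=0.01, num_iter=1000):
    # Full-batch gradient descent for predictions X @ w + b under the Huber loss.
    # X: (N, D) design matrix, y: (N,) target vector. Illustrative sketch only.
    N, D = X.shape
    w = np.zeros(D)
    b = 0.0
    for _ in range(num_iter):
        a = X @ w + b - y                                          # residuals (prediction - target)
        dH = np.where(np.abs(a) <= delta, a, delta * np.sign(a))   # derivative of H_delta at a
        w -= lr * (X.T @ dH) / N                                   # chain rule, averaged over the batch
        b -= lr * np.mean(dH)
    return w, b
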
2. [5pts] Locally Weighted Regression.
(a) [2pts] Given $\{(\mathbf{x}^{(1)}, y^{(1)}), \ldots, (\mathbf{x}^{(N)}, y^{(N)})\}$ and positive weights $a^{(1)}, \ldots, a^{(N)}$, show that the solution to the weighted least squares problem
$$\mathbf{w}^* = \arg\min_{\mathbf{w}} \frac{1}{2}\sum_{i=1}^{N} a^{(i)}\left(y^{(i)} - \mathbf{w}^T\mathbf{x}^{(i)}\right)^2 + \frac{\lambda}{2}\|\mathbf{w}\|^2 \qquad (1)$$
is given by the formula
$$\mathbf{w}^* = \left(\mathbf{X}^T\mathbf{A}\mathbf{X} + \lambda\mathbf{I}\right)^{-1}\mathbf{X}^T\mathbf{A}\mathbf{y} \qquad (2)$$
where $\mathbf{X}$ is the design matrix (defined in class) and $\mathbf{A}$ is a diagonal matrix with $A_{ii} = a^{(i)}$.
It may help you to review Section 3.1 of the csc321 notes (http://www.cs.toronto.edu/~rgrosse/courses/csc321_2018/readings/L02%20Linear%20Regression.pdf).
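As a quick sanity check on equation (2), here is a minimal NumPy sketch of evaluating it with a linear solver rather than an explicit inverse (the function and array names are illustrative):

import numpy as np

def weighted_ridge_solution(X, y, a, lam):
    # Solve (X^T A X + lambda I) w = X^T A y with A = diag(a), without forming A explicitly.
    D = X.shape[1]
    XTA = X.T * a                      # scales column j of X.T by a[j], same as X.T @ np.diag(a)
    lhs = XTA @ X + lam * np.eye(D)
    rhs = XTA @ y
    return np.linalg.solve(lhs, rhs)
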
(b) [2pts] Locally reweighted least squares combines ideas from k-NN and linear regression. For each new test example $\mathbf{x}$, we compute distance-based weights for each training example
$$a^{(i)} = \frac{\exp\left(-\|\mathbf{x} - \mathbf{x}^{(i)}\|^2 / 2\tau^2\right)}{\sum_j \exp\left(-\|\mathbf{x} - \mathbf{x}^{(j)}\|^2 / 2\tau^2\right)},$$
compute
$$\mathbf{w}^* = \arg\min_{\mathbf{w}} \frac{1}{2}\sum_{i=1}^{N} a^{(i)}\left(y^{(i)} - \mathbf{w}^T\mathbf{x}^{(i)}\right)^2 + \frac{\lambda}{2}\|\mathbf{w}\|^2,$$
and predict $\hat{y} = \mathbf{x}^T\mathbf{w}^*$. Complete the implementation of locally reweighted least squares by providing the missing parts of q2.py.
Important things to notice while implementing: First, do not invert any matrix; use a linear solver (numpy.linalg.solve is one example). Second, notice that
$$\frac{\exp(A_i)}{\sum_j \exp(A_j)} = \frac{\exp(A_i - B)}{\sum_j \exp(A_j - B)},$$
but if we use $B = \max_j A_j$ it is much more numerically stable, because $\frac{\exp(A_i)}{\sum_j \exp(A_j)}$ overflows/underflows easily. This is handled automatically in the scipy package with the scipy.misc.logsumexp function (https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.misc.logsumexp.html).
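A minimal sketch of computing the weights $a^{(i)}$ for one test point in log space (the function name and the dist2 argument are illustrative; note that recent SciPy releases expose this routine as scipy.special.logsumexp):

import numpy as np
from scipy.special import logsumexp   # called scipy.misc.logsumexp in older SciPy releases

def lrls_weights(dist2, tau):
    # dist2: (N,) squared distances ||x - x^(i)||^2 from the test point to each training point.
    logits = -dist2 / (2 * tau ** 2)
    log_a = logits - logsumexp(logits)   # log of the normalized weights, computed stably
    return np.exp(log_a)
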
(c) [1pt] Based on our understanding of overfitting and underfitting, how would you expect
the training error and the validation error to vary as a function of τ? (I.e., what do you
expect the curves to look like?)
Now run the experiment. Randomly hold out 30% of the dataset as a validation set.
Compute the average loss for different values of τ in the range [10,1000] on both the
training set and the validation set. Plot the training and validation losses as a function
of τ (using a log scale for τ). Was your guess correct?
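One possible shape for this experiment, written against a hypothetical predict(x, X_train, y_train, tau) helper standing in for your completed q2.py code (all names here are placeholders, not the provided starter interface):

import numpy as np

def run_tau_experiment(X, y, predict, taus, val_frac=0.3, seed=0):
    # Average squared-error loss on a random train/validation split, for each value of tau.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(val_frac * len(X))
    X_va, y_va = X[idx[:n_val]], y[idx[:n_val]]
    X_tr, y_tr = X[idx[n_val:]], y[idx[n_val:]]
    train_losses, val_losses = [], []
    for tau in taus:
        pred_tr = np.array([predict(x, X_tr, y_tr, tau) for x in X_tr])
        pred_va = np.array([predict(x, X_tr, y_tr, tau) for x in X_va])
        train_losses.append(np.mean((pred_tr - y_tr) ** 2) / 2)
        val_losses.append(np.mean((pred_va - y_va) ** 2) / 2)
    return train_losses, val_losses

Plotting the two returned loss curves against taus with matplotlib's semilogx then gives the log-scale τ axis.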
3. [2pts] AdaBoost. The goal of this question is to show that the AdaBoost algorithm changes
the weights in order to force the weak learner to focus on difficult data points. Here we consider
the case where the target labels are from the set {−1, +1} and the weak learner also returns a classifier whose outputs belong to {−1, +1} (instead of {0, 1}). Consider the t-th iteration of AdaBoost, where the weak learner is
$$h_t \leftarrow \operatorname*{argmin}_{h \in \mathcal{H}} \sum_{i=1}^{N} w_i \, \mathbb{I}\{h(\mathbf{x}^{(i)}) \neq t^{(i)}\},$$
the w-weighted classification error is
$$\text{err}_t = \frac{\sum_{i=1}^{N} w_i \, \mathbb{I}\{h_t(\mathbf{x}^{(i)}) \neq t^{(i)}\}}{\sum_{i=1}^{N} w_i},$$
and the classifier coefficient is $\alpha_t = \frac{1}{2}\log\frac{1 - \text{err}_t}{\text{err}_t}$. (Here, log denotes the natural logarithm.)
AdaBoost changes the weight of each sample depending on whether the weak learner $h_t$ classifies it correctly or incorrectly. The updated weight for sample $i$ is denoted by $w'_i$ and is
$$w'_i \leftarrow w_i \exp\left(-\alpha_t \, t^{(i)} h_t(\mathbf{x}^{(i)})\right).$$
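For intuition (a numerical illustration, not part of what you are asked to hand in): if $\text{err}_t = 0.25$, then $\alpha_t = \frac{1}{2}\log 3$, so every misclassified example (where $t^{(i)} h_t(\mathbf{x}^{(i)}) = -1$) has its weight multiplied by $e^{\alpha_t} = \sqrt{3}$, while every correctly classified example has its weight multiplied by $e^{-\alpha_t} = 1/\sqrt{3}$.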
Show that the error w.r.t. $(w'_1, \ldots, w'_N)$ is exactly $\frac{1}{2}$. That is, show that
$$\text{err}'_t = \frac{\sum_{i=1}^{N} w'_i \, \mathbb{I}\{h_t(\mathbf{x}^{(i)}) \neq t^{(i)}\}}{\sum_{i=1}^{N} w'_i} = \frac{1}{2}.$$
Note that here we use the weak learner of iteration t and evaluate it according to the new weights, which will be used to learn the $(t+1)$-st weak learner. What is the interpretation of this result?
Tips:
• Start from $\text{err}'_t$ and divide the summation into the two sets $E = \{i : h_t(\mathbf{x}^{(i)}) \neq t^{(i)}\}$ and its complement $E^c = \{i : h_t(\mathbf{x}^{(i)}) = t^{(i)}\}$.
• Note that $\frac{\sum_{i \in E} w_i}{\sum_{i=1}^{N} w_i} = \text{err}_t$.