
CSC311 Fall 2020 Homework 3

Submission: You need to submit the following files through MarkUs [1]:
• Your answers to Questions 1, 2, and 3, and the outputs requested for Questions 2 and 3, as a PDF
file titled hw3_writeup.pdf. You can produce the file however you like (e.g. LaTeX, Microsoft
Word, scanner), as long as it is readable.
• Python file naive_bayes.py completed for Questions 2 and 3.
Neatness Point: One point will be given for neatness. You will receive this point as long as we
don’t have a hard time reading your solutions or understanding the structure of your code.
Late Submission: 10% of the marks will be deducted for each day late, up to a maximum of 3
days. After that, no submissions will be accepted.
Computing: To install Python and required libraries, see the instructions on the course web page.
Homeworks are individual work. See the Course Information handout [2] for detailed policies.
1. [7pts] AdaBoost.
(a) [5pts] The goal of this question is to show that the AdaBoost algorithm changes the
weights in order to force the weak learner to focus on difficult data points. Here we
consider the case where the target labels are from the set {−1, +1} and the weak learner
also returns a classifier whose outputs belong to {−1, +1} (instead of {0, 1}). Consider
the t-th iteration of AdaBoost, where the weak learner is
$$h_t \leftarrow \operatorname*{argmin}_{h \in \mathcal{H}} \sum_{i=1}^{N} w_i \, \mathbb{I}\{h(x^{(i)}) \neq t^{(i)}\},$$
the w-weighted classification error is
$$\mathrm{err}_t = \frac{\sum_{i=1}^{N} w_i \, \mathbb{I}\{h_t(x^{(i)}) \neq t^{(i)}\}}{\sum_{i=1}^{N} w_i},$$
and the classifier coefficient is
$$\alpha_t = \frac{1}{2} \log \frac{1 - \mathrm{err}_t}{\mathrm{err}_t}.$$
(Here, log denotes the natural logarithm.) AdaBoost changes the weight of each sample depending on whether the weak
learner $h_t$ classifies it correctly or incorrectly. The updated weight for sample $i$ is
denoted by $w'_i$ and is
$$w'_i \leftarrow w_i \exp\left(-\alpha_t\, t^{(i)} h_t(x^{(i)})\right).$$
Show that the error w.r.t. $(w'_1, \ldots, w'_N)$ is exactly $\frac{1}{2}$. That is, show that
$$\mathrm{err}'_t = \frac{\sum_{i=1}^{N} w'_i \, \mathbb{I}\{h_t(x^{(i)}) \neq t^{(i)}\}}{\sum_{i=1}^{N} w'_i} = \frac{1}{2}.$$
[1] https://markus.teach.cs.toronto.edu/csc311-2020-09
[2] http://www.cs.toronto.edu/~rgrosse/courses/csc311_f20/syllabus.pdf
Note that here we use the weak learner of iteration t and evaluate it according to the new
weights, which will be used to learn the (t+1)-st weak learner. What is the interpretation
of this result?
Hints:
i. Start from $\mathrm{err}'_t$ and divide the summation into the two sets $E = \{i : h_t(x^{(i)}) \neq t^{(i)}\}$
and its complement $E^c = \{i : h_t(x^{(i)}) = t^{(i)}\}$.
ii. Note that
$$\frac{\sum_{i \in E} w_i}{\sum_{i=1}^{N} w_i} = \mathrm{err}_t.$$
(b) [2pts] Recall from lecture that we can rewrite the 0-1 loss as
$$\mathbb{I}[h(x^{(n)}) \neq t^{(n)}] = \frac{1}{2}\left(1 - h(x^{(n)}) \cdot t^{(n)}\right).$$
Use this identity to prove that the weight update from part (a),
$$w'_i \leftarrow w_i \exp\left(-\alpha_t\, t^{(i)} h_t(x^{(i)})\right),$$
is equal, up to a constant factor, to the weight update
$$w'_i \leftarrow w_i \exp\left(2\alpha_t\, \mathbb{I}\{h_t(x^{(i)}) \neq t^{(i)}\}\right).$$
What is the constant factor? Since we normalize the weights when computing $\mathrm{err}_t$, these
two updates are equivalent.
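(As a quick sanity check of the 0-1 loss identity above, not part of the requested proof: for labels in {−1, +1},
$$h(x^{(n)}) = t^{(n)} \;\Rightarrow\; h(x^{(n)})\cdot t^{(n)} = 1 \;\Rightarrow\; \tfrac{1}{2}(1 - 1) = 0, \qquad h(x^{(n)}) \neq t^{(n)} \;\Rightarrow\; h(x^{(n)})\cdot t^{(n)} = -1 \;\Rightarrow\; \tfrac{1}{2}(1 + 1) = 1,$$
which matches the indicator in both cases.)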
2. [13pts] Fitting a Naïve Bayes Model
In this question, we'll fit a Naïve Bayes model to the MNIST digits using maximum likelihood. The starter code will download the dataset and parse it for you. Each training sample
$(t^{(i)}, x^{(i)})$ is composed of a vectorized binary image $x^{(i)} \in \{0, 1\}^{784}$ and a 1-of-10 encoded class
label $t^{(i)}$, i.e., $t^{(i)}_c = 1$ means image $i$ belongs to class $c$.
Given parameters π and θ, Naïve Bayes defines the joint probability of each data point
x and its class label c as follows:
$$p(x, c \mid \theta, \pi) = p(c \mid \theta, \pi)\, p(x \mid c, \theta, \pi) = p(c \mid \pi) \prod_{j=1}^{784} p(x_j \mid c, \theta_{jc}),$$
where $p(c \mid \pi) = \pi_c$ and $p(x_j = 1 \mid c, \theta, \pi) = \theta_{jc}$. Here, θ is a matrix of probabilities for each
pixel and each class, so its dimensions are 784 × 10, and π is a vector with one entry for
each class. (Note that in lecture, we simplified notation and didn't write the probabilities
conditioned on the parameters, i.e., $p(c \mid \pi)$ is written as $p(c)$ in the lecture slides.)
For binary data ($x_j \in \{0, 1\}$), we can write the Bernoulli likelihood as
$$p(x_j \mid c, \theta_{jc}) = \theta_{jc}^{x_j} (1 - \theta_{jc})^{(1 - x_j)}, \qquad (0.1)$$
which is just a way of expressing $p(x_j = 1 \mid c, \theta_{jc}) = \theta_{jc}$ and $p(x_j = 0 \mid c, \theta_{jc}) = 1 - \theta_{jc}$ in
a compact form. For the prior $p(t \mid \pi)$, we use a categorical distribution (the generalization of the
Bernoulli distribution to the multi-class case),
$$p(t_c = 1 \mid \pi) = p(c \mid \pi) = \pi_c, \quad \text{or equivalently} \quad p(t \mid \pi) = \prod_{j=0}^{9} \pi_j^{t_j}, \quad \text{where } \sum_{i=0}^{9} \pi_i = 1,$$
where p(c |π) and p(t |π) can be used interchangeably. You will fit the parameters θ and π
using MLE and MAP techniques. In both cases, your fitting procedure can be written as a
few simple matrix multiplication operations.
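For example (a minimal sketch, not taken from the starter code: it assumes the training data are available as NumPy arrays train_images of shape N × 784 and train_labels of shape N × 10, and these names are assumptions), the sufficient statistics that both estimators are built from are simple count matrices:

    import numpy as np

    def count_statistics(train_images, train_labels):
        # train_images: (N, 784) binary array; train_labels: (N, 10) one-hot array.
        # Number of training examples in each class c (length-10 vector).
        class_counts = train_labels.sum(axis=0)
        # Number of class-c examples in which pixel j is on (784 x 10 matrix).
        pixel_counts = train_images.T @ train_labels
        return class_counts, pixel_counts

The estimators you derive in parts (a) and (e) below can then be written as simple elementwise expressions in these counts.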
(a) [3pts] First, derive the maximum likelihood estimator (MLE) for the class-conditional
pixel probabilities θ and the prior π. Derivations should be rigorous.
Hint 1: We saw in lecture that the MLE can be thought of as a 'ratio of counts' for the data,
so what should $\hat{\theta}_{jc}$ be counting?
Hint 2: Similar to the binary case, when calculating the MLE of $\pi_j$ for $j = 0, 1, \ldots, 8$,
write $p(t^{(i)} \mid \pi) = \prod_{j=0}^{9} \pi_j^{t_j^{(i)}}$, in the log-likelihood substitute $\pi_9 = 1 - \sum_{j=0}^{8} \pi_j$, and then
take derivatives w.r.t. $\pi_j$. This will give you the ratio $\hat{\pi}_j / \hat{\pi}_9$ for $j = 0, 1, \ldots, 8$. You know
that the $\hat{\pi}_j$'s sum up to 1.
(b) [1pt] Derive the log-likelihood $\log p(t \mid x, \theta, \pi)$ for a single training image.
(c) [3pt] Fit the parameters θ and π using the training set with MLE, and try to report the
average log-likelihood per data point, $\frac{1}{N} \sum_{i=1}^{N} \log p(t^{(i)} \mid x^{(i)}, \hat{\theta}, \hat{\pi})$, using Equation (0.1).
What goes wrong? (It's okay if you can't compute the average log-likelihood here.)
(d) [1pt] Plot the MLE estimator $\hat{\theta}$ as 10 separate greyscale images, one for each class.
(e) [2pt] Derive the maximum a posteriori (MAP) estimator for the class-conditional
pixel probabilities θ, using a Beta(3, 3) prior on each $\theta_{jc}$. Hint: it has
a simple final form, and you can ignore the Beta normalizing constant.
(f) [2pt] Fit the parameters θ and π using the training set with the MAP estimators from the previous part, and report both the average log-likelihood per data point,
$\frac{1}{N} \sum_{i=1}^{N} \log p(t^{(i)} \mid x^{(i)}, \hat{\theta}, \hat{\pi})$,
and the accuracy on both the training and test sets. The accuracy is defined as the fraction of examples whose true class is correctly predicted by $\hat{c} = \operatorname*{argmax}_c \log p(t_c = 1 \mid x, \hat{\theta}, \hat{\pi})$.
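As a rough illustration of this prediction rule (a sketch only, with assumed array names theta_hat of shape 784 × 10, pi_hat of length 10, and a binary image matrix images of shape N × 784; none of these names come from the starter code), the unnormalized class log-posteriors follow directly from Equation (0.1) and the prior, and can be computed with one matrix multiplication per term:

    import numpy as np

    def class_log_posteriors(images, theta_hat, pi_hat):
        # Up to a constant, log p(t_c = 1 | x) = log pi_c
        #   + sum_j [ x_j log theta_jc + (1 - x_j) log(1 - theta_jc) ].
        log_like = images @ np.log(theta_hat) + (1.0 - images) @ np.log(1.0 - theta_hat)
        return log_like + np.log(pi_hat)  # shape (N, 10)

    # Predicted classes: np.argmax(class_log_posteriors(images, theta_hat, pi_hat), axis=1)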
(g) [1pt] Plot the MAP estimator $\hat{\theta}$ as 10 separate greyscale images, one for each class.
3. [4pts] Generating from a Naïve Bayes Model
Defining a joint probability distribution over the data lets us generate new data, and also lets
us answer all sorts of queries about the data. This is why these models are called Generative
Models. We will use the Naïve Bayes model trained in the previous question to generate data.
(a) [1pt] True or false: Given this model's assumptions, any two pixels $x_i$ and $x_j$ where
$i \neq j$ are independent given c.
(b) [1pt] True or false: Given this model's assumptions, any two pixels $x_i$ and $x_j$ where
$i \neq j$ are independent after marginalizing over c.
(c) [2pts] Using the parameters fit with MAP in Question 2, produce random image samples from the model. That is, randomly sample and plot 10 binary images from the
marginal distribution $p(x \mid \hat{\theta}, \hat{\pi})$. Hint: To sample from $p(x \mid \hat{\theta}, \hat{\pi})$, first sample the random
variable c from $p(c \mid \hat{\pi})$ using np.random.choice, then, depending on the value of c,
sample $x_j$ from $p(x_j \mid c, \hat{\theta}_{jc})$ for j = 1, ..., 784 using np.random.binomial(1, ...). These
functions can take matrix probabilities as input, so your solution to this part should be
a few lines of code.
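A minimal sketch of the procedure the hint describes, assuming fitted arrays theta_hat of shape 784 × 10 and pi_hat of length 10 (these variable names are assumptions, not part of the starter code):

    import numpy as np

    def sample_images(theta_hat, pi_hat, num_samples=10):
        # Sample a class label for each image from the categorical prior p(c | pi).
        classes = np.random.choice(10, size=num_samples, p=pi_hat)
        # For each sampled class, draw every pixel from its Bernoulli p(x_j = 1 | c) = theta_jc.
        pixel_probs = theta_hat[:, classes].T        # shape (num_samples, 784)
        return np.random.binomial(1, pixel_probs)    # binary images, shape (num_samples, 784)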
(d) [Optional] One of the advantages of generative models is that they can handle missing data, or be used to answer different sorts of questions about the model. Derive
$p(x_{\text{bottom}} \mid x_{\text{top}}, \theta, \pi)$, the marginal distribution of a single pixel in the bottom half of
an image given the top half, conditioned on your fit parameters. Hint: you'll have to
marginalize over c.
(e) [Optional] For 20 images from the training set, plot the top half of the image concatenated
with the marginal distribution over each pixel in the bottom half, i.e., the bottom half of
the image should use grayscale to represent the marginal probability of each pixel being
1 (darker for values close to 1).