$30
CSC311
Homework 2
.
Submission: You need to submit five files through MarkUs1
:
• Your answers to Questions 1, 2, 3, and 4, as a PDF file titled hw2_writeup.pdf. You can
produce the file however you like (e.g. LATEX, Microsoft Word, scanner), as long as it is
readable.
• Python files run_knn.py, logistic.py, and run_logistic_regression.py completed for
Question 3.
• Python files q4.py completed for Question 4.
Neatness Point: One point will be given for neatness. You will receive this point as long as we
don’t have a hard time reading your solutions or understanding the structure of your code.
Late Submission: 10% of the marks will be deducted for each day late, up to a maximum of 3
days. After that, no submissions will be accepted.
Computing: To install Python and required libraries, see the instructions on the course web page.
Homeworks are individual work. See the Course Information handout2
for detailed policies.
1. [9pts] Expected Loss and Bayes Optimality You are running an email service, and
one of your key features is a spam filter. Every email is either spam or non-spam, which
we represent with the target t ∈ {Spam, NonSpam}. You need to decide whether to keep it
in the inbox or remove it to the spam folder. We represent this with the decision variable
y ∈ {Keep, Remove}. We’d like to remove spam emails and keep non-spam ones, but the
customers will be much more unhappy if we remove a non-spam email than if we keep a spam
email. We can represent this with the following loss function L(y, t):
NonSpam Spam
Keep 0 1
Remove 100 0
Your studies indicate that 10% of the emails are spam, i.e. Pr(t = Spam) = 0.1.
(a) [2pts] Evaluate the expected loss E[L(y, t)] for the policy that keeps every email (y =
Keep), and for the policy that removes every email (y = Remove).
(b) [2pts] Now suppose you get to observe a feature vector x for each email, and using
your knowledge of the joint distribution p(x, t), you infer p(t| x). What is the Bayes
optimal decision rule? I.e., say how to determine the Bayes optimal decision y∗ from the
conditional probability Pr(t = Spam | x).
1
https://markus.teach.cs.toronto.edu/csc311-2021-09
2
http://www.cs.toronto.edu/~rgrosse/courses/csc311_f21/syllabus.pdf
1
CSC311 Fall 2021 Homework 2
(c) [4pts] After some analysis, you’ve found two words that are indicative of an email being
spam: “Viagra” and “Nigeria”. You define two input features: x1, which takes the value
1 if the email contains the word “Viagra” and 0 otherwise, and x2, which is the same
for “Nigeria”. For a spam email, these features have the following joint distribution:
x2 = 0 x2 = 1
x1 = 0 0.4 0.3
x1 = 1 0.2 0.1
For a non-spam email, these features have the following joint distribution:
x2 = 0 x2 = 1
x1 = 0 0.998 0.001
x1 = 1 0.001 0
Determine the Bayes optimal decision rule, i.e. determine the Bayes optimal decision y∗
for each value of x. Justify your answer.
(d) [1pts] What is the expected loss E[L(y∗, t)]?
2. [4pts] Feature Maps. Suppose we have the following 1-D dataset for binary classification:
x t
-1 1
1 0
3 1
(a) [2pts] Argue briefly (at most a few sentences) that this dataset is not linearly separable.
(Your argument should resemble the one we used in lecture to prove XOR is not linearly
separable.)
(b) [2pts] Now suppose we apply the feature map
ψ(x) =
ψ1(x)
ψ2(x)
=
x
x
2
.
Assume we have no bias term, so that the parameters are w1 and w2. Write down the
constraint on w1 and w2 corresponding to each training example, and then find a pair
of values (w1, w2) that correctly classify all the examples. Remember that there is no
bias term.
3. [15pts] kNN vs. Logistic Regression. In this problem, you will compare the performance
and characteristics of different classifiers, namely k-Nearest Neighbors and Logistic Regression. You will complete the provided code in q3/ and experiment with the completed code.
You should understand the code instead of using it as a black box.
The data you will be working with is a subset of MNIST hand-written digits, 4s and 9s, represented as 28×28 pixel arrays. We show the example digits in figure 1. There are two training
sets: mnist_train, which contains 80 examples of each class, and mnist_train_small, which
contains 5 examples of each class. There is also a validation set mnist_valid that you should
use for model selection, and a test set mnist_test that you should use for reporting the final
performance. Optionally, the code for visualizing the datasets is located at plot_digits.py.
2
CSC311 Fall 2021 Homework 2
Figure 1: Example digits. Top and bottom show digits of 4s and 9s, respectively.
3.1. k-Nearest Neighbors. Use the supplied kNN implementation to predict labels for
mnist_valid, using the training set mnist_train.
(a) [2pts] Implement a function run_knn in run_knn.py that runs kNN for different
values of k ∈ {1, 3, 5, 7, 9} and plots the classification rate on the validation set
(number of correctly predicted cases, divided by total number of data points) as a
function of k. Report the plot in the write-up.
(b) [2pts] Comment on the performance of the classifier and argue which value of k you
would choose. What is the classification rate for k
∗
, your chosen value of k? Also
report the classification rate for k
∗ + 2 and k
∗ − 2. How does the test performance
of these values of k correspond to the validation performance3
?
3.2. Logistic Regression. Read the provided code in run_logistic_regression.py and
logistic.py. You need to implement the logistic regression model, where the cost is
defined as:
J =
1
N
X
N
i=1
LCE(y
(i)
, t(i)
) = 1
N
X
N
i=1
−t
(i)
log y
(i) − (1 − t
(i)
) log(1 − y
(i)
)
,
where N is the total number of data points.
(a) [4pts] Implement functions logistic_predict, evaluate, and logistic located
at logistic.py.
(b) [5pts] Complete the missing parts in a function run_logistic_regression located
at run_logistic_regression.py. You may use the implemented functions from
part (a). Run the code on both mnist_train and mnist_train_small. Check
whether the value returned by run_check_grad is small to make sure your implementation in part (a) is correct. Experiment with the hyperparameters for the
learning rate, the number of iterations (if you have a smaller learning rate, your
model will take longer to converge), and the way in which you initialize the weights.
If you get NaN/Inf errors, you may try to reduce your learning rate or initialize with
3
In general, you shouldn’t peek at the test set multiple times, but we do this for this question as an illustrative
exercise.
3
CSC311 Fall 2021 Homework 2
smaller weights. For each dataset, report which hyperparameter settings you found
worked the best and the final cross entropy and classification error on the training,
validation, and test sets. Note that you should only compute the test error once you
have selected your best hyperparameter settings using the validation set.
(c) [2pts] Examine how the cross entropy changes as training progresses. Generate and
report 2 plots, one for each of mnist_train and mnist_train_small. In each plot,
you need show two curves: one for the training set and one for the validation set.
Run your code several times and observe if the results change. If they do, how would
you choose the best parameter settings?
4. [8pts] Locally Weighted Regression.
(a) [2pts] Given {(x
(1), y(1)), ..,(x
(N)
, y(N)
)} and positive weights a
(1), ..., a(N)
show that
the solution to the weighted least squares problem
w∗ = arg min
1
2
X
N
i=1
a
(i)
(y
(i) − wT x
(i)
)
2 +
λ
2
||w||2
(1)
is given by the formula
w∗ =
XT AX + λI
−1 XT Ay (2)
where X is the design matrix (defined in class) and A is a diagonal matrix where Aii =
a
(i)
(b) [2pts] Locally reweighted least squares combines ideas from k-NN and linear regression.
For each new test example x we compute distance-based weights for each training example a
(i) =
exp(−||x−x
(i)
||2/2τ
2
P
)
j
exp(−||x−x(j)
||2/2τ
2)
, computes w∗ = arg min 1
2
PN
i=1 a
(i)
(y
(i) − wT x
(i)
)
2 +
λ
2
||w||2 and predicts ˆy = x
T w∗
. Complete the implementation of locally reweighted least
squares by providing the missing parts for q4.py.
Important things to notice while implementing: First, do not invert any matrix, use
a linear solver (numpy.linalg.solve is one example). Second, notice that P
exp(Ai)
j
exp(Aj ) =
P
exp(Ai−B)
j
exp(Aj−B)
but if we use B = maxj Aj it is much more numerically stable as P
exp(Ai)
j
exp(Aj )
overflows/underflows easily. This is handled automatically in the scipy package with the
scipy.misc.logsumexp function4
.
(c) [2pt] Randomly hold out 30% of the dataset as a validation set. Compute the average
loss for different values of τ in the range [10,1000] on both the training set and the
validation set. Plot the training and validation losses as a function of τ (using a log
scale for τ ).
(d) [2pt] How would you expect this algorithm to behave as τ → ∞? When τ → 0? Is this
what actually happened?
4
https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.misc.logsumexp.html