[CMPUT 466/566, Fall 2021] Introduction to Machine Learning
Coding Assignment 2
Submission Information
Submit a zip file containing a PDF report and all the code needed to replicate the results.
Problem 1 [50%]
Download: Codebase
In this problem, you are asked to apply linear regression and logistic regression to binary classification. In
particular, this problem shows that linear regression is a bad model for classification problems.
We consider a binary classification problem, where the input is a two-dimensional vector and the output is in
{0,1}. In other words, $\mathbf{x} \in \mathbb{R}^2$ and $t \in \{0, 1\}$. Specifically, we train our classifiers on the following two
datasets:
Dataset A:
In this dataset, positive samples are generated by a bivariate normal distribution , where
and . Negative samples are generated by another , where
. The covariance is the same as the positive samples.
We have 400 samples in total, where 200 are positive and 200 are negative. The dataset is plotted in the left
panel below.
Dataset B: We now construct a new dataset by shifting half of the positive (blue) samples to the upper right
as shown in the right panel. In other words, the positive samples are generated by and
with equal probability, where .
[Figure: Dataset A (left panel) and Dataset B (right panel)]
Questions:
For each of Dataset A and Dataset B:
(1) Train a classifier by thresholding a linear regression model. In other words, treat the target 0/1 labels as
real numbers, and classify a sample as positive if the predicted value is greater than or equal to 0.5.
(2) Train a logistic regression classifier on the same data.
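The two classifiers in (1) and (2) can be sketched as follows. This is a minimal NumPy sketch, not the provided codebase: the blob means, the identity covariance, the learning rate, and the function names are all illustrative assumptions, and the data is a stand-in for Dataset A or Dataset B.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: two Gaussian blobs, 200 samples per class.
# The means and covariance are illustrative assumptions, not the
# parameters of Dataset A or Dataset B.
cov = np.eye(2)
X_pos = rng.multivariate_normal([2.0, 2.0], cov, size=200)
X_neg = rng.multivariate_normal([-2.0, -2.0], cov, size=200)
X = np.vstack([X_pos, X_neg])
t = np.concatenate([np.ones(200), np.zeros(200)])

# Append a constant 1 so the bias is learned as an extra weight.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])


def fit_linear_regression(X_aug, t):
    """Least-squares fit that treats the 0/1 labels as real numbers."""
    w, *_ = np.linalg.lstsq(X_aug, t, rcond=None)
    return w


def fit_logistic_regression(X_aug, t, lr=0.1, n_iters=2000):
    """Batch gradient descent on the mean cross-entropy loss."""
    w = np.zeros(X_aug.shape[1])
    for _ in range(n_iters):
        y = 1.0 / (1.0 + np.exp(-(X_aug @ w)))  # sigmoid output
        w -= lr * X_aug.T @ (y - t) / len(t)
    return w


def accuracy(pred, t):
    """Fraction of predictions that match the labels."""
    return float(np.mean(pred == t))


w_lin = fit_linear_regression(X_aug, t)
w_log = fit_logistic_regression(X_aug, t)

# Linear regression: positive if the predicted value is >= 0.5.
pred_lin = (X_aug @ w_lin >= 0.5).astype(float)
# Logistic regression: sigmoid(z) >= 0.5 exactly when z >= 0.
pred_log = (X_aug @ w_log >= 0.0).astype(float)

acc_lin = accuracy(pred_lin, t)
acc_log = accuracy(pred_log, t)
```

Swapping the stand-in arrays X and t for the actual datasets should leave the rest of the sketch unchanged.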
Submission (four numbers and four plots):
Report the following four numbers:
1. Training accuracy of linear regression on Dataset A
2. Training accuracy of logistic regression on Dataset A
3. Training accuracy of linear regression on Dataset B
4. Training accuracy of logistic regression on Dataset B
and plot four decision boundaries. Please make clear which classifier is applied in the plot.
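For a linear model with two inputs, each decision boundary is a straight line, so it can be drawn from a handful of points. The sketch below assumes the weights are stored as [w1, w2, bias] and that w2 is non-zero; the helper name and the example weights are hypothetical. The returned (x1, x2) pairs can be passed to any plotting routine.

```python
import numpy as np


def boundary_x2(w, x1, threshold):
    """Solve w[0]*x1 + w[1]*x2 + w[2] = threshold for x2 (assumes w[1] != 0)."""
    return (threshold - w[2] - w[0] * x1) / w[1]


# Hypothetical weights [w1, w2, bias], for illustration only.
w_example = np.array([0.1, 0.2, 0.4])
x1 = np.linspace(-4.0, 4.0, 5)

# Thresholded linear regression: the boundary is where the prediction equals 0.5.
x2_linear = boundary_x2(w_example, x1, threshold=0.5)
# Logistic regression: sigmoid(z) = 0.5 where the linear score z equals 0.
x2_logistic = boundary_x2(w_example, x1, threshold=0.0)
```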
Problem 2 [50%]
Download: Codebase
In this coding problem, we will implement the softmax regression for multi-class classification using the MNIST
dataset.
Dataset
First, download the datasets from the link above. You need to unzip the .gz files, either by double clicking or
with a command such as gunzip -k file.gz
The dataset contains 60K training samples, and 10K test samples. Again, we split 10K from the training
samples for validation. In other words, we have 50K training samples, 10K validation samples, and 10K test
samples. The target label is among {0, 1, …, 9}.
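The 50K/10K split can be sketched as below. The provided codebase may already perform this split, and possibly in a different way (for example, by taking the last 10K samples rather than a random subset), so treat the function name, the shuffling, and the seed as assumptions. The placeholder arrays use a tiny feature dimension only to keep the example light.

```python
import numpy as np


def split_train_val(X, t, n_val=10_000, seed=0):
    """Hold out n_val randomly chosen samples for validation."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(X.shape[0])
    val_idx, train_idx = perm[:n_val], perm[n_val:]
    return X[train_idx], t[train_idx], X[val_idx], t[val_idx]


# Placeholder arrays with 60,000 rows; real MNIST rows hold 784 pixel values.
X_all = np.zeros((60_000, 4))
t_all = np.zeros(60_000, dtype=np.int64)

X_train, t_train, X_val, t_val = split_train_val(X_all, t_all)
```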
Algorithm
We will implement stochastic gradient descent (SGD) on the cross-entropy loss of softmax regression as the learning
algorithm. The measure of success will be the accuracy (i.e., the fraction of correct predictions).
The general framework for this coding assignment is the same as SGD for linear regression, so you may
re-use most of the code. However, you shall change the computation of output, the loss function, the measure
of success, and the gradient whenever needed.
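The four pieces that change (output, loss, measure of success, gradient) can be sketched as follows. This is a minimal sketch under assumptions: the function names, the parameter layout (weight matrix W of shape features x classes plus a bias vector b), and the tiny random demo problem are hypothetical and do not come from the codebase.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def sgd_step(W, b, X_batch, t_batch, lr):
    """One mini-batch update; returns the new parameters and the batch loss."""
    n = X_batch.shape[0]
    y = softmax(X_batch @ W + b)  # (n, K) class probabilities
    # Mean cross-entropy; the small constant guards against log(0).
    loss = -np.mean(np.log(y[np.arange(n), t_batch] + 1e-12))
    grad_z = y.copy()
    grad_z[np.arange(n), t_batch] -= 1.0  # y minus the one-hot target
    grad_z /= n
    W = W - lr * X_batch.T @ grad_z
    b = b - lr * grad_z.sum(axis=0)
    return W, b, loss


def accuracy(W, b, X, t):
    """Fraction of samples whose highest-scoring class equals the label."""
    return float(np.mean(np.argmax(X @ W + b, axis=1) == t))


# Tiny random problem, purely to exercise the functions.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(64, 5))
t_demo = rng.integers(0, 3, size=64)
W, b = np.zeros((5, 3)), np.zeros(3)
losses = []
for _ in range(50):
    W, b, loss = sgd_step(W, b, X_demo, t_demo, lr=0.5)
    losses.append(loss)
```

With zero-initialized parameters the first loss equals log(K) for K classes, which is a convenient sanity check before training on MNIST.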
Implementation trick
For softmax classification, you may encounter numerical overflow if you directly follow the equation given in
the lecture.
The observation is that the exp function grows very fast with its input, so exp(z) soon overflows to infinity
and the softmax ratio then evaluates to NaN (not a number).
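The trick is to subtract the maximum value $z_{\max} = \max_j z_j$ from every $z_k$.
In other words, we compute
$\tilde{z}_k = z_k - z_{\max}$, where $z_{\max} = \max_j z_j$, and then we have
$y_k = \frac{\exp(\tilde{z}_k)}{\sum_j \exp(\tilde{z}_j)}$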
Note that the gradient is computed with y, and since subtracting a constant before softmax doesn’t affect y, it
doesn’t affect the gradient either.
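The effect of the trick can be checked on a small example. The sketch below uses hypothetical function names and hand-picked inputs; it compares a direct softmax with the max-subtracted version on scores large enough to overflow in 64-bit floating point.

```python
import numpy as np


def softmax_naive(z):
    """Direct softmax; overflows when the entries of z are large."""
    e = np.exp(z)
    return e / e.sum()


def softmax_stable(z):
    """Softmax after subtracting the maximum entry of z."""
    e = np.exp(z - np.max(z))
    return e / e.sum()


z_large = np.array([1000.0, 1001.0, 1002.0])
z_small = np.array([0.0, 1.0, 2.0])  # same scores shifted by a constant

# Silence the overflow warnings so the NaN result can be inspected.
with np.errstate(over="ignore", invalid="ignore"):
    y_naive = softmax_naive(z_large)  # exp overflows to inf, and inf/inf is NaN

y_stable = softmax_stable(z_large)
```

Because z_large and z_small differ only by a constant, the stable result on z_large matches the direct softmax on z_small.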
[40 marks]
Without changing the default hyperparameters, report three numbers:
1. The epoch that yields the best validation performance,
2. The validation performance (accuracy) in that epoch, and
3. The test performance (accuracy) in that epoch.
and two plots:
1. The learning curve of the training cross-entropy loss, and
2. The learning curve of the validation accuracy.
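Selecting the epoch to report can be sketched as below, assuming the validation and test accuracies are recorded once per epoch. The arrays are hypothetical placeholders, not results from this assignment. The point of the sketch is that the test accuracy is read off at the epoch chosen by validation accuracy, not at the epoch where the test accuracy itself peaks.

```python
import numpy as np

# Hypothetical per-epoch records, for illustration only.
val_acc = np.array([0.80, 0.88, 0.91, 0.90])
test_acc = np.array([0.79, 0.87, 0.90, 0.91])

best_epoch = int(np.argmax(val_acc))        # 0-based index of the best validation epoch
best_val = float(val_acc[best_epoch])       # validation accuracy at that epoch
test_at_best = float(test_acc[best_epoch])  # test accuracy at that same epoch
```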
[10 marks]
Ask one meaningful scientific question yourself, design your experimental protocol, present results, and draw a
conclusion.
Note:
A scientific question means that we can give a verifiable hypothesis that can be either confirmed or rejected.
Example of a scientific question: Is the learned classifier for this dataset better than majority guess?
Your hypothesis could be either yes or no, and it can be verified by experiments.
Example of a non-scientific question: Does the learned classifier become better if I have super-power?
My hypothesis could be either yes or no, but it cannot be verified by any experiment. I don’t know what a
superpower is, so I can say yes or I can say no; neither answer can be shown to be right or wrong.
A meaningful scientific question means that you’ll learn something from the experiment. Of course, what is
meaningful itself is subjective. In terms of this coding assignment, the scientific question is considered
meaningful as long as the student would learn something, or verify some results we mentioned in lectures.
Example of a meaningful scientific question: How does linear regression perform for classification for
this dataset? Doing this experiment will give us first-hand experience on why we shall not use
regression models to do classification. However, you cannot submit this question as your own; you need to
ask your own scientific question that interests you and/or inspires others.
Example of a not-so-meaningful scientific question: Is the learned classifier for this dataset better than
majority guess? Ok, this question is considered scientific, but too trivial for us, although it is not
necessarily trivial for those who don’t know machine learning at all. [Again this shows the subjectivity of
evaluating the significance of science.]