
Homework Assignment 4
COGS 118A: Supervised Machine Learning Algorithms

Instructions: Combine your answers to the questions below with your notebook to create a single PDF file; submit your file via Gradescope.
1 (8 points) Conceptual Questions
1. Choose all valid statements about linear regression and logistic regression from the options below:
A. Linear regression is an unsupervised learning problem; logistic regression is
a supervised learning problem.
B. Linear regression deals with the prediction of continuous values; logistic
regression deals with the prediction of class labels.
C. We cannot use gradient descent to solve linear regression. Instead, we can
only use the closed-form solution to tackle the linear regression problem.
D. Linear regression is a convex optimization problem whereas logistic regression
is not.
2. Choose all valid statements about gradient descent from the options below:
A. The global minimum is guaranteed to be reached by using gradient descent.
B. Every gradient descent iteration decreases the value of the loss function, even when the gradient of the loss function is zero.
C. When the learning rate is very large, it is possible that some iterations of gradient descent may not decrease the value of the loss function.
D. With different initial weights, it is possible for the gradient descent algorithm to converge to different local minima.
2 (12 points) Error Metrics
The grid below shows 5 × 4 = 20 possible locations where targets may appear. We guess which locations contain targets and mark those cells in blue; the remaining locations, which we guess are non-targets, are left white. Cells that actually contain a target are marked with a circle, while cells that do not are left empty. Evaluate the following metrics for your guesses.
1. Compute the Recall using the following formula:
$$\text{Recall} = \frac{\text{number of true positives (correct guesses)}}{\text{number of actual targets}}$$
2. Compute the Precision using the following formula:
$$\text{Precision} = \frac{\text{number of true positives (correct guesses)}}{\text{number of guesses}}$$
3. Compute the F1 metric on this data. F1 is a way to measure classification accuracy: it is the harmonic mean of precision and recall. The F1 value is calculated using the following formula:
$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
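These three quantities can be computed mechanically once the counts are read off the grid. As a quick reference, here is a minimal Python sketch; the counts passed in at the bottom are made-up placeholders, not the values from the grid above:

```python
def f1_metrics(true_positives, num_guesses, num_actual_targets):
    """Return (precision, recall, f1) from raw counts."""
    precision = true_positives / num_guesses
    recall = true_positives / num_actual_targets
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up example counts (read the real ones off the grid):
# 3 correct guesses out of 5 guesses, with 4 actual targets.
print(f1_metrics(true_positives=3, num_guesses=5, num_actual_targets=4))
# -> (0.6, 0.75, 0.6666...)
```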
3 (10 points) Logistic Regression Inference
We have a logistic regression classifier for a 2-dimensional feature vector $x = (x_1, x_2)$:
$$p(y = +1 \mid x) = \frac{1}{1 + e^{-(2x_1 - x_2 + 1)}}, \qquad f(x) = \begin{cases} 1, & p(y = +1 \mid x) \ge 0.5, \\ -1, & \text{otherwise,} \end{cases}$$
where $f(x) \in \{-1, 1\}$ is the predicted label. Please solve the following problems.
1. Draw the decision boundary of the logistic regression classifier and shade the region where the classifier predicts 1. Make sure you have marked the $x_1$ and $x_2$ axes and the intercepts on those axes.
2. There are two new data points, $x^{(1)} = (0.5, 3)$ and $x^{(2)} = (0.5, -2)$. Use the logistic regression classifier defined above to:
a. Calculate the probability that each data point belongs to the positive class.
You may use a calculator or computer to get the numeric answer.
b. Predict the labels for the new data points.
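For part (a), the arithmetic can be sanity-checked by evaluating the sigmoid directly. Below is a minimal Python sketch (the point names mirror the problem statement; note that the decision boundary is the line $2x_1 - x_2 + 1 = 0$, i.e. $x_2 = 2x_1 + 1$):

```python
import math

def p_positive(x1, x2):
    """p(y = +1 | x) = 1 / (1 + exp(-(2*x1 - x2 + 1)))."""
    z = 2 * x1 - x2 + 1               # linear score
    return 1 / (1 + math.exp(-z))

for point in [(0.5, 3), (0.5, -2)]:
    p = p_positive(*point)
    label = 1 if p >= 0.5 else -1     # decision rule f(x)
    print(point, round(p, 4), label)
# (0.5, 3):  z = -1, p ≈ 0.2689, predicted label -1
# (0.5, -2): z =  4, p ≈ 0.9820, predicted label +1
```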
4 (40 points) Logistic Regression
Assume that in a binary classification problem we need to predict a binary label $y \in \{-1, 1\}$ for a feature vector $x = [x_0, x_1]$. In logistic regression, we can reformulate the binary classification problem in a probabilistic framework: we aim to model the distribution of classes given the input feature vector $x$. Specifically, we can express the conditional probability $p(y \mid x)$, parameterized by $(w, b)$, using the logistic function as:
$$p(y \mid x) = \frac{1}{1 + e^{-y(w \cdot x + b)}} \tag{1}$$
where $w = [w_0, w_1]$ is the weight vector and $b$ is the bias scalar. Given a training dataset $S_{\text{training}} = \{(x_i, y_i)\}, i = 1, \ldots, n$, we wish to minimize the negative log-likelihood loss $L(w, b)$:
$$L(w, b) = -\sum_{i=1}^{n} \ln p(y_i \mid x_i) \tag{2}$$
and find the optimal weight vector $w$ and bias $b$ to build the logistic regression model:
$$w^*, b^* = \arg\min_{w, b} L(w, b) \tag{3}$$
• In this problem, we attempt to obtain the optimal parameters $w^*$ and $b^*$ by using a standard gradient descent algorithm. Letting $p_i = p(y_i \mid x_i)$, the gradients for $w$ and $b$ are as follows:
$$\frac{\partial L(w, b)}{\partial w} = -\sum_{i=1}^{n} (1 - p_i)\, y_i x_i, \qquad \frac{\partial L(w, b)}{\partial b} = -\sum_{i=1}^{n} (1 - p_i)\, y_i. \tag{4}$$
In practice, we typically tackle this problem in matrix form: first, we represent the data points as matrices $X = [x_1, x_2, \ldots, x_n]^T$ and $Y = [y_1, y_2, \ldots, y_n]^T$. Thus, the negative log-likelihood loss $L(w, b)$ can be formulated as:
$$P = \mathrm{sigmoid}(Y \circ (Xw + b\mathbf{1})), \qquad L(w, b) = -\mathbf{1}^T \ln P \tag{5}$$
where $\mathbf{1} = [1, 1, \ldots, 1]^T \in \mathbb{R}^n$ is an $n$-dimensional column vector, $\ln(\cdot)$ is an element-wise natural logarithm, $\mathrm{sigmoid}(z) = \frac{1}{1 + e^{-z}}$ is an element-wise sigmoid function, and "$\circ$" is the element-wise product operator. Similarly, we can write the gradients for $w$ and $b$ in matrix form (a runnable sketch of these updates appears after this list):
$$\frac{\partial L(w, b)}{\partial w} = -X^T\big((1 - P) \circ Y\big), \qquad \frac{\partial L(w, b)}{\partial b} = -\mathbf{1}^T\big((1 - P) \circ Y\big). \tag{6}$$
• After obtaining the logistic regression model in Eq. 1 with the optimal $w^*, b^*$ from gradient descent, we can use the following decision rule to predict the label $f(x; w^*, b^*) \in \{-1, 1\}$ of the feature vector $x$:
$$f(x; w^*, b^*) = \begin{cases} 1, & \text{if } p(y = +1 \mid x) \ge 0.5 \\ -1, & \text{otherwise} \end{cases} \tag{7}$$
Besides, the error $e_{\text{training}}$ on the training set $S_{\text{training}} = \{(x_i, y_i)\}$ is defined as:
$$e_{\text{training}} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{\, y_i \neq f(x_i; w^*, b^*) \,\} \tag{8}$$
and we can define the test error $e_{\text{test}}$ on the test set $S_{\text{test}}$ in the same way.
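Before filling in the notebook, it may help to see the whole pipeline in one place. Here is a minimal NumPy sketch of gradient descent with Eqs. (5)-(7); the learning rate, iteration count, and the tiny synthetic dataset at the bottom are illustrative assumptions, not the notebook's skeleton code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, Y, lr=0.1, n_iters=1000):
    """Gradient descent on the negative log-likelihood, Eqs. (5)-(6).

    X: (n, d) feature matrix; Y: (n,) labels in {-1, +1}.
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iters):
        P = sigmoid(Y * (X @ w + b))       # Eq. (5): p_i for every point
        grad_w = -X.T @ ((1 - P) * Y)      # Eq. (6): gradient w.r.t. w
        grad_b = -np.sum((1 - P) * Y)      # Eq. (6): gradient w.r.t. b
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict(X, w, b):
    """Decision rule of Eq. (7): p(y=+1|x) >= 0.5 iff w.x + b >= 0."""
    return np.where(X @ w + b >= 0, 1, -1)

# Tiny synthetic check (illustrative data, not the course dataset):
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = np.array([-1, -1, 1, 1])
w, b = fit_logistic(X, Y)
print(predict(X, w, b))                    # expected: [-1 -1  1  1]
print(np.mean(predict(X, w, b) != Y))      # training error, Eq. (8)
```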
Please download the notebook logistic-regression.ipynb from the course Canvas
and fill in the missing blanks. Follow the instructions in the skeleton code and report:
1. Your code.
2. Equation of the decision boundary corresponding to the optimal $w^*$ and $b^*$.
3. Plot of the training set along with the decision boundary.
4. Plot of the test set along with the decision boundary.
5. Training error and test error.
6. Training loss curve.
5 (30 points) Perceptron
In this section, we will apply the perceptron learning algorithm to solve the same binary classification problem as logistic regression above: we need to predict a binary label $y \in \{-1, 1\}$ for a feature vector $x = [x_0, x_1]$. The decision rule of the perceptron model is defined as:
$$f(x; w, b) = \begin{cases} 1, & \text{if } w^T x + b \ge 0, \\ -1, & \text{otherwise,} \end{cases} \tag{9}$$
where $w = [w_0, w_1]$ is the weight vector and $b$ is the bias scalar. Given a training dataset $S_{\text{training}} = \{(x_i, y_i)\}, i = 1, \ldots, n$, we define the training error $e_{\text{training}}$ as:
$$e_{\text{training}} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{\, y_i \neq f(x_i; w, b) \,\} \tag{10}$$
and the test error $e_{\text{test}}$ on the test set $S_{\text{test}}$ can be defined in the same way. In the perceptron algorithm, we aim to directly minimize the training error $e_{\text{training}}$ in order to obtain the optimal parameters $w^*, b^*$. If we represent the data points in the training set $S_{\text{training}}$ as matrices $X = [x_1, x_2, \ldots, x_n]^T$ and $Y = [y_1, y_2, \ldots, y_n]^T$, the perceptron algorithm is as follows:
Algorithm 1 Perceptron Learning Algorithm
Data: Training set $S_{\text{training}}$, which contains feature vectors $X$ and labels $Y$.
Initialize parameters $w$ and $b$; pick a constant $\lambda \in (0, 1]$, which is similar to the step size in the standard gradient descent algorithm ($\lambda = 1$ by default).
while not every data point is correctly classified do
    for each feature vector $x_i$ and its label $y_i$ in the training set $S_{\text{training}}$ do
        compute the model prediction $f(x_i; w, b)$;
        if $y_i = f(x_i; w, b)$ then
            continue;
        else
            update the parameters $w$ and $b$:
            $w \leftarrow w + \lambda\,(y_i - f(x_i; w, b))\, x_i$
            $b \leftarrow b + \lambda\,(y_i - f(x_i; w, b))$
        end
    end
end
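As with logistic regression, it may help to see Algorithm 1 end-to-end before filling in the notebook. Here is a minimal NumPy sketch; the epoch cap `max_epochs` is an illustrative safeguard (the plain algorithm halts only when the data are linearly separable), and the tiny dataset at the bottom is made up:

```python
import numpy as np

def predict(x, w, b):
    """Decision rule of Eq. (9)."""
    return 1 if w @ x + b >= 0 else -1

def fit_perceptron(X, Y, lam=1.0, max_epochs=1000):
    """Algorithm 1: sweep the data until every point is classified correctly.

    X: (n, d) feature matrix; Y: (n,) labels in {-1, +1}.
    max_epochs is an illustrative safety cap, not part of the assignment.
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, Y):
            yhat = predict(xi, w, b)
            if yi != yhat:                   # misclassified: update
                w = w + lam * (yi - yhat) * xi
                b = b + lam * (yi - yhat)
                mistakes += 1
        if mistakes == 0:                    # every point correct: stop
            break
    return w, b

# Tiny separable check (illustrative data, not the course dataset):
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = np.array([-1, -1, 1, 1])
w, b = fit_perceptron(X, Y)
print([predict(xi, w, b) for xi in X])       # expected: [-1, -1, 1, 1]
```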
Please download the notebook perceptron.ipynb from the course website and fill in
the missing blanks. Follow the instructions in the skeleton code and report:
1. Your code.
2. Equation of the decision boundary corresponding to the optimal $w^*$ and $b^*$.
3. Plot of the training set along with the decision boundary.
3. Plot of training set along with decision boundary.
4. Plot of test set along with decision boundary.
5. Training error and test error.
6. Training error curve.
