Assignment 2
COGS 118A: Supervised Machine Learning Algorithms
Instructions: Answer the questions below, attach your code, and insert figures to create a
PDF file; submit your file via Gradescope. You may look up information on the Internet,
but you must write up the final solutions yourself.
Grade: out of 100 points
1 (10 points) Conceptual Questions
(1.1) Is the following statement true or false?
f(x) is linear with respect to x, given f(x) = w0 + w1x + w2x^2, where x, w0, w1, w2 ∈ R.
[True] [False]
(1.2) “One-hot encoding” is a standard technique that turns a categorical feature into a
vector of real numbers. Suppose we have a dataset S containing m data points, each with
1 categorical feature, and this categorical feature has k possible categories.
Thus, the shape of the one-hot encoding matrix that represents the dataset S is:
A. k × k
B. 1 × k
C. m × k
D. m × m
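For intuition (not part of the graded question), here is a minimal NumPy sketch of one-hot
encoding a hypothetical categorical column `colors`; the column name and its values are made
up for illustration:

```python
import numpy as np

# Hypothetical categorical feature with m = 5 data points and k = 3 categories.
colors = np.array(["red", "green", "blue", "green", "red"])
categories = np.unique(colors)          # the k distinct categories

# One-hot matrix: one row per data point, one column per category.
one_hot = (colors[:, None] == categories[None, :]).astype(float)
print(one_hot.shape)                    # (5, 3), i.e. m rows and k columns
```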
(1.3) Assume we have a binary classification model:

f(x) = +1, if w · x + b ≥ 0
       −1, if w · x + b < 0

where the feature vector x = (x1, x2) ∈ R^2, bias b ∈ R, and weight vector w = (w1, w2) ∈ R^2.
The decision boundary of the classification model is:

w · x + b = 0
(a) If the predictions of the classifier f and its decision boundary w · x + b = 0 are
shown in Figure 1, which one below can be a possible solution of weight vector w
and bias b?
A. w = (+1, 0), b = −1.
B. w = (−1, 0), b = +1.
C. w = (+1, 0), b = +1.
D. w = (0, −1), b = −1.
Figure 1: Decision Boundary 1
(b) If the predictions of the classifier f and its decision boundary w · x + b = 0 are
shown in Figure 2, which one below can be a possible solution of weight vector w
and bias b?
A. w = (+1, 0), b = −1.
B. w = (−1, 0), b = +1.
C. w = (+1, 0), b = +1.
D. w = (0, −1), b = −1.
Figure 2: Decision Boundary 2
(1.4) Choose the most significant difference between regression and classification:
A. unsupervised learning vs. supervised learning.
B. prediction of continuous values vs. prediction of class labels.
C. features are not one-hot encoded vs. features are one-hot encoded.
D. none of the above.
2 (25 points) Decision Boundary
2.1 (3 points)
We are given a classifier that performs classification in R^2 (the space of data points with 2
features (x1, x2)) with the following decision rule:

h(x1, x2) = 1, if 2x1 + 4x2 − 8 ≥ 0
            0, otherwise.
Draw the decision boundary of the classifier and shade the region where the classifier predicts
1. Make sure you have marked the x1 and x2 axes and the intercepts on those axes.
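If you want to double-check your hand-drawn sketch, here is a minimal matplotlib sketch of the
boundary and the region predicted as 1; the grid range is an arbitrary choice, and the same grid
trick works for the circular boundary in Section 2.4:

```python
import numpy as np
import matplotlib.pyplot as plt

# Evaluate the decision function 2*x1 + 4*x2 - 8 on a grid (range chosen arbitrarily).
x1, x2 = np.meshgrid(np.linspace(-2, 8, 400), np.linspace(-2, 8, 400))
score = 2 * x1 + 4 * x2 - 8

# Shade the region where h(x1, x2) = 1 and draw the boundary where the score is 0.
plt.contourf(x1, x2, (score >= 0).astype(float), levels=[0.5, 1.5], colors=["lightblue"])
plt.contour(x1, x2, score, levels=[0], colors="k")
plt.axhline(0, color="gray", linewidth=0.5)
plt.axvline(0, color="gray", linewidth=0.5)
plt.xlabel("x1")
plt.ylabel("x2")
plt.show()
```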
2.2 (9 points)
We are given a classifier that performs classification on R^2 (the space of data points with 2
features (x1, x2)) with the following decision rule:

h(x1, x2) = 1, if w1x1 + w2x2 + b ≥ 0
            0, otherwise.

Here, the normal vector w of the decision boundary is normalized, i.e.:

||w||_2 = sqrt(w1^2 + w2^2) = 1.
1. Compute the parameters w1, w2 and b for the decision boundary in Figure 3. Please
make sure the predictions from the obtained classifier are consistent with Figure 3.
Hint: Please use the intercepts in Figure 3 to find the relation between w1, w2 and
b. Then, substitute it into the normalization constraint to solve for the parameters.
(A minimal sketch of this procedure, with placeholder intercepts, appears after Figure 3.)
2. Use parameters from the above question to compute predictions for the following two
data points: A = (3, 6), B = (1, −4).
Figure 3: Decision boundary to solve for parameters.
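A minimal sketch of the hint's procedure, using placeholder intercepts (these are NOT the values
in Figure 3; read the actual intercepts off the figure before using anything like this):

```python
import numpy as np

# Placeholder intercepts of the boundary with the x1 and x2 axes (NOT Figure 3's values).
x1_intercept, x2_intercept = 4.0, 2.0

# A line through (x1_intercept, 0) and (0, x2_intercept) satisfies
#   x1 / x1_intercept + x2 / x2_intercept - 1 = 0,
# so an (unnormalized) choice is w = (1/x1_intercept, 1/x2_intercept), b = -1.
w_unnorm = np.array([1.0 / x1_intercept, 1.0 / x2_intercept])
b_unnorm = -1.0

# Rescale so that ||w||_2 = 1; flip the overall sign of (w, b) if the shaded
# (predicted-1) side in the figure ends up on the wrong side of the boundary.
scale = np.linalg.norm(w_unnorm)
w, b = w_unnorm / scale, b_unnorm / scale

def h(x):
    """Decision rule of Section 2.2."""
    return int(w @ x + b >= 0)

print(w, b, h(np.array([3.0, 6.0])), h(np.array([1.0, -4.0])))
```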
2.3 (10 points)
We are given a classifier that performs classification on R^3 (the space of data points with 3
features (x1, x2, x3)) with the following decision rule:

h(x1, x2, x3) = 1, if w1x1 + w2x2 + w3x3 + b ≥ 0
                0, otherwise.

Here, the normal vector w of the decision boundary is normalized, i.e.:

||w||_2 = sqrt(w1^2 + w2^2 + w3^2) = 1.
In addition, we set b ≤ 0 to have a unique equation for the decision boundary.
1. Compute the parameters w1, w2, w3 and b for the decision boundary that passes through
three points A = (3, 2, 4), B = (−1, 0, 2), C = (4, 1, 5) in Figure 4.
Hint: Please use the intercepts in Figure 4 to find the relation between w1, w2, w3
and b. Then, substitute it into the normalization constraint to solve for the parameters.
(A sketch that instead uses the three given points directly appears after Figure 4.)
2. Use parameters from the above question to compute predictions for the following two
data points: D = (0, 0, 0), E = (1, 0, 5).
Figure 4: Decision boundary to solve for the parameters.
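One possible sketch that uses the three given points directly rather than the figure's intercepts
(a cross product recovers the plane's normal; the hint's intercept-based route should give the
same plane):

```python
import numpy as np

A = np.array([3.0, 2.0, 4.0])
B = np.array([-1.0, 0.0, 2.0])
C = np.array([4.0, 1.0, 5.0])

# Normal of the plane through A, B, C, and the offset so that the plane is w . x + b = 0.
normal = np.cross(B - A, C - A)
offset = -normal @ A

# Rescale so that ||w||_2 = 1, then flip the sign if needed so that b <= 0.
scale = np.linalg.norm(normal)
w, b = normal / scale, offset / scale
if b > 0:
    w, b = -w, -b

def h(x):
    """Decision rule of Section 2.3."""
    return int(w @ x + b >= 0)

print(w, b, h(np.array([0.0, 0.0, 0.0])), h(np.array([1.0, 0.0, 5.0])))
```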
2.4 (3 points)
We are given a classifier that performs classification in R^2 (the space of data points with 2
features (x1, x2)) with the following decision rule:

h(x1, x2) = 1, if x1^2 + x2^2 − 10 ≥ 0
            0, otherwise.
Draw the decision boundary of the classifier and shade the region where the classifier predicts
1. Make sure you have marked the x1 and x2 axes and the intercepts on those axes.
3 (10 points) Derivatives
3.1 Function Defined by Scalars
1. (3 points) Given a function f(w) = (y1 + wx1)^2 where (x1, y1) = (3, 4) represents a data
point, derive ∂f(w)/∂w.
2. (3 points) Given a function f(w) = Σ_{i∈{1,2}} (yi − wxi)^2 where (x1, y1) = (1, 1), (x2, y2) =
(2, 3) are two data points, derive ∂f(w)/∂w.
3.2 Function Defined by Vectors
1. (4 points) Given a function f(w) = (y − wx)^T (y − wx) where x = [1, 2]^T and y = [1, 3]^T,
derive ∂f(w)/∂w.
Note: In f(w), w ∈ R is still a scalar.
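If you would like to sanity-check your hand derivations, a minimal SymPy sketch follows; this is
purely optional, since the assignment expects the derivatives to be derived by hand:

```python
import sympy as sp

w = sp.symbols("w", real=True)

# 3.1, part 1: f(w) = (y1 + w*x1)^2 with (x1, y1) = (3, 4).
f1 = (4 + w * 3) ** 2
print(sp.expand(sp.diff(f1, w)))

# 3.1, part 2: f(w) = sum of (yi - w*xi)^2 over (1, 1) and (2, 3).
f2 = sum((yi - w * xi) ** 2 for xi, yi in [(1, 1), (2, 3)])
print(sp.expand(sp.diff(f2, w)))

# 3.2, part 1: f(w) = (y - w*x)^T (y - w*x) with x = [1, 2]^T and y = [1, 3]^T.
x = sp.Matrix([1, 2])
y = sp.Matrix([1, 3])
f3 = ((y - w * x).T * (y - w * x))[0, 0]
print(sp.expand(sp.diff(f3, w)))
```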
4 (9 points) Concepts
Select the correct option(s). Note that there might be multiple correct options.
1. For two monotonically increasing functions f(x) and g(x):
A. f(x) + g(x) is always monotonically increasing.
B. f(x) − g(x) is always monotonically increasing.
C. f(x^2) is always monotonically increasing.
D. f(x^3) is always monotonically increasing.
2. For a function f(x) = x(10 − x), x ∈ R, please choose the correct statement(s) below:
A. arg max_x f(x) = 5.
B. arg min_x f(x) = 25.
C. min_x f(x) = 5.
D. max_x f(x) = 25.
3. Assume we have a function f(x) which is differentiable at every x ∈ R. There are three
properties that describe the function f(x):
(1) f(x) is a convex function.
(2) When x = x0, f′(x0) = 0.
(3) f(x0) is a global minimum of f(x).
Which one of the following statements is wrong?
Hint: You can use a failure case to disprove a statement.
A. Given (1) and (2), we can prove that (3) holds.
B. Given (2) and (3), we can prove that (1) holds.
C. Given (1) and (3), we can prove that (2) holds.
5 (4 points) Argmin and Argmax
An unknown estimator is given an estimation problem to find the minimizer and maximizer
of the objective function G(w) ∈ (0, 2]:
(wa, wb) = (arg min_w G(w), arg max_w G(w)).    (1)
The solution to Eq. 1 by the estimator is (wa, wb) = (10, 20).
Given this information, please obtain the value of w* such that:

w* = arg min_w [10 − 4 × ln(G(w))].    (2)
6 (12 points) Data Manipulation
In this question, we still use the Iris dataset from Homework 1. You can verify that the shape
of the array X is (150, 4) by running X.shape, which means it contains 150 data points, each
with 4 features. Here, we will perform some basic data manipulation and calculate some
statistics (a minimal NumPy sketch of the slicing involved follows the list):
1. Divide array X evenly into five subsets of data points:
Group 1: 1st to 30th data point,
Group 2: 31st to 60th data point,
Group 3: 61st to 90th data point,
Group 4: 91st to 120th data point,
Group 5: 121st to 150th data point.
Then calculate the mean of feature vectors in each group. Your results should be five
4-dimensional vectors (i.e. shape of NumPy array can be (4,1), (1,4) or (4,)).
2. Remove the 2nd and 3rd features from array X, resulting in a 150 × 2 matrix. Then calculate
the mean of all feature vectors. Your result should be a 2-dimensional vector.
3. Remove the last 10 data points from array X, resulting in a 140 × 4 matrix. Then calculate
the mean of the feature vectors. Your result should be a 4-dimensional vector.
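A minimal NumPy sketch of the slicing involved, assuming X is the same (150, 4) Iris array as in
Homework 1 (load it however you did there; scikit-learn's load_iris is one option):

```python
import numpy as np
from sklearn.datasets import load_iris

# Assumption: X is the (150, 4) Iris feature array used in Homework 1.
X = load_iris().data
assert X.shape == (150, 4)

# 1. Five groups of 30 consecutive data points, and the mean feature vector of each group.
group_means = [X[i:i + 30].mean(axis=0) for i in range(0, 150, 30)]   # five (4,) vectors

# 2. Keep only the 1st and 4th features (i.e. drop the 2nd and 3rd columns), then average.
mean_two_features = X[:, [0, 3]].mean(axis=0)                          # (2,) vector

# 3. Drop the last 10 data points, then average.
mean_first_140 = X[:-10].mean(axis=0)                                  # (4,) vector
```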
7 (15 points) Training vs. Testing Errors
In this problem, we are given two trained predictive models on a modified Iris dataset. Each
data point (xi, yi) has a feature vector xi ∈ R^4 and a corresponding label yi ∈ {0, 1}, where
i ∈ {1, 2, . . . , 150}. To predict on new data, we consider two types of model: a
regression model and a classification model. The regression model is trained to predict a real
number, while the classification model applies a threshold to the output of the regression
model, converting the real number into a binary value.
The regression model is as follows:

ŷi(xi) = w^T xi + b

The classifier is as follows:

h(xi) = 1, if ŷi(xi) ≥ 1/2
        0, otherwise.

where w = [0.1297, 0.1225, −0.1171, 0.6710]^T, b = −1.1699.
The regression error is defined as:

sqrt( (1/n) Σ_{i=1}^{n} (ŷi − yi)^2 )

and the classification error is defined as:

(1/n) Σ_{i=1}^{n} 1(h(xi) ≠ yi)

where n is the number of data points.
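A minimal NumPy sketch of the model and the two error functions above; the names X_train,
y_train, X_test, y_test are placeholders for whatever the provided notebook actually defines:

```python
import numpy as np

# Parameters given above.
w = np.array([0.1297, 0.1225, -0.1171, 0.6710])
b = -1.1699

def regression_predict(X):
    """y_hat = w^T x + b for each row of X, where X has shape (n, 4)."""
    return X @ w + b

def classify(X):
    """Threshold the regression output at 1/2."""
    return (regression_predict(X) >= 0.5).astype(int)

def regression_error(X, y):
    """Root of the mean squared difference between y_hat and y."""
    return np.sqrt(np.mean((regression_predict(X) - y) ** 2))

def classification_error(X, y):
    """Fraction of points where h(x) disagrees with y."""
    return np.mean(classify(X) != y)

# Usage, once the notebook's split is loaded (placeholder names):
# print(regression_error(X_train, y_train), regression_error(X_test, y_test))
# print(classification_error(X_train, y_train), classification_error(X_test, y_test))
```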
The data as well as the split into training and testing sets are given in the Jupyter notebook we
provided. You should not use the scikit-learn library.
Please download the notebook training_test_errors.ipynb from the course website and fill
in the missing blanks. Follow the instructions in the skeleton code and report:
• Training error of the regression model.
• Testing error of the regression model.
• Training error of the classification model.
• Testing error of the classification model.
8 (15 points) Linear Regression
Assume we are given a dataset S = {(xi, yi), i = 1, . . . , n}. Here, xi ∈ R is a feature
scalar (a.k.a. value of the input variable) and yi ∈ R is its corresponding value (a.k.a. value of
the dependent variable). In this section, we aim to fit the data points with a line:
y = w0 + w1x (3)
where w0, w1 ∈ R are two parameters that determine the line. Next, we measure the quality of
the fit by evaluating a sum-of-squares error function g(w0, w1):

g(w0, w1) = Σ_{i=1}^{n} (w0 + w1xi − yi)^2    (4)
When g(w0, w1) is near zero, it means the proposed line can fit the dataset and model an
accurate relation between xi and yi. The best line, with parameters (w0*, w1*), can reach the
minimum value of the error function g(w0, w1):

(w0*, w1*) = arg min_{w0, w1} g(w0, w1)    (5)

To obtain the parameters of the best line, we will take the gradient of the function g(w0, w1) and
set it to zero. That is:

∇g(w0, w1) = 0    (6)
The solution (w0*, w1*) of the above equation will determine the best line y = w0* + w1*x that
fits the dataset S.
In reality, we typically tackle this task in matrix form: First, we represent the data points
as matrices X = [x1, x2, . . . , xn]^T and Y = [y1, y2, . . . , yn]^T, where xi = [1, xi]^T is the feature
vector corresponding to xi. The parameters of the line are also represented as a matrix
W = [w0, w1]^T. Thus, the sum-of-squares error function g(W) can be defined as (a.k.a. the
squared L2 norm):

g(W) = Σ_{i=1}^{n} (xi^T W − yi)^2    (7)
     = ||XW − Y||_2^2    (8)
     = (XW − Y)^T (XW − Y)    (9)

Similarly, the parameters W* = [w0*, w1*]^T of the best line can be obtained by solving the
equation below:

∇g(W) = ∂g(W)/∂W = 0    (10)
(a) According to Eq. 8 and 9, compute the gradient of g(W) with respect to W. Your result
should be in the form of X, Y and W.
(b) By setting the answer of part (a) to 0, prove the following:

W* = arg min_W g(W) = (X^T X)^{−1} X^T Y    (11)

Note: The above formula demonstrates a closed-form solution of Eq. 10.
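A minimal NumPy sketch of the closed form in Eq. 11 on a tiny made-up dataset; the numbers are
placeholders for illustration, not assignment data:

```python
import numpy as np

# Tiny illustrative dataset (placeholder values).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

# Design matrix X with rows [1, x_i], and target vector Y.
X = np.column_stack([np.ones_like(x), x])
Y = y

# Closed-form solution W* = (X^T X)^{-1} X^T Y; np.linalg.solve avoids an explicit inverse.
W_star = np.linalg.solve(X.T @ X, X.T @ Y)
print(W_star)                                   # [w0*, w1*]

# Cross-check against NumPy's least-squares solver.
print(np.linalg.lstsq(X, Y, rcond=None)[0])
```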
Previously, we defined a sum-of-squares error function g(w0, w1) = Σ_{i=1}^{n} (yi − w0 − w1xi)^2 and
represented it in matrix form as g(W) = ||XW − Y||_2^2. Actually, we can make other choices
of error function: for example, we can define a sum-of-absolute-errors function h(w0, w1):

h(w0, w1) = Σ_{i=1}^{n} |w0 + w1xi − yi|    (12)

and represent it in matrix form as h(W) (a.k.a. the L1 norm):

h(W) = Σ_{i=1}^{n} |xi^T W − yi|    (13)
     = ||XW − Y||_1    (14)
(c) According to Eq. 13, compute the gradient of the error function h(W) with respect
to W. Your result should be in the form of xi, yi and W.

Hint: Given a function f(x) ∈ R, we have:

∂|f(x)|/∂x = sign(f(x)) ∂f(x)/∂x

where

sign(x) = 1, x > 0
          0, x = 0
          −1, x < 0.
While we can represent the problem as in (c), where the gradient is calculated for each input xi,
it can be very useful to calculate the gradient over the entire dataset. While there is no graded
problem here for you to solve, we want you to know that problem (c) can be re-written in terms of
the entire training set X as:

∇h(W) = ∂h(W)/∂W = [sign(XW − Y)]^T X    (15)

where sign(A) means performing element-wise sign(aij) over all elements aij of a matrix A.
This matrix form of the gradient will be useful in next week's homework.
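A minimal NumPy sketch of evaluating Eq. 15 numerically, again on placeholder data:

```python
import numpy as np

# Placeholder design matrix, targets, and parameters (illustration only).
X = np.column_stack([np.ones(4), np.array([0.0, 1.0, 2.0, 3.0])])
Y = np.array([1.1, 2.9, 5.2, 6.8])
W = np.array([0.5, 1.0])

# Eq. 15: element-wise sign of the residuals XW - Y, multiplied by X.
grad_h = np.sign(X @ W - Y) @ X      # one entry per parameter in W
print(grad_h)
```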