Department of Electrical Engineering and Computer Science
EECS 445 Introduction to Machine Learning
Homework 1
Submission: Please upload your completed assignment to Gradescope. Include all code as an
appendix following your write-up.
1 Vectors & Planes Review (4 pts)
(a) (1pt) Give a formula for computing the angle between two d-dimensional vectors $\bar{x}^{(1)}$ and $\bar{x}^{(2)}$. If $\bar{x}^{(1)} = [-2, -1.5]$ and $\bar{x}^{(2)} = [-3, 4]$, what is the magnitude of each vector and what is the angle between them?
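To sanity-check your by-hand answer to part (a), a minimal NumPy sketch (assuming the usual dot-product relation $\cos\alpha = \frac{\bar{x}^{(1)} \cdot \bar{x}^{(2)}}{\|\bar{x}^{(1)}\|\,\|\bar{x}^{(2)}\|}$) could look like:

    import numpy as np

    x1 = np.array([-2.0, -1.5])
    x2 = np.array([-3.0, 4.0])

    # Euclidean magnitudes
    m1, m2 = np.linalg.norm(x1), np.linalg.norm(x2)

    # Angle via cos(alpha) = (x1 . x2) / (||x1|| ||x2||)
    alpha = np.arccos(np.dot(x1, x2) / (m1 * m2))
    print(m1, m2, np.degrees(alpha))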
(b) (1pt) Consider a hyperplane $\bar{p} \in \mathbb{R}^d$ that passes through the origin, which includes all points $\bar{x}$ such that $\theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_d x_d = 0$. We can say that the d-dimensional vector $\bar{\theta} = [\theta_1, \ldots, \theta_d]^T$ describes $\bar{p}$. Give a d-dimensional vector that is orthogonal to $\bar{p}$. How many such alternative possible vectors are there? Explain.
(c) Consider an arbitrary d-dimensional point $\bar{x} \in \mathbb{R}^d$ and a hyperplane through the origin described by the normal vector $\bar{\theta} = [\theta_1, \ldots, \theta_d]^T$. The signed distance of $\bar{x}$ from the plane is the perpendicular distance between $\bar{x}$ and the hyperplane, multiplied by +1 if $\bar{x}$ lies on the same side of the plane as the vector $\bar{\theta}$ points and by -1 if $\bar{x}$ lies on the opposite side.
(i) (1pt) Derive the expression for the signed distance of $\bar{x}$ from the hyperplane.
(ii) (0.5pt) Now consider a hyperplane with offset. How does the solution from part (i) change?
(iii) (0.5pt) Let $\bar{p}$ be the plane consisting of the set of points for which $x_1 - 2x_2 + 1 = 0$. What is the signed perpendicular distance of point $\bar{a} = [-7, 9]^T$ from $\bar{p}$? What is the signed perpendicular distance of the origin from $\bar{p}$?
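Once you have derived the signed-distance expression in parts (i) and (ii), a short NumPy sketch can be used to check your numbers for part (iii). The sketch below assumes the plane is written in the form $\bar{\theta} \cdot \bar{x} + b = 0$ and that your derived expression matches the one coded here; treat it as a numerical cross-check, not as the requested derivation.

    import numpy as np

    # The plane x1 - 2*x2 + 1 = 0, written as theta . x + b = 0
    theta = np.array([1.0, -2.0])
    b = 1.0

    def signed_distance(x, theta, b):
        # Perpendicular distance, signed by which side of the plane x lies on
        return (np.dot(theta, x) + b) / np.linalg.norm(theta)

    print(signed_distance(np.array([-7.0, 9.0]), theta, b))
    print(signed_distance(np.zeros(2), theta, b))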
2 Linear Algebra Review (6pts)
For each statement below, either prove the statement is true or give a counterexample demonstrating that it
is false.
(a) (2pts) For $A, B \in \mathbb{R}^{n \times m}$, $A^T + B^T = (A + B)^T$.
(b) (2pts) The inverse of a symmetric matrix is still symmetric.
(c) (2pts) Let $A, B \in \mathbb{R}^{n \times n}$ be invertible matrices. Then $(AB)^{-1} = B^{-1}A^{-1}$.
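Before writing a proof or hunting for a counterexample, it can help to probe a statement numerically on random matrices: agreement on random inputs is not a proof, but a single failure hands you a counterexample. A minimal sketch for statement (a):

    import numpy as np

    np.random.seed(0)
    A = np.random.randn(3, 4)
    B = np.random.randn(3, 4)

    # Numerical check of statement (a); np.allclose guards against round-off
    print(np.allclose(A.T + B.T, (A + B).T))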
3 Decision Boundaries (4 pts)
Note: Throughout this course, we will use the convention that points that lie on the decision boundary are
misclassified.
(a) (2 pts) Consider the AND function defined over three binary variables: $f(x_1, x_2, x_3) = x_1 \wedge \neg x_2 \wedge x_3$. We aim to find a $\bar{\theta} \in \mathbb{R}^3$ such that for any $\bar{x} = [x_1, x_2, x_3]$, where $x_i \in \{0, 1\}$:
$\bar{\theta} \cdot \bar{x} + b > 0$ when $f(x_1, x_2, x_3) = 1$, and
$\bar{\theta} \cdot \bar{x} + b < 0$ when $f(x_1, x_2, x_3) = 0$.
(i) If $b = 0$ (i.e., no offset), would it be possible to learn such a $\bar{\theta}$? Explain.
(ii) How about if a non-zero offset is allowed? Provide an example of such $\bar{\theta}$ and $b$, or explain why it is not possible.
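One way to test any candidate $(\bar{\theta}, b)$ you come up with is to enumerate all eight binary inputs. The sketch below does exactly that; the parameters passed in the example call are deliberately arbitrary placeholders, not a suggested answer.

    import itertools
    import numpy as np

    def f(x1, x2, x3):
        # Target function: x1 AND (NOT x2) AND x3
        return int(x1 and (not x2) and x3)

    def classifies_correctly(theta, b):
        # Points exactly on the boundary (score == 0) count as misclassified
        for x in itertools.product([0, 1], repeat=3):
            score = np.dot(theta, x) + b
            if f(*x) == 1 and not (score > 0):
                return False
            if f(*x) == 0 and not (score < 0):
                return False
        return True

    # Example usage with placeholder parameters (replace with your candidate)
    print(classifies_correctly(np.array([1.0, 1.0, 1.0]), 0.0))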
(b) (2 pts) You are given the following labeled data points:
• Negative examples: [1, −3] and [−2, −1].
• Positive examples: [0, 1] and [2, −2].
For each of the following parameterized families of classifiers, give an example (find the parameters)
of a family member that can correctly classify the above data, or explain why no such family member
exists.
(i) Above or below a line through the origin with normal [x, y].
(ii) Above or below a line with normal [x, y] and offset b.
(iii) Inside or outside of an origin-centered square with sides parallel to the axes and of length s.
(iv) Inside or outside of an [x, y]-centered square with sides parallel to the axes and of length s.
4 Perceptron Algorithm with Offset (5 pts)
Consider a sequence of 2-dimensional data points, $\bar{x}^{(1)}, \bar{x}^{(2)}, \ldots, \bar{x}^{(n)}$, and their corresponding labels, $y^{(1)}, y^{(2)}, \ldots, y^{(n)}$. Recall that the perceptron algorithm updates the parameters whenever $y^{(i)} \neq h(\bar{x}^{(i)}; \bar{\theta}, b)$, where $h(\bar{x}^{(i)}; \bar{\theta}, b) = \mathrm{sign}(\bar{\theta} \cdot \bar{x}^{(i)} + b)$. Assume that the points are linearly separable, and that both $\bar{\theta}$ and $b$ are initialized to zero. Let $\alpha_i$ denote the number of times $\bar{x}^{(i)}$ is misclassified during training.
(a) (1 pt) Derive the final decision boundary for the perceptron in terms of $\alpha_i$, $\bar{x}^{(i)}$, and $y^{(i)}$.
(b) (1 pt) Show that the shortest signed distance from the boundary to the origin is equal to $\frac{b}{\|\bar{\theta}\|}$.
(c) (2 pts) Table 1 shows a dataset and the number of times each point is misclassified when one applies the perceptron algorithm (with offset).

    i    x^(i)      y     times misclassified
    1    [0, -1]     1     0
    2    [2, -1]     1     2
    3    [1, 1]      1     9
    4    [2, 2]     -1     4
    5    [3, 1]     -1     2
    6    [4, 2]     -1     0

Table 1: Dataset and misclassification counts for 4(c)
Assuming $\bar{\theta}$ and $b$ are initialized to zero, what are the values of $\bar{\theta}$ and $b$ after training?
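If you want to cross-check your answer by simulation, a minimal perceptron-with-offset sketch following the update rule above is given below. The pass order over the data is an assumption (the counts in Table 1 were produced by one particular ordering), so your simulated mistake counts may differ from the table.

    import numpy as np

    X = np.array([[0, -1], [2, -1], [1, 1], [2, 2], [3, 1], [4, 2]], dtype=float)
    y = np.array([1, 1, 1, -1, -1, -1])

    theta = np.zeros(2)
    b = 0.0
    mistakes = np.zeros(len(y), dtype=int)

    converged = False
    while not converged:
        converged = True
        for i in range(len(y)):
            # A point on the boundary (score == 0) counts as misclassified
            if y[i] * (np.dot(theta, X[i]) + b) <= 0:
                theta += y[i] * X[i]
                b += y[i]
                mistakes[i] += 1
                converged = False

    print(theta, b, mistakes)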
(d) (1 pt) Given a set of linearly separable points, does the order in which the points are presented to the algorithm affect whether or not it will converge? In general, could the order affect the total number of mistakes made?
5 Linear Regression - Optimization Methods (11 pts)
For this question, we have provided you with skeleton code q5_linear_regression.py. We have also provided data files, q5_train.csv and q5_test.csv, which specify a linear regression problem for a polynomial. The first column of each file represents the outputs ($y^{(i)} \in \mathbb{R}$) while the second column represents the input ($x^{(i)} \in \mathbb{R}$), with one training/testing example per row.
(a) You will start by implementing two different optimization methods to find the coefficients $\theta_1$ and $\theta_0$ (slope and intercept, respectively) that minimize the error for a first-degree polynomial, i.e., $y = \theta_1 x + \theta_0$.
i. (2 pts) First, implement the function generate_polynomial_features(X, M) according to the specification in the skeleton code. This function creates an (M+1)-dimensional feature vector $\phi(x^{(i)})$ for each example $x^{(i)}$ in X. Next, implement ls_stochastic_gradient_descent(X, y) and find the coefficients (slope and intercept) that minimize the error for a first-degree polynomial. For this problem you will have to choose a learning rate (or step size) $\eta$. Try different values of $\eta \in \{10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}\}$. Note the convergence criteria specified in the code: your implementation of SGD should terminate either when there is only marginal improvement in the loss ($< 10^{-10}$) over the course of a single iteration, or after 100000 iterations, whichever happens first. How fast does the algorithm converge with each step size (in terms of the number of iterations, where one iteration is one loop over the entire dataset)? (A sketch of one possible implementation is given after part (a).)
ii. (2 pts) Implement the function closed_form_optimization(X, y) based on the closed-form solution presented in lecture (ignore the reg_param argument for now) and use it to find the coefficients (slope and intercept) that minimize the error for a first-degree polynomial. In this particular example, how does the runtime of the closed-form solution compare to stochastic gradient descent? Which learning rate used in part (i) produces the coefficients closest to the closed-form solution? (A sketch is also given after part (a).)
iii. (1 pt) Propose a learning rate $\eta_k$ that is a function of $k$ (the iteration number). Are the coefficients produced by your proposed learning rate close to those given by the closed-form formula? How long does the algorithm (in terms of number of iterations) take to converge with your proposed learning rate?
Note that you may find it useful to generate 2D plots of the training data and the output of each regression function given the training data.
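The following is a minimal sketch of the two functions referenced in part (a)(i). It is not the provided skeleton, which defines the authoritative interfaces, loss, and convergence check; the signatures and defaults below are assumptions made for illustration.

    import numpy as np

    def generate_polynomial_features(X, M):
        # Map each scalar x to [1, x, x^2, ..., x^M]; result has shape (n, M+1)
        X = np.asarray(X, dtype=float).ravel()
        return np.vander(X, M + 1, increasing=True)

    def ls_stochastic_gradient_descent(X, y, eta=1e-2, max_iters=100000, tol=1e-10):
        # One possible SGD loop for least squares on phi(x) = [1, x]
        Phi = generate_polynomial_features(X, 1)
        theta = np.zeros(Phi.shape[1])
        prev_loss = np.inf
        for _ in range(max_iters):
            for phi_i, y_i in zip(Phi, y):
                # Gradient of (1/2)(theta . phi_i - y_i)^2 with respect to theta
                theta -= eta * (np.dot(theta, phi_i) - y_i) * phi_i
            loss = 0.5 * np.sum((Phi @ theta - y) ** 2)
            if prev_loss - loss < tol:
                break
            prev_loss = loss
        return theta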
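For part (a)(ii), the unregularized closed-form (normal-equation) solution from lecture can be written in a few lines. Again, the name and signature mirror the problem statement but may not match the skeleton exactly.

    import numpy as np

    def closed_form_optimization(X, y, M=1):
        # Feature map [1, x, ..., x^M], then solve the normal equations
        # Phi^T Phi theta = Phi^T y
        Phi = np.vander(np.asarray(X, dtype=float).ravel(), M + 1, increasing=True)
        return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)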
(b) Next, you will investigate the problem of overfitting. Here, you will consider overfitting as a function of the degree of the polynomial, M, where the Root-Mean-Square (RMS) error is defined as
$$E_{RMS} = \sqrt{E(\bar{\theta})/N},$$
where
$$E(\bar{\theta}) = \sum_{i=1}^{N} \left(\bar{\theta} \cdot \phi(x^{(i)}) - y^{(i)}\right)^2$$
and where N is the number of examples.
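A direct transcription of this definition is shown below; it assumes the first argument is already the feature matrix whose i-th row is $\phi(x^{(i)})$, which may differ from what the skeleton's calculate_RMS_Error expects.

    import numpy as np

    def calculate_RMS_Error(Phi, y, theta):
        # E(theta) = sum of squared residuals; E_RMS = sqrt(E(theta) / N)
        residuals = Phi @ theta - y
        return np.sqrt(np.sum(residuals ** 2) / len(y))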
i. (1 pt) Implement the function calculate_RMS_Error(X, y, theta) according to the specification given in the skeleton code. Using this function and the closed-form solution from part
(a), find the coefficients that minimize the error for an Mth-degree polynomial (for M = 0, ..., 10) for the training data specified in q5_train.csv. Calculate the RMS error, for each setting of M, on the training data and on the test data q5_test.csv. Plot $E_{RMS}$ against M for both training data and test data (in the same graph) and include it in your solution write-up. Your graph should look similar to the following:
[Example figure: $E_{RMS}$ versus M for the training and test sets.]
ii. (1 pt) Which degree polynomial would you say best fits the data? Was there any evidence of under/overfitting the data? Use your generated chart to defend your answer.
(c) Finally, you will explore the role of regularization.
i. (1 pt) Modify your implementation of closed_form_optimization(X, y, reg_param=0) from part (a) to incorporate L2-regularization. Specifically, use the following regularized objective function for optimizing the parameters $\bar{\theta}$:
$$\frac{1}{2}\sum_{i=1}^{N}\left(\bar{\theta} \cdot \phi(x^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2}\|\bar{\theta}\|_2^2.$$
(A sketch of the regularized closed form is given at the end of this problem.)
ii. (1 pt) Use your function from part (i) to find the coefficients that minimize the error for a tenth-degree polynomial (M = 10) given regularization factor $\lambda$, for $\lambda \in \{0, 10^{-8}, 10^{-7}, \ldots, 10^{-1}, 10^{0}\}$, for the training data specified in q5_train.csv. Now use these coefficients to calculate the RMS error (unregularized) on both the training data and test data as a function of $\lambda$, and plot $E_{RMS}$ against $\lambda \in \{0, 10^{-8}, 10^{-7}, \ldots, 10^{-1}, 10^{0}\}$.
iii. (1 pt) Which λ value appears to work the best?
iv. (1pt) Suppose that instead of a tenth-degree polynomial (M = 10), you trained a twentieth-degree polynomial (M = 20) model. Assume that you also used the same optimal regularization factor $\lambda$ that you found in the previous part. Which of the two models would likely perform better (have lower RMS error) on the training data, and why? Which of the two models would likely perform better on the testing data, and why? Please limit your answer to 3-4 sentences.
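For part (c)(i), setting the gradient of the regularized objective to zero yields a ridge-regression-style closed form. The sketch below is one way to write it; whether the skeleton expects the bias term to be regularized as well is an assumption you should check against the spec.

    import numpy as np

    def closed_form_optimization(X, y, M=10, reg_param=0.0):
        # Regularized normal equations: (Phi^T Phi + lambda * I) theta = Phi^T y
        Phi = np.vander(np.asarray(X, dtype=float).ravel(), M + 1, increasing=True)
        A = Phi.T @ Phi + reg_param * np.eye(Phi.shape[1])
        return np.linalg.solve(A, Phi.T @ y)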
6 Logistic Regression (6 pts)
Logistic regression for a binary classification task is defined as follows: with $\bar{x} \in \mathbb{R}^d$ and $y \in \{-1, +1\}$, the classifier $h(\bar{x}; \bar{\theta}) = \sigma(\bar{\theta} \cdot \bar{x}) = \frac{1}{1 + \exp(-\bar{\theta} \cdot \bar{x})}$ outputs the probability that an observation $\bar{x}$ is classified with label $y = +1$. Suppose we have a training set of N independent observations, $S = \{\bar{x}^{(i)}, y^{(i)}\}_{i=1}^{N}$. Recall that $\sigma(-z) = 1 - \sigma(z)$.
(a) (2pts) The empirical risk with logistic loss is defined as
$$J(\bar{\theta}) = \frac{1}{N}\sum_{i=1}^{N} \ln\left(1 + e^{-y^{(i)}\,\bar{\theta} \cdot \bar{x}^{(i)}}\right).$$
Find $\nabla_{\bar{\theta}} J(\bar{\theta})$.
(b) (2pts) Similar to gradient descent, Newton's method is also an iterative method for optimization. The update rule for Newton's method is
$$\bar{\theta}^{(k+1)} = \bar{\theta}^{(k)} - \left(H^{-1}\,\nabla_{\bar{\theta}} J(\bar{\theta})\right)\Big|_{\bar{\theta} = \bar{\theta}^{(k)}},$$
where H is the Hessian of the empirical risk.
Find the Hessian H for the logistic regression empirical risk; specifically, show that the (j, k)-th entry ($1 \le j, k \le d$) of the Hessian matrix can be written as
$$H_{jk} = \frac{1}{N}\sum_{i=1}^{N} \sigma\left(y^{(i)}\,\bar{\theta} \cdot \bar{x}^{(i)}\right)\left(1 - \sigma\left(y^{(i)}\,\bar{\theta} \cdot \bar{x}^{(i)}\right)\right) x_j^{(i)} x_k^{(i)},$$
and then write down an expression for computing the matrix H.
(c) (1pt) The Hessian matrix H has two important applications in calculus: quadratic approximations of multivariable functions and the second derivative test. In the second application, the Hessian matrix can be used to test whether a function is concave or convex, and whether a critical point (a point whose gradient is zero) is a local minimum, a local maximum, or a saddle point.
Show that the Hessian H is positive semi-definite, i.e., show that $\bar{v}^T H \bar{v} \ge 0$ for any vector $\bar{v} \in \mathbb{R}^d$. Conclude that $J(\bar{\theta})$ is convex and has no local minima other than the global one (you don't need to prove this conclusion).
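If you want a numerical sanity check of the entrywise formula from part (b) and of positive semi-definiteness (this does not replace the proof), a sketch on random data might be:

    import numpy as np

    np.random.seed(0)
    N, d = 50, 3
    X = np.random.randn(N, d)               # rows are x^(i)
    y = np.random.choice([-1, 1], size=N)
    theta = np.random.randn(d)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Hessian via the entrywise formula: (1/N) sum_i s_i (1 - s_i) x^(i) x^(i)^T
    s = sigmoid(y * (X @ theta))
    H = (X.T * (s * (1 - s))) @ X / N

    # v^T H v should be non-negative for any v
    v = np.random.randn(d)
    print(v @ H @ v >= 0)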
(d) (1pt) We will now use Newton's method to minimize our empirical risk. Using your answers from parts (a) and (b), write down the update rule implied by Newton's method. (Note: Your solution can contain a matrix inverse; there is no need to simplify further. The final expression should be in terms of $\bar{x}^{(i)}$, $y^{(i)}$, $\bar{\theta}$, N, and the sigmoid function.)
7 Weighted Linear Regression (5 pts)
Consider a linear regression problem in which we want to “weigh” individual training examples differently.
Specifically, suppose we want to minimize the cost function for weighted linear regression
$$J(\bar{\theta}) = \frac{1}{2}\sum_{i=1}^{n} w^{(i)}\left(\bar{\theta} \cdot \bar{x}^{(i)} - y^{(i)}\right)^2,$$
where $S_n = \{\bar{x}^{(i)}, y^{(i)}\}_{i=1}^{n}$ is the set of training examples.
(a) (2pts) Show that $J(\bar{\theta})$ can also be written as
$$J(\bar{\theta}) = (X\bar{\theta} - y)^T W (X\bar{\theta} - y)$$
for an appropriate diagonal matrix W, and where $X = [\bar{x}^{(1)}, \bar{x}^{(2)}, \ldots, \bar{x}^{(n)}]^T$ and $y = [y^{(1)}, y^{(2)}, \ldots, y^{(n)}]^T$.
Clearly state what W is.
(b) (2pts) Find the closed-form solution for $\bar{\theta}$ by taking the gradient of $J(\bar{\theta})$ with respect to $\bar{\theta}$ and setting it to zero. (Hint: If you choose to express the closed-form solution in matrix form, remember that $\nabla_{\bar{x}}(\bar{y}^T \bar{x}) = \nabla_{\bar{x}}\left((\bar{y}^T \bar{x})^T\right)$ since scalars are symmetric matrices.)
(c) (1pt) In no more than three sentences, describe an example in which weighted linear regression might
be more useful than linear regression, and briefly explain your reasoning.
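Once you have a candidate W for part (a) and a candidate closed-form solution for part (b), a quick numerical check on random data is often worthwhile. In the sketch below, build_W is a hypothetical placeholder for whatever construction you derive; it is not provided anywhere.

    import numpy as np

    np.random.seed(0)
    n, d = 20, 3
    X = np.random.randn(n, d)                # rows are x^(i)
    y = np.random.randn(n)
    w = np.random.uniform(0.1, 2.0, size=n)  # example weights w^(i)

    def J(theta):
        # Cost in summation form, straight from the problem statement
        return 0.5 * np.sum(w * (X @ theta - y) ** 2)

    # Compare J(theta) against your matrix form (X theta - y)^T W (X theta - y):
    #   W = build_W(w)                       # hypothetical helper you write
    #   theta = np.random.randn(d)
    #   assert np.isclose(J(theta), (X @ theta - y) @ W @ (X @ theta - y))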
References
We are using Python 3.7 for this course (you can find references for the standard library here: https://docs.python.org/3/library/). To make your life easier, we require you to install the latest Anaconda for Python 3.7.x (https://www.anaconda.com/download/). This is a Python package manager that includes most of the modules you need for this course.
We will make use of the following packages extensively:
• numpy (https://docs.scipy.org/doc/numpy-1.15.0/user/quickstart.html)
• matplotlib (http://matplotlib.org/users/pyplot_tutorial.html)
• scikit-learn (http://scikit-learn.org/stable/documentation.html)
For this homework, you only need to familiarize yourself with numpy and matplotlib. You can check
out the materials from the review session on Python and numpy.
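As a quick check that your Anaconda environment has what this homework needs, something like the following should run without errors (the printed version numbers will vary):

    import numpy as np
    import matplotlib
    import sklearn

    # Print versions to confirm the packages are importable
    print(np.__version__, matplotlib.__version__, sklearn.__version__)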
REMEMBER to submit your completed assignment to Gradescope by 11:59pm ET on Tuesday,
Sept. 24. Include all code as an appendix following your write-up.