$30
Problem Set 2: Perceptron and Regression
1 Perceptron [10 pts] (Write your answer in the quiz at CCLE)
1.1 [5 pts]
Design (specify w for) a two-input perceptron (with an additional bias or offset term) that computes
the following boolean functions. Assume T(rue) = 1 and F(alse) = −1.
[AND]
x1 x2 y
−1 −1 −1
−1 +1 −1
+1 −1 −1
+1 +1 +1
(a) w = (−1, 1, 1)
(b) w = (−1, 0.5, 0.6)
(c) Both (a) and (b)
(d) No solution is possible
1.2 [5 pts]
Design (specify w for) a two-input perceptron (with an additional bias or offset term) that computes
the following boolean functions. Assume T(rue) = 1 and F(alse) = −1.
[XOR]
x1 x2 y
−1 −1 −1
−1 +1 +1
+1 −1 +1
+1 +1 −1
(a) w = (−1, 1, 1)
(b) w = (−1, 0.5, 0.6)
(c) Both (a) and (b)
(d) No solution is possible
2 Logistic Regression [10 pts] (Write your answer in the quiz at
CCLE)
Consider the objective function that we minimize in logistic regression:
J(w) = −
X
N
n=1
yn log σ
wTxn
?
+ (1 − yn) log
1 − σ
wTxn
??
Find the partial derivatives ∂J
∂wj
. (Note, wj is the j-th element of the weight vector w.
(a) PN
n=1
σ
wTxn
?
− yn
?
xn,j
(b) PN
n=1 (J(w) − yn) xn
(c) PN
n=1
log σ
wTxn
?
− yn
?
xn,j
(d) PN
n=1 (1 − yn) log
1 − σ
wTxn
?
)
?
x
3 Understanding Linear Separability [30 pts]
Definition 1 (Linear Program) A linear program can be stated as follows:
Let A be an m × n real-valued matrix, ~b ∈ R
m, and ~c ∈ R
n
. Find a ~t ∈ R
n
, that minimizes the
linear function
z(~t) = ~c T~t
subject to A~t ≥ ~b
In the linear programming terminology, ~c is often referred to as the cost vector and z(~t) is referred
to as the objective function.
1 We can use this framework to define the problem of learning a linear
discriminant function.2
The Learning Problem:3 Let ~x1, ~x2, . . . , ~xm represent m samples, where each sample ~xi ∈ R
n
is an n-dimensional vector, and ~y ∈ {−1, 1}
m is an m × 1 vector representing the respective labels
of each of the m samples. Let ~w ∈ R
n be an n × 1 vector representing the weights of the linear
discriminant function, and θ be the threshold value.
We predict ~xi to be a positive example if ~w
T ~xi + θ ≥ 0. On the other hand, we predict ~xi to be a
negative example if ~w
T ~xi + θ < 0.
We hope that the learned linear function can separate the data set. That is,
yi =
(
1 if ~w
T ~xi + θ ≥ 0
−1 if ~w
T ~xi + θ < 0.
(1)
In order to find a good linear separator, we propose the following linear program:
min δ
subject to yi( ~w
T
~xi + θ) ≥ 1 − δ, ∀(~xi
, yi) ∈ D
δ ≥ 0 (2)
where D is the data set of all training examples.
1Note that you don’t need to really know how to solve a linear program since we will use Matlab to obtain the
solution (although you may wish to brush up on Linear Algebra). Also see the appendix for more information on
linear programming.
2This discussion closely parallels the linear programming representation found in Pattern Classification, by Duda,
Hart, and Stork.
3Note that the notation used in the Learning Problem is unrelated to the one used in the Linear Program
definition. You may want to figure out the correspondence.
4
(a) (15 pts) A data set D = {(~xi
, yi)}
m
i=1 that satisfies condition (1) above is called linearly
separable. Show that there is an optimal solution with δ = 0, then D is linearly separable.
(b) (5 pts) What can we say about the linear separability of the data set if there exists a
hyperplane that satisfies condition (2) with δ 0?
(c) (5 pts) An alternative LP formulation to (2) may be
min δ
subject to yi( ~w
T
~xi + θ) ≥ −δ, ∀(~xi
, yi) ∈ D
δ ≥ 0
Find the optimal solution to this formulation (independent of D) to illustrate the issue with
such a formulation.
5
(d) (5 pts) Let ~x1 ∈ R
n
, ~x1
T =
1 1 1
and y1 = 1. Let ~x2 ∈ R
n
, ~x2
T =
−1 −1 −1
and
y2 = −1. The data set D is defined as
D = {( ~x1, y1),( ~x2, y2)}.
Consider the formulation in (2) applied to D. What are possible optimal solutions?
6
4 Implementation: Polynomial Regression [50 pts]
In this exercise, you will work through linear and polynomial regression. Our data consists of
inputs xn ∈ R and outputs yn ∈ R, n ∈ {1, . . . , N}, which are related through a target function
y = f(x). Your goal is to learn a linear predictor hθ(x) that best approximates f(x). But this time,
rather than using scikit-learn, we will further open the “black-box”, and you will implement the
regression model!
code and data
• code : regression.py
• data : regression_train.csv, regression_test.csv
This is likely the first time that many of you are working with numpy and matrix operations within
a programming environment. For the uninitiated, you may find it useful to work through a numpy
tutorial first.4 Here are some things to keep in mind as you complete this problem:
• If you are seeing many errors at runtime, inspect your matrix operations to make sure that
you are adding and multiplying matrices of compatible dimensions. Printing the dimensions
of variables with the X.shape command will help you debug.
• When working with numpy arrays, remember that numpy interprets the * operator as elementwise multiplication. This is a common source of size incompatibility errors. If you want
matrix multiplication, you need to use the dot function in Python. For example, A*B does
element-wise multiplication while dot(A,B) does a matrix multiply.
• Be careful when handling numpy vectors (rank-1 arrays): the vector shapes 1 × N, N ×
1, and N are all different things. For these dimensions, we follow the the conventions of
scikit-learn’s LinearRegression class5
. Most importantly, unless otherwise indicated (in
the code documentation), both column and row vectors are rank-1 arrays of shape N, not
rank-2 arrays of shape N × 1 or shape 1 × N.
Visualization [5 pts]
As we learned last week, it is often useful to understand the data through visualizations. For this
data set, you can use a scatter plot to visualize the data since it has only two properties to plot (x
and y).
(a) (5 pts) Visualize the training and test data using the plot_data(...) function. What do
you observe? For example, can you make an educated guess on the effectiveness of linear
regression in predicting the data?
4Try out SciPy’s tutorial (http://wiki.scipy.org/Tentative_NumPy_Tutorial), or use your favorite search engine to find an alternative. Those familiar with Matlab may find the “Numpy for Matlab Users” documentation
(http://wiki.scipy.org/NumPy_for_Matlab_Users) more helpful.
5
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
7
Linear Regression [28 pts]
Recall that linear regression attempts to minimize the objective function
J(w) = X
N
n=1
(hw(xn) − yn)
2
.
In this problem, we will use the matrix-vector form where
y =
y1
y2
.
.
.
yN
, X =
x
T
1
x
T
2
.
.
.
x
T
N
, w =
w0
w1
w2
.
.
.
wD
and each instance xn =
1, xn,1, . . . , xn,D?T
.
In this instance, the number of input features D = 1.
Rather than working with this fully generalized, multivariate case, let us start by considering a
simple linear regression model:
hw(x) = wTx = w0 + w1x1
regression.py contains the skeleton code for the class PolynomialRegression. Objects of this
class can be instantiated as model = PolynomialRegression (m) where m is the degree of the
polynomial feature vector where the feature vector for instance n,
1, xn,1, x2
n,1
, . . . , xm
n,1
?T
. Setting
m = 1 instantiates an object where the feature vector for instance n,
1, xn,1
?T
.
(b) (2 pts) Note that to take into account the intercept term (w0), we can add an additional
“feature” to each instance and set it to one, e.g. xi,0 = 1. This is equivalent to adding an
additional first column to X and setting it to all ones.
Modify PolynomialRegression.generate_polynomial_features(...) to create the matrix
X for a simple linear model.
(c) (2 pts) Before tackling the harder problem of training the regression model, complete
PolynomialRegression.predict(...) to predict y from X and w.
(d) (12 pts) One way to solve linear regression is through gradient descent (GD).
Recall that the parameters of our model are the wj values. These are the values we will adjust
to minimize J(w). In gradient descent, each iteration performs the update
wj ← wj − 2α
X
N
n=1
(hw(xn) − yn) xn,j (simultaneously update wj for all j).
With each step of gradient descent, we expect our updated parameters wj to come closer to
the parameters that will achieve the lowest value of J(w).
• (4 pts) As we perform gradient descent, it is helpful to monitor the convergence by computing the cost, i.e., the value of the objective function J. Complete PolynomialRegression.cost(...)
to calculate J(w).
If you have implemented everything correctly, then the following code snippet should
return 40.234.
train_data = load_data('regression_train.csv')
model = PolynomialRegression()
model.coef_ = np.zeros(2)
model.cost(train_data.X, train_data.y)
• (3 pts) Next, implement the gradient descent step in PolynomialRegression.fit_GD(...).
The loop structure has been written for you, and you only need to supply the updates
to w and the new predictions ˆy = hw(x) within each iteration.
We will use the following specifications for the gradient descent algorithm:
– We run the algorithm for 10, 000 iterations.
– We terminate the algorithm ealier if the value of the objective function is unchanged
across consecutive iterations.
– We will use a fixed step size.
• (5 pts) So far, you have used a default learning rate (or step size) of η = 0.01. Try
different η = 10−4
, 10−3
, 10−2
, 0.0407, and make a table of the coefficients, number of
iterations until convergence (this number will be 10, 000 if the algorithm did not converge
in a smaller number of iterations) and the final value of the objective function. How do
the coefficients compare? How quickly does each algorithm converge?
(e) (7 pts) In class, we learned that the closed-form solution to linear regression is
w = (XTX)
−1XT y.
Using this formula, you will get an exact solution in one calculation: there is no “loop until
convergence” like in gradient descent.
• (2 pts) Implement the closed-form solution PolynomialRegression.fit(...).
• (5 pts) What is the closed-form solution? How do the coefficients and the cost compare
to those obtained by GD? How quickly does the algorithm run compared to GD?
(f) (5 pts) Finally, set a learning rate η for GD that is a function of k (the number of iterations)
(use ηk =
1
1+k
) and converges to the same solution yielded by the closed-form optimization
(minus possible rounding errors). Update PolynomialRegression.fit_GD(...) with your
proposed learning rate. How long does it take the algorithm to converge with your proposed
learning rate?
Polynomial Regression[17 pts]
Now let us consider the more complicated case of polynomial regression, where our hypothesis is
hw(x) = wT φ(x) = w0 + w1x + w2x
2 + . . . + w
mx
m.
9
(g) (2 pts) Recall that polynomial regression can be considered as an extension of linear regression in which we replace our input matrix X with
Φ =
φ(x1)
T
φ(x2)
T
.
.
.
φ(xN )
T
,
where φ(x) is a function such that φj (x) = x
j
for j = 0, . . . , m.
Update PolynomialRegression.generate_polynomial_features(...) to create an m + 1
dimensional feature vector for each instance.
(h) (5 pts) Given N training instances, it is always possible to obtain a “perfect fit” (a fit in
which all the data points are exactly predicted) by setting the degree of the regression to
N − 1. Of course, we would expect such a fit to generalize poorly. In the remainder of
this problem, you will investigate the problem of overfitting as a function of the degree of
the polynomial, m. To measure overfitting, we will use the Root-Mean-Square (RMS) error,
defined as
ERMS =
p
J(w)/N,
where N is the number of instances.6
Why do you think we might prefer RMSE as a metric over J(w)?
Implement PolynomialRegression.rms_error(...).
(i) (10 pts) For m = 0, . . . , 10, use the closed-form solver to determine the best-fit polynomial
regression model on the training data, and with this model, calculate the RMSE on both the
training data and the test data. Generate a plot depicting how RMSE varies with model
complexity (polynomial degree) – you should generate a single plot with both training and
test error, and include this plot in your writeup. Which degree polynomial would you say
best fits the data? Was there evidence of under/overfitting the data? Use your plot to justify
your answer.
6Note that the RMSE as defined is a biased estimator. To obtain an unbiased estimator, we would have to divide
by n − k, where k is the number of parameters fitted (including the constant), so here, k = m + 1.
10