CM146
Problem Set 2: Linear Classification and Logistic Regression
Submission Instructions
• Submit your solutions electronically on the course Gradescope site as PDF files.
• If you plan to typeset your solutions, please use the LaTeX solution template. If you must
submit scanned handwritten solutions, please use a black pen on blank white paper and a
high-quality scanner app.
• For questions involving math and derivation, please provide important intermediate steps and
box the final answer clearly.
• You are required to submit code only for Question 3 (Polynomial Regression), via CCLE. For the sub-questions in Question 3 that require you to complete a piece of code, you are expected to copy-paste your code as part of the solution in the submission PDF to receive full credit.
Parts of this assignment are adapted from course material by Andrew Ng (Stanford), Jenna Wiens (UMich) and
Jessica Wu (Harvey Mudd).
1 Linear Model [15 pts]
(a) Design a two-input linear model $w_1 X_1 + w_2 X_2 + b$ that computes the following Boolean
functions. Assume T = 1 and F = 0. If a valid linear model exists, show that it is not unique
by designing another valid linear model. If no such linear model exists, explain why. (A short
sanity-check sketch follows at the end of this problem.)
(i) OR [5 pts]
(ii) XOR [5 pts]
(b) How many distinct Boolean functions are possible with two Boolean variables? Out of them,
how many can be represented by a linear model? Justify your answers. [5 pts]
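For checking a candidate design by hand, a tiny script that evaluates the linear model on all four input combinations can help. This is only an illustrative sketch: the placeholder weights and the strict decision threshold at 0 are assumptions, not part of the problem statement.

import itertools

# Placeholder weights; substitute your own candidate design.
w1, w2, b = 0.0, 0.0, 0.0

for x1, x2 in itertools.product([0, 1], repeat=2):
    score = w1 * x1 + w2 * x2 + b
    # Assumed decision rule: output T (1) when the score is strictly positive.
    prediction = int(score > 0)
    print(f"x1={x1}, x2={x2} -> score={score:+.2f}, prediction={prediction}")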
2 Logistic Regression and its Variant [45 pts]
Consider the logistic regression model for binary classification that takes input features xn ∈ Rm
and predicts yn ∈ {1, 0}. As we learned in class, the logistic regression model fits the probability
P(yn = 1) using the sigmoid function:
$$P(y_n = 1) = h(x_n) = \sigma(w^T x_n + b) = \frac{1}{1 + \exp(-w^T x_n - b)}. \tag{1}$$
Given N training data points, we learn the logistic regression model by minimizing the negative
log-likelihood:
$$J(w, b) = -\sum_{n=1}^{N} \left[ y_n \log h(x_n) + (1 - y_n) \log\left(1 - h(x_n)\right) \right]. \tag{2}$$
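As a concrete reading of Eqs. (1) and (2), the loss can be evaluated directly in numpy. This is only a sketch for intuition; the function and variable names below are illustrative and are not part of the assignment's starter code.

import numpy as np

def sigmoid(z):
    # Sigmoid from Eq. (1).
    return 1.0 / (1.0 + np.exp(-z))

def negative_log_likelihood(w, b, X, y):
    # X: N x m feature matrix, y: labels in {0, 1}; returns J(w, b) from Eq. (2).
    h = sigmoid(X.dot(w) + b)
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))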
(a) In the following, we derive the stochastic gradient descent algorithm for logistic regression.
[20 pts]
i. Compute the partial derivatives $\frac{\partial J}{\partial w_j}$ and $\frac{\partial J}{\partial b}$, where $w_j$ is the $j$-th element of the weight vector $w$. [5 + 5 pts]
ii. Write down the stochastic gradient descent algorithm for minimizing Eq. (2) using the
partial derivatives computed above. [5 pts]
iii. Compare the stochastic gradient descent algorithm for logistic regression with the Perceptron algorithm. What are the similarities and what are the differences? [5 pts]
(b) Instead of using the sigmoid function, we would like to use the following transformation
function:
$$\sigma_A(z) = \frac{\exp(z) - \exp(-z)}{\exp(z) + \exp(-z)}.$$
Answer the following questions [25 pts]:
i. Plot $\sigma_A(z)$ as a function of $z$ in Python using the matplotlib and numpy libraries. Consider
$z \in [-10, 10]$ for the plot. What are the similarities and differences between $\sigma$ and $\sigma_A$?
What happens as $z \to \infty$ and as $z \to -\infty$? (A minimal plotting sketch follows this list of sub-questions.) [5 pts]
ii. Prove the following: [5 pts]
$$\frac{d\sigma_A(z)}{dz} = 1 - \sigma_A^2(z).$$
iii. Can we assume probability P(yn = 1) = σA(wTxn + b)? Why or why not? [2 pts]
iv. Suppose instead that we assume
$$P(y_n = 1) = h_A(x_n) = \frac{1 + \sigma_A(w^T x_n + b)}{2}.$$
Given $N$ examples $\{x_n, y_n\}_{n=1}^{N}$, write down the corresponding negative log-likelihood function $J_A(w, b)$. [3 pts]
v. Compute the partial derivatives $\frac{\partial J_A}{\partial w_j}$ and $\frac{\partial J_A}{\partial b}$. [5 pts]
vi. Write down the stochastic gradient descent algorithm for minimizing JA(w, b). [5 pts]
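For sub-question b.i, a minimal plotting sketch along the following lines should be sufficient; the exact styling is up to you, and the variable names are illustrative.

import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-10, 10, 500)
sigma = 1.0 / (1.0 + np.exp(-z))                               # sigmoid from Eq. (1)
sigma_A = (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))  # transformation above

plt.plot(z, sigma, label="sigma(z)")
plt.plot(z, sigma_A, label="sigma_A(z)")
plt.xlabel("z")
plt.legend()
plt.show()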
3 Implementation: Polynomial Regression [40 pts]
In this exercise, you will work through linear and polynomial regression. Our data consists of
inputs xn ∈ R and outputs yn ∈ R, n ∈ {1, . . . , N}, which are related through a target function
y = f(x). Your goal is to learn a linear predictor hw(x) that best approximates f(x). But this time,
rather than using scikit-learn, we will further open the “black-box”, and you will implement the
regression model!
code and data
• code : Fall2020-CS146-HW2.ipynb
• data : train.csv, test.csv
Please use your @g.ucla.edu email ID to access the code and data. As in HW-1, copy the Colab
notebook to your Drive and make your changes there. Mount the Drive appropriately and copy the shared
data folder to your Drive so you can access it from Colab. For a Colab usage demo, check out the Discussion
recordings for Week 2 on CCLE. The notebook has marked blocks where you need to code.
### ========= TODO : START ========= ###
### ========= TODO : END ========== ###
Note: For the questions requiring you to complete a piece of code, you are expected
to copy-paste your code as part of the solution in the submission PDF. Tip: If you
are using LaTeX, check out the minted package (example) for code highlighting.
This is likely the first time that many of you are working with numpy and matrix operations within
a programming environment. For the uninitiated, you may find it useful to work through a numpy
tutorial first [1]. Here are some things to keep in mind as you complete this problem:
• If you are seeing many errors at runtime, inspect your matrix operations to make sure that
you are adding and multiplying matrices of compatible dimensions. Printing the dimensions
of variables with the X.shape command will help you debug.
• When working with numpy arrays, remember that numpy interprets the * operator as element-wise multiplication. This is a common source of size-incompatibility errors. If you want
matrix multiplication, use numpy's dot function (or the @ operator). For example, A*B does
element-wise multiplication while np.dot(A, B) does a matrix multiply (see the short example after this list).
• Be careful when handling numpy vectors (rank-1 arrays): the vector shapes 1 × N, N ×
1, and N are all different things. For these dimensions, we follow the conventions of
scikit-learn's LinearRegression class [2]. Most importantly, unless otherwise indicated (in
the code documentation), both column and row vectors are rank-1 arrays of shape N, not
rank-2 arrays of shape N × 1 or shape 1 × N.
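The short example below makes the multiplication and shape conventions concrete; it is illustrative only and not part of the starter code.

import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])

elementwise = A * B       # element-wise product, shape (2, 2)
matmul = np.dot(A, B)     # matrix product; A @ B is equivalent

v = np.array([1.0, 2.0])  # rank-1 array of shape (2,), not (2, 1) or (1, 2)
print(elementwise.shape, matmul.shape, v.shape)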
[1] Try out SciPy's tutorial (http://wiki.scipy.org/Tentative_NumPy_Tutorial), or use your favorite search engine to find an alternative. Those familiar with Matlab may find the "NumPy for Matlab Users" documentation (http://wiki.scipy.org/NumPy_for_Matlab_Users) more helpful.
[2] http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

Visualization [2 pts]
It is often useful to understand the data through visualizations. For this data set, you can use a
scatter plot to visualize the data since it has only two properties to plot (x and y).
(a) Visualize the training and test data using the plot_data(...) function. What do you observe? For example, can you make an educated guess on the effectiveness of linear regression
in predicting the data? [2 pts]
Linear Regression [23 pts]
Recall that linear regression attempts to minimize the objective function
$$J(w) = \sum_{n=1}^{N} \left( h_w(x_n) - y_n \right)^2.$$
In this problem, we will use the matrix-vector form where
$$y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}, \qquad X = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{pmatrix}, \qquad w = \begin{pmatrix} w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_D \end{pmatrix},$$
and each instance $x_n = \left(1, x_{n,1}, \ldots, x_{n,D}\right)^T$.
In this instance, the number of input features D = 1.
Rather than working with this fully generalized, multivariate case, let us start by considering a
simple linear regression model:
$$h_w(x) = w^T x = w_0 + w_1 x_1.$$
regression.py contains the skeleton code for the class PolynomialRegression. Objects of this
class can be instantiated as model = PolynomialRegression(m), where m is the degree of the
polynomial feature vector, so that the feature vector for instance $n$ is $\left(1, x_{n,1}, x_{n,1}^2, \ldots, x_{n,1}^m\right)^T$. Setting
m = 1 instantiates an object whose feature vector for instance $n$ is $\left(1, x_{n,1}\right)^T$.
(b) Note that to take into account the intercept term ($w_0$), we can add an additional "feature"
to each instance and set it to one, e.g., $x_{i,0} = 1$. This is equivalent to adding an additional
first column to X and setting it to all ones.
Modify PolynomialRegression.generate_polynomial_features(...) to create the matrix
X for a simple linear model. [2 pts]
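As an illustration of the ones-column trick only (a sketch under the assumption that the raw inputs arrive as a rank-1 array; the starter code's actual conventions are documented in its docstrings):

import numpy as np

X = np.array([0.5, 1.2, -0.3])   # rank-1 array of N raw inputs (illustrative values)
N = X.shape[0]

# Prepend a column of ones so that w_0 acts as the intercept term.
Phi = np.hstack([np.ones((N, 1)), X.reshape(N, 1)])
print(Phi.shape)                 # (N, 2): bias feature plus the original input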
(c) Before tackling the harder problem of training the regression model, complete
PolynomialRegression.predict(...) to predict y from X and w. [3 pts]
(d) One way to solve linear regression is through gradient descent (GD).
Recall that the parameters of our model are the wj values. These are the values we will adjust
to minimize J(w).
$$J(w) = \sum_{n=1}^{N} \left( h_w(x_n) - y_n \right)^2.$$
In gradient descent, each iteration performs the update
$$w_j \leftarrow w_j - 2\eta \sum_{n=1}^{N} \left( h_w(x_n) - y_n \right) x_{n,j} \qquad \text{(simultaneously update } w_j \text{ for all } j\text{)}.$$
With each step of gradient descent, we expect our updated parameters $w_j$ to come closer to
the parameters that achieve the lowest value of $J(w)$. (A vectorized numpy sketch of this update appears at the end of this part.) [10 pts]
• (2 pts) As we perform gradient descent, it is helpful to monitor the convergence by computing the cost, i.e., the value of the objective function J. Complete PolynomialRegression.cost(...)
to calculate J(w).
If you have implemented everything correctly, then the following code snippet should
print the model cost as 230.867214.
model = PolynomialRegression(1)
model.coef_ = np.zeros(2)
c = model.cost(train_data.X, train_data.y)
print(f'model_cost:{c}')
• (3 pts) Next, implement the gradient descent step in PolynomialRegression.fit_GD(...).
The loop structure has been written for you, and you only need to supply the updates
to w and the new predictions $\hat{y} = h_w(x)$ within each iteration.
We will use the following specifications for the gradient descent algorithm:
– We run the algorithm for 10,000 iterations.
– We terminate the algorithm earlier if the value of the objective function is unchanged
across consecutive iterations.
– We will use a fixed learning rate.
• (5 pts) Experiment with different values of the learning rate $\eta = 10^{-6}, 10^{-5}, 10^{-3}, 0.0168$,
and make a table of the coefficients, the number of iterations until convergence (this number
will be 10,000 if the algorithm did not converge in a smaller number of iterations), and
the final value of the objective function. How do the coefficients compare? How quickly
does each algorithm converge? Do you observe something strange when you run with $\eta = 0.0168$? Explain the observation and its causes.
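In vectorized form, the update above amounts to $w \leftarrow w - 2\eta X^T(Xw - y)$. The sketch below is only a statement of that mathematics with hypothetical names; it is not a drop-in replacement for the loop structure provided in fit_GD(...).

import numpy as np

def gd_step(w, X, y, eta):
    # One gradient-descent update for the squared-error objective J(w).
    # X: N x (D+1) design matrix (with the ones column), w: shape (D+1,), y: shape (N,).
    residual = X.dot(w) - y            # h_w(x_n) - y_n for every n
    return w - 2 * eta * X.T.dot(residual)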
(e) In class, we learned that the closed-form solution to linear regression is
$$w = (X^T X)^{-1} X^T y.$$
Using this formula, you will get an exact solution in one calculation: there is no "loop until
convergence" as in gradient descent. (A brief numpy sketch of this formula follows the bullets below.) [4 pts]
• (2 pts) Implement the closed-form solution PolynomialRegression.fit(...).
• (2 pts) What are the closed-form solution coefficients? How do the coefficients and the
cost compare to those obtained by GD? How quickly does the algorithm run compared
to GD?
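For reference, the normal-equation formula above maps to numpy as follows. This is a sketch with an illustrative function name, not the required fit(...) implementation; whether you form the inverse explicitly, call np.linalg.solve, or use np.linalg.pinv is a design choice left to you.

import numpy as np

def closed_form(X, y):
    # Solve w = (X^T X)^{-1} X^T y without explicitly forming the inverse.
    return np.linalg.solve(X.T.dot(X), X.T.dot(y))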
(f) Finally, set a learning rate η for GD that is a function of k (the number of iterations) (use
$\eta_k = \frac{1}{1+k}$) and converges to the same solution yielded by the closed-form optimization (minus
possible rounding errors). Update PolynomialRegression.fit_GD(...) with your proposed
learning rate. How many iterations does it take the algorithm to converge with your proposed
learning rate? [4 pts]
Polynomial Regression [15 pts]
Now let us consider the more complicated case of polynomial regression, where our hypothesis is
$$h_w(x) = w^T \phi(x) = w_0 + w_1 x + w_2 x^2 + \ldots + w_m x^m.$$
(g) Recall that polynomial regression can be considered as an extension of linear regression in
which we replace our input matrix X with
$$\Phi = \begin{pmatrix} \phi(x_1)^T \\ \phi(x_2)^T \\ \vdots \\ \phi(x_N)^T \end{pmatrix},$$
where $\phi(x)$ is a function such that $\phi_j(x) = x^j$ for $j = 0, \ldots, m$.
Update PolynomialRegression.generate_polynomial_features(...) to create an (m + 1)-dimensional feature vector for each instance. [4 pts]
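One common way to build such a feature matrix is numpy's vander function; the snippet below is only an illustrative sketch with made-up inputs, and a manual loop over the powers works just as well.

import numpy as np

x = np.array([0.5, 1.2, -0.3])               # rank-1 array of raw inputs (illustrative)
m = 3                                        # polynomial degree

Phi = np.vander(x, m + 1, increasing=True)   # columns are x^0, x^1, ..., x^m
print(Phi.shape)                             # (N, m + 1)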
(h) Given N training instances, it is always possible to obtain a “perfect fit” (a fit in which all
the data points are exactly predicted) by setting the degree of the regression to N − 1. Of
course, we would expect such a fit to generalize poorly. In the remainder of this problem, you
will investigate the problem of overfitting as a function of the degree of the polynomial, m.
To measure overfitting, we will use the Root-Mean-Square (RMS) error, defined as
$$E_{RMS} = \sqrt{J(w)/N},$$
where N is the number of instances [3].
Why do you think we might prefer RMSE as a metric over J(w)?
Implement PolynomialRegression.rms_error(...). [4 pts]
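The conversion from the cost to the RMS error is a one-liner; the sketch below assumes you already have the value of $J(w)$ (for example, from your cost method) and N, and the function name is illustrative rather than the required method.

import numpy as np

def rms_error(cost, N):
    # E_RMS = sqrt(J(w) / N)
    return np.sqrt(cost / N)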
(i) For m = 0, . . . , 10 (where m is the order of the polynomial transformation applied to features), use the closed-form solver to determine the best-fit polynomial regression model on
the training data, and with this model, calculate the RMSE on both the training data and the
test data. Generate a plot depicting how RMSE varies with model complexity (polynomial
degree) – you should generate a single plot with both training and test error, and include this
plot in your writeup. Which degree polynomial would you say best fits the data? Was there
evidence of under/overfitting the data? Use your plot to justify your answer. [7 pts]
Submission instructions for programming problems
• Please export the notebook to a .py file by clicking "File" → "Download.py" and upload it
to CCLE.
Your code should be commented appropriately. The most important things:
– Your name and the assignment number should be at the top of each file.
– Each class and method should have an appropriate docstring.
– If anything is complicated, it should include some comments.
There are many possible ways to approach the programming portion of this assignment, which
makes code style and comments very important so that staff can understand what you did.
For this reason, you will lose points for poorly commented or poorly organized code.
• Please submit all the plots and the rest of the solutions (other than code) to Gradescope.
[3] Note that the RMSE as defined is a biased estimator. To obtain an unbiased estimator, we would have to divide by n − k, where k is the number of parameters fitted (including the constant), so here, k = m + 1.