ECE M148 Homework 3
Introduction to Data Science
You can access Gradescope directly or via the link provided on BruinLearn.
You may type your homework or scan your handwritten version. Make sure all
the work is legible.
1. Assume you have a dataset D with n samples. You want to create bootstrapped
datasets of size k using sampling with replacement.
(a) Assume you create one bootstrapped dataset of size k. Additionally, assume that
we fix a data point x ∈ D. What is the probability that x does not appear in the
bootstrapped dataset?
(b) Now, assume that k = n. What does this probability converge to as n goes to
infinity? What does this limit imply about the percentage of the original dataset
that will not be sampled as n gets large?
(c) Assume that you create r bootstrapped datasets of size k each. Additionally,
assume that we fix a data point x ∈ D. What is the probability that x does not
appear in any bootstrapped dataset?
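A quick way to sanity-check your answers to (a) and (b) is by simulation. The
following is a minimal sketch, assuming numpy is available; the values of n and
the trial count are illustrative choices, not part of the problem.

    # Estimate the probability that a fixed point x (index 0) does not
    # appear in a single bootstrapped dataset of size k = n.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 100           # dataset size; k = n as in part (b)
    trials = 10_000   # number of bootstrapped datasets to simulate

    misses = 0
    for _ in range(trials):
        sample = rng.integers(0, n, size=n)   # indices drawn with replacement
        if not (sample == 0).any():           # fixed point never drawn
            misses += 1

    print(misses / trials)   # compare against your analytic answer from (a)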
2. In this question, let us consider the difference between lasso and ridge regularization.
Recall that the lasso regularization of a vector β is $\lambda \sum_{i=1}^{k} |\beta_i|$
and that the ridge regularization is $\lambda \sum_{i=1}^{k} \beta_i^2$. Consider two
vectors x1 = [4, 5] and x2 = [−2, 2]. Additionally, set λ = 1.
(a) What is the lasso regularization of x1 and x2? What is the change in the lasso
regularization when going from x1 to x2?
(b) What is the ridge regularization of x1 and x2? What is the change in the ridge
regularization when going from x1 to x2?
(c) In your own words, explain the effects of ridge vs lasso regularization.
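If you want to double-check your hand calculations for parts (a) and (b), both
penalties are one-liners in numpy. This is a sketch, assuming λ = 1 as stated in
the problem; the function names are illustrative.

    # Lasso and ridge penalties of a coefficient vector, with λ = 1.
    import numpy as np

    def lasso_penalty(beta, lam=1.0):
        return lam * np.sum(np.abs(beta))      # λ Σ |β_i|

    def ridge_penalty(beta, lam=1.0):
        return lam * np.sum(np.square(beta))   # λ Σ β_i^2

    x1 = np.array([4.0, 5.0])
    x2 = np.array([-2.0, 2.0])
    print(lasso_penalty(x1), lasso_penalty(x2))
    print(ridge_penalty(x1), ridge_penalty(x2))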
3. Coding Question - Plot the Voronoi regions for k = 1, 2, 3, 4 using the k-nearest
neighbours classifier on the points: [[1, 1], [4, 1], [2, 3], [3, 3], [3, 4], [5, 4], [6, 5], [4,
5]]. The first 4 points are in class 0 and the rest are in class 1. A .ipynb file with
starter code has been provided. Did you find anything curious about
the plots? How do you explain them?
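If you prefer to build the plot from scratch rather than from the starter code,
a minimal sketch of one approach is below, assuming scikit-learn and matplotlib;
the grid bounds and resolution are illustrative. The regions are revealed by
classifying every point of a dense grid (for k = 1 they are unions of Voronoi
cells); repeat with n_neighbors = 2, 3, 4.

    # Decision regions of a k-NN classifier on the given 8 points (k = 1 shown).
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.neighbors import KNeighborsClassifier

    X = np.array([[1, 1], [4, 1], [2, 3], [3, 3], [3, 4], [5, 4], [6, 5], [4, 5]])
    y = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # first 4 in class 0, rest class 1

    clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)

    # Classify a dense grid covering the points to reveal the regions.
    xx, yy = np.meshgrid(np.linspace(0, 7, 300), np.linspace(0, 6, 300))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    plt.contourf(xx, yy, Z, alpha=0.3)
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k")
    plt.title("k-NN decision regions, k = 1")
    plt.show()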
4. Coding Question - Plot the logistic function $\frac{1}{1+e^{-(\beta_0+\beta_1 x)}}$
for x ∈ [−10, 10] and the following parameter values:
(a) β0 = 2 and β1 = 1
(b) β0 = 10 and β1 = 2
(c) β0 = 1 and β1 = 10
(d) β0 = 1 and β1 = 5
For what choices of β0, β1 does the function become steeper?
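A minimal plotting sketch, assuming numpy and matplotlib; it overlays all four
parameter settings on one set of axes so the steepness is easy to compare.

    # Plot the logistic function 1 / (1 + e^{-(β0 + β1 x)}) on [-10, 10].
    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(-10, 10, 500)
    params = [(2, 1), (10, 2), (1, 10), (1, 5)]   # (β0, β1) for (a)-(d)

    for b0, b1 in params:
        plt.plot(x, 1 / (1 + np.exp(-(b0 + b1 * x))), label=f"β0={b0}, β1={b1}")

    plt.xlabel("x")
    plt.ylabel("logistic output")
    plt.legend()
    plt.show()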
5. Recall the problem of ridge linear regression with n points and k features:
$$L_{\text{Ridge}}(\beta) = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \beta^T x_i\right)^2 + \lambda \sum_{j=1}^{k} \beta_j^2$$
where λ is a hyper-parameter. The goal is to minimize L_Ridge(β) in terms of β for a
fixed training dataset (y_i, x_i) and parameter λ.
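As a sanity check on the notation, the loss can be written out directly; the
sketch below assumes numpy, and the variable names are illustrative.

    # Evaluate L_Ridge(β) for data X (n × k), targets y (length n), and λ.
    import numpy as np

    def ridge_loss(beta, X, y, lam):
        residuals = y - X @ beta            # y_i - β^T x_i for every sample
        mse = np.mean(residuals ** 2)       # (1/n) Σ (y_i - β^T x_i)^2
        penalty = lam * np.sum(beta ** 2)   # λ Σ β_j^2
        return mse + penalty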
(a) In your own words, explain the purpose of using ridge regression over standard
linear regression.
(b) As λ gets larger, how will this affect β? What value do we expect β to converge
to?
(c) Consider parameters β_λ that were trained using ridge linear regression with a
specific λ. Let us consider the test MSE using β_λ. Note that the test MSE on the
test data is
$$\frac{1}{n} \sum_{i=1}^{n} \left(y_i - \beta_\lambda^T x_i\right)^2$$
and does not include regularization.
Sketch a plot of how you expect the test MSE to change as a function of λ. Your
sketch should be a smooth curve that shows how the test MSE changes as λ goes
from 0 to ∞. Provide justification for your plot. Assume that the rightmost edge
of the graph is where λ is at ∞. Additionally, assume that the linear regression
without regularization is overfitting.
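One way to check the shape of your sketch empirically is to fit ridge models over
a grid of λ values and record the held-out MSE. The snippet below is a sketch,
assuming scikit-learn; the synthetic data, the λ grid, and the noise level are
illustrative choices. Note that scikit-learn's alpha plays the role of λ only up
to a scaling by n, since its objective does not divide the squared error by n;
the qualitative shape of the curve is unaffected.

    # Test MSE of ridge regression as a function of regularization strength.
    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 20))
    beta_true = rng.normal(size=20)
    y = X @ beta_true + rng.normal(scale=2.0, size=100)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    for lam in [1e-3, 1e-1, 1e0, 1e1, 1e2, 1e3]:
        model = Ridge(alpha=lam).fit(X_tr, y_tr)
        test_mse = np.mean((y_te - model.predict(X_te)) ** 2)
        print(f"alpha = {lam:8.3f}   test MSE = {test_mse:.3f}")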
6. True or False questions. For each statement, decide whether the statement is True or
False and provide justification (full credit requires a correct justification).
(a) In L2 regularization of linear regression, many coefficients will generally be zero.
(b) In leave-one-out cross validation over a data set of size N, we create and
train N/2 models.
(c) A 95% confidence interval refers to the interval where 95% of the training data lies.
(d) If K out of J features have already been selected in Stepwise Variable Selection,
then we will train J − K new models to select the next feature to add.
(e) P(A|B) = P(B|A) if P(A) = P(B) and P(A) is not zero.