HOMEWORK 2
CSC2515
1. Bias and Variance Decomposition for the $\ell_2$-regularized Mean Estimator – 25pts.
This exercise helps you become more comfortable with the bias and variance calculations and
decomposition. It focuses on the simplified setting of the mean estimator.
Consider a r.v. $Y$ with an unknown distribution $p$. This random variable has an (unknown) mean $\mu = \mathbb{E}[Y]$ and variance $\sigma^2 = \mathrm{Var}[Y] = \mathbb{E}\left[(Y - \mu)^2\right]$. Consider a dataset $D = \{Y_1, \dots, Y_n\}$ with independently sampled $Y_i \sim p$.
(a) [3pt] Show that the sample average estimator $h_{\mathrm{avg}} = \frac{1}{n} \sum_{i=1}^{n} Y_i$ is the solution of the following optimization problem:
\[
h_{\mathrm{avg}}(D) \leftarrow \operatorname*{argmin}_{m \in \mathbb{R}} \frac{1}{n} \sum_{i=1}^{n} |Y_i - m|^2 .
\]
Recall that we have the following bias-variance decomposition for the mean estimator $h(D)$:
\[
\mathbb{E}_D\left[ |h(D) - \mu|^2 \right] = \underbrace{\left| \mathbb{E}_D[h(D)] - \mu \right|^2}_{\text{bias}} + \underbrace{\mathbb{E}_D\left[ \left| h(D) - \mathbb{E}_D[h(D)] \right|^2 \right]}_{\text{variance}} .
\]
(b) [2pt] Compute the bias and variance of $h_{\mathrm{avg}}(D)$.
(c) [5pt] Consider the $\ell_2$-regularized mean estimator $h_\lambda(D)$, defined for any $\lambda \geq 0$ by
\[
h_\lambda(D) \leftarrow \operatorname*{argmin}_{m \in \mathbb{R}} \frac{1}{n} \sum_{i=1}^{n} |Y_i - m|^2 + \lambda |m|^2 .
\]
Notice that this is similar to ridge regression, but for a single random variable. Provide an explicit formula for the estimator $h_\lambda(D)$ in terms of the sample average and $\lambda$.
(d) [6pt] Compute the bias and variance of this regularized estimator.
(e) [5pt] Visualize $\mathbb{E}_D\left[ |h_\lambda(D) - \mu|^2 \right]$, the bias, and the variance terms for a range of $\lambda$. As a starting point, you can choose $\mu = 1$, $\sigma^2 = 9$, and $n = 10$.
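As an optional starting sketch, the following Monte Carlo simulation estimates all three curves. It assumes, purely for illustration, that $p$ is Gaussian, solves the part (c) objective numerically with scipy so as not to presuppose your closed-form answer, and uses an arbitrary $\lambda$ grid and trial count.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(0)
    mu, sigma2, n, trials = 1.0, 9.0, 10, 1000
    lambdas = np.linspace(0.0, 2.0, 30)

    def h_lambda(Y, lam):
        # Minimize the part (c) objective numerically; substitute your
        # closed-form expression here once you have derived it.
        return minimize_scalar(lambda m: np.mean((Y - m) ** 2) + lam * m ** 2).x

    bias, var, err = [], [], []
    for lam in lambdas:
        ests = np.array([h_lambda(rng.normal(mu, np.sqrt(sigma2), n), lam)
                         for _ in range(trials)])
        bias.append((ests.mean() - mu) ** 2)   # |E[h] - mu|^2
        var.append(ests.var())                 # E[|h - E[h]|^2]
        err.append(np.mean((ests - mu) ** 2))  # expected squared error

    plt.plot(lambdas, bias, label="bias")
    plt.plot(lambdas, var, label="variance")
    plt.plot(lambdas, err, label="expected squared error")
    plt.xlabel("lambda")
    plt.legend()
    plt.show()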
(f) [4pt] Explain the visualization from the previous part, in terms of the effect of bias and
variance on the expected squared error.
2. Different Flavours of Regression – 20 pts. We have seen in the lecture how to solve the regression problem with linear models and the squared error loss. In this question, you
learn two other regression methods: Poisson regression and locally weighted regression. You will
derive the associated loss functions and solutions for these methods.
Poisson regression: Recall that the squared error has a probabilistic justification. If we assume that the target values are contaminated with zero-mean Gaussian noise, the maximum likelihood estimate coincides with the minimizer of the least squares loss.
We may use probabilistic models other than the Gaussian distribution to model the targets.
One such choice is the Poisson distribution, which is a discrete probability distribution that can
be used to model the number of times a certain event has happened. The probability mass function
of a random variable $Z$ with a Poisson distribution with rate $\lambda \in (0, \infty)$ is
\[
\mathbb{P}\{Z = k; \lambda\} = p(k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!} . \tag{2.1}
\]
This is the probability that the number of events $Z$ is $k \in \{0, 1, \dots\}$. It can be shown that the expected value and the variance of this random variable are both $\lambda$, that is, $\mathbb{E}[Z] = \lambda$ and $\mathrm{Var}[Z] = \lambda$.
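A quick numerical illustration of this property (the rate and sample size here are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    lam = 4.0  # arbitrary rate for illustration
    samples = rng.poisson(lam, size=100_000)
    print(samples.mean(), samples.var())  # both should be close to 4.0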
The Poisson distribution can be used to model the number of events in a given amount of time
or space. Some examples are the number of calls to a call centre in an hour, network packages
arriving at a router in a minute, meteorites hitting Earth each year, goals or scores in a soccer or
hockey match, bikes used at a bicycle sharing facility per hour, and “soldiers killed by horse-kicks
each year in the Prussian cavalry”.
We may use the Poisson distribution to model a regression problem when the target has a count interpretation. In that case, instead of having a fixed rate $\lambda \in (0, \infty)$, the rate $\lambda$ is a function of the input, which means that $\lambda : \mathcal{X} \to (0, \infty)$. This modelling might be more natural for some problems than using a zero-mean Gaussian noise contamination.
(a) [5pt] Suppose that we are given a dataset $\{Z_1, \dots, Z_N\}$, all independently drawn from a Poisson distribution (2.1) with unknown parameter $\lambda$. Derive the maximum likelihood estimate of $\lambda$. Show the individual steps and note where you use the data assumptions (independently and identically distributed).
Hint: Use the standard approach to derive the MLE via $\operatorname*{argmax}_{\lambda} \log \prod_i p(Z_i; \lambda)$.
(b) [5pt] The Poisson regression model is based on considering the $\lambda$ parameter of the Poisson distribution to be a function of the input $x$ and a weight vector $w$, which is to be determined. The relation is specified by
\[
\log \lambda(x; w) = w^\top x .
\]
Here the logarithm of the rate has a linear model, as opposed to the rate itself. This ensures that the rate is always positive.
This rate function leads to the following statistical model for the target $y \in \{0, 1, \dots\}$ conditioned on the input $x$:
\[
p(y \mid x; w) = \frac{\exp(y \, w^\top x) \cdot \exp(-e^{w^\top x})}{y!} .
\]
Derive the maximum likelihood objective of this model given a dataset consisting of i.i.d. input and response variables $\{(x^{(1)}, y^{(1)}), \dots, (x^{(N)}, y^{(N)})\}$. Note that the resulting estimator does not have a closed-form solution, so you only have to simplify the resulting loss function as far as possible.
Locally Weighted Regression is a non-parametric algorithm: the model does not learn a fixed set of parameters, as is done in regression with a linear model. Rather, parameters are computed individually for each query point $x$. The next two questions help you derive locally weighted regression.
(c) [5pt] The weighted least squares cost uses positive weights $a^{(1)}, \dots, a^{(N)}$, one per data point, to construct a parameter estimate of the form
\[
w^* \leftarrow \operatorname*{argmin}_{w} \frac{1}{2} \sum_{i=1}^{N} a^{(i)} \left( y^{(i)} - w^\top x^{(i)} \right)^2 .
\]
Show that the solution to this optimization problem is given by the formula
\[
w^* = \left( X^\top A X \right)^{-1} X^\top A y,
\]
where $X$ is the design matrix (defined in class) and $A$ is a diagonal matrix with $A_{ii} = a^{(i)}$.
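For reference, the stated formula transcribes directly into numpy; this sketch (with a function name of our choosing) uses a linear solve rather than an explicit inverse for numerical stability:

    import numpy as np

    def weighted_least_squares(X, y, a):
        # w* = (X^T A X)^{-1} X^T A y with A = diag(a); a linear solve is
        # used instead of an explicit matrix inverse for numerical stability.
        A = np.diag(a)
        return np.linalg.solve(X.T @ A @ X, X.T @ A @ y)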
(d) [5pt] Locally weighted least squares combines ideas from k-NN and linear regression. For each query $x$, we first compute distance-based weights for each training example:
\[
a^{(i)}(x) = \frac{\exp\left( -\frac{\| x - x^{(i)} \|^2}{2\tau^2} \right)}{\sum_{j=1}^{N} \exp\left( -\frac{\| x - x^{(j)} \|^2}{2\tau^2} \right)}
\]
for some temperature parameter $\tau > 0$. We then construct a local solution
\[
w^*(x) \leftarrow \operatorname*{argmin}_{w} \frac{1}{2} \sum_{i=1}^{N} a^{(i)}(x) \left( y^{(i)} - w^\top x^{(i)} \right)^2 .
\]
The prediction then takes the form $\hat{y} = w^*(x)^\top x$. How does this algorithm behave as the temperature $\tau \to 0$? What happens as $\tau \to \infty$? What is the disadvantage of this approach compared to least squares regression in terms of computational complexity?
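As a sketch, the weights above can be computed stably as a softmax over negative scaled squared distances (function name ours; logsumexp guards against underflow for small $\tau$):

    import numpy as np
    from scipy.special import logsumexp

    def lwr_weights(x, X_train, tau):
        # a^(i)(x) is a softmax over negative scaled squared distances;
        # subtracting logsumexp keeps the exponentials stable for small tau.
        logits = -np.sum((X_train - x) ** 2, axis=1) / (2 * tau ** 2)
        return np.exp(logits - logsumexp(logits))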
3. Implementing Regression Methods in Python – 55 pts. This question will take
you step-by-step through implementing and empirically studying several regression methods on
the Capital Bikesharing dataset. We are interested in predicting the number of used bikes
on a per-hour basis based on calendar features (e.g., time, weekday, holiday) and environmental
measurements (e.g., humidity, temperature).
We ask that you provide us with both a Jupyter notebook source file and an exported PDF file of the notebook. Your code needs to be functional once downloaded. Non-functional code
will result in losing all marks for this question. If your code is non-modular or otherwise difficult
to understand, you risk losing a significant portion of the marks, even if the code is functional.
Environment setup: For this question you are strongly encouraged to use the following Python
packages:
• sklearn
• matplotlib
• numpy
• pandas
• autograd
Note that you may NOT use the sklearn implementations for least squares regression, Poisson
regression, and locally weighted least squares regression since that would trivialize the exercise.
You may however use your code from the gradient descent tutorial.
3.1. Initial data analysis – 15 pts.
(a) [0pt] Load the Capital Bikesharing Dataset from the downloaded hour.csv file as a pandas
dataframe.
(b) [2pt] Describe and summarize the data in terms of the number of data points, dimensions, and data types used.
(c) [5pt] Present a single grid containing plots for each feature against the target. Choose the
appropriate axis for dependent vs. independent variables.
Hint: use the pyplot.tight_layout function to make your grid readable.
(d) [5pt] Perform a correlation analysis on the data and plot the correlation matrix as a colored
image. State which feature is the most positively, most negatively, and least correlated with
the target column cnt.
(e) [1pt] Drop the following columns from the dataframe: instant, atemp, registered, casual,
dteday.
(f) [2pt] Shuffle the dataframe’s rows using sklearn.utils.shuffle with random_state=0. Split the data into a training set and a test set at index 10000 (see the sketch after this list).
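A minimal sketch of steps (a), (e), and (f), assuming hour.csv has been downloaded next to the notebook:

    import pandas as pd
    from sklearn.utils import shuffle

    df = pd.read_csv("hour.csv")  # path assumed relative to the notebook
    df = df.drop(columns=["instant", "atemp", "registered", "casual", "dteday"])
    df = shuffle(df, random_state=0)
    train, test = df.iloc[:10000], df.iloc[10000:]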
3.2. Regression implementations – 40 pts.
(a) [5pt] Implement the ordinary least squares regression algorithm as discussed in class. You
are free to implement the closed form solution.
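For instance, a minimal closed-form sketch via the normal equations (assuming the design matrix already includes a bias column):

    import numpy as np

    def fit_ols(X, y):
        # Normal equations: w = (X^T X)^{-1} X^T y, via a linear solve.
        return np.linalg.solve(X.T @ X, X.T @ y)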
(b) [3pt] Fit the ordinary least squares regression model to the pre-processed data. Report the coefficient of determination, also known as the $R^2$ score, of your model. The $R^2$ score between the correct target labels $y$ and the predicted target labels $\hat{y}$ can be calculated as:
\[
R^2(y, \hat{y}) = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} \left( y_i - \frac{1}{n} \sum_{j=1}^{n} y_j \right)^2}
\]
Hint: you may use sklearn.metrics.r2_score.
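Equivalently, a small hand-rolled version of this score that can be used to cross-check the sklearn result:

    import numpy as np

    def r2_score_manual(y, y_hat):
        # 1 - (residual sum of squares) / (total sum of squares)
        ss_res = np.sum((y - y_hat) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return 1.0 - ss_res / ss_tot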
(c) [5pt] You will find that the fit of the model is not very good. This is in part due to the fact that the dataset contains categorical input variables. So far, your implementation uses these categorical features as continuous inputs, which hurts performance. Instead, you are advised to explicitly encode these input variables as categorical variables. Recall that one way of dealing with categorical variables is to replace them with 1-hot-encoded vectors. The following columns in the dataframe are known to be categorical: season, mnth, hr, weekday, weathersit. Substitute these features with 1-hot-encoded vectors in the dataframe. Use this dataframe for all upcoming questions. Make sure that you split the dataframe as described in Question 3.1 (f).
Hint: use pandas.get_dummies.
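A minimal sketch of this substitution, wrapped in a hypothetical helper:

    import pandas as pd

    def one_hot_encode(df):
        # Replace the known categorical columns with 1-hot indicator columns.
        cat_cols = ["season", "mnth", "hr", "weekday", "weathersit"]
        return pd.get_dummies(df, columns=cat_cols)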
(d) [2pt] Re-fit the model with the new data and report the updated $R^2$ score.
(e) [5pt] Implement the locally weighted regression algorithm as described in Question 2.
(f) [5pt] Fit the locally weighted regression model to the data with $\tau = 1$. Report the $R^2$ score of your model. Verify whether the expected behaviour for $\tau \to 0$ and $\tau \to \infty$ described in your answer to Question 2 holds, and describe what you observe.
For this question, you may use a reduced test set consisting of the first 200 samples from the full test set. If you do so, please mark this clearly in your solution.
(g) [2pt] Plot a histogram of the target variable. What distribution does the target follow?
(h) [5pt] Implement the Poisson regression algorithm as described in Question 2.
Since the maximum likelihood estimate is not solvable in closed form, you will need to implement gradient descent. You may use autograd to compute the gradient of the loss function, but implement the gradient descent procedure by hand, as presented in the tutorial.
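A skeleton for such a hand-rolled loop is sketched below; loss stands for the objective you derive in Question 2(b) (written with autograd.numpy operations), and the learning rate and iteration count are placeholders you will need to tune:

    import autograd.numpy as anp  # write your loss with anp operations
    from autograd import grad

    def gradient_descent(loss, w0, lr=1e-3, n_iters=5000):
        # Plain gradient descent; autograd supplies the gradient function.
        w, dloss = w0, grad(loss)
        for _ in range(n_iters):
            w = w - lr * dloss(w)
        return w

    # Toy usage: minimizes a quadratic, converging towards the zero vector.
    w_hat = gradient_descent(lambda w: anp.sum(w ** 2), anp.ones(3))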
(i) [3pt] Fit the Poisson model to the data. Report the fraction of explained Tweedie deviance, also known as the D score, of your model. The D score between the correct target labels $y$ and the predicted target labels $\hat{y}$ can be calculated as:
\[
D(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} 2 \left( y_i \log(y_i / \hat{y}_i) + \hat{y}_i - y_i \right)
\]
Hint: you may use sklearn.metrics.d2_tweedie_score with power=1.
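A direct transcription of this formula (assuming strictly positive predictions and the usual convention that the log term vanishes when $y_i = 0$):

    import numpy as np

    def d_score(y, y_hat):
        # Direct transcription of the formula above; assumes y_hat > 0 and
        # treats y_i * log(y_i / y_hat_i) as 0 when y_i = 0.
        with np.errstate(divide="ignore", invalid="ignore"):
            log_term = np.where(y > 0, y * np.log(y / y_hat), 0.0)
        return np.mean(2.0 * (log_term + y_hat - y))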
(j) [5pt] For Linear Regression and Poisson Regression, report the final weights. Which are the most and least significant features in each model? Justify your answer. If the feature is categorical, report both the feature name and the 1-hot index. Write a short explanation of why visualizing weight contributions is not meaningful for locally weighted linear regression.