Homework 1: Linear Regression
Introduction
This homework is on different forms of linear regression and focuses on loss functions, optimizers, and regularization. Linear regression will be one of the few models that we see that has an analytical solution. These problems focus on deriving these solutions and exploring their properties. If you find that you are having trouble with the first couple of problems, we recommend going over the fundamentals of linear algebra and matrix calculus. We also encourage you to first read the Bishop textbook, particularly: Section 2.3 (Properties of Gaussian Distributions), Section 3.1 (Linear Basis Regression), and Section 3.3 (Bayesian Linear Regression). (Note that our notation is slightly different, but the underlying mathematics remains the same.) Please type your solutions after the corresponding problems using this \LaTeX{} template, and start each problem on a new page. You will submit your solution PDF and at most one .py file per problem (for those that involve a programming component) to Canvas.
Problem 1 (Priors and Regularization, 15pts) In this problem we consider a model of Bayesian linear regression. Define the prior on the parameters as
\[
p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \tau_w^{-1} \mathbf{I}),
\]
where $\tau_w$ is a scalar precision hyperparameter that controls the variance of the Gaussian prior. Define the likelihood as
\[
p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) = \prod_{i=1}^{n} \mathcal{N}(y_i \mid \mathbf{w}^\top \mathbf{x}_i, \tau_n^{-1}),
\]
where $\tau_n$ is another fixed scalar precision parameter, so the observation noise has variance $\tau_n^{-1}$.
(a) Using the fact that the posterior is the product of the prior and the likelihood (up to a normalization constant), i.e.,
\[
\operatorname*{arg\,max}_{\mathbf{w}} \ \ln p(\mathbf{w} \mid \mathbf{y}, \mathbf{X}) = \operatorname*{arg\,max}_{\mathbf{w}} \ \big[ \ln p(\mathbf{w}) + \ln p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) \big],
\]
show that maximizing the log posterior is equivalent to minimizing a regularized loss function given by $L(\mathbf{w}) + \lambda R(\mathbf{w})$, where
\[
L(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{n} (y_i - \mathbf{w}^\top \mathbf{x}_i)^2, \qquad
R(\mathbf{w}) = \frac{1}{2} \mathbf{w}^\top \mathbf{w}.
\]
Do this by writing $\ln p(\mathbf{w} \mid \mathbf{y}, \mathbf{X})$ as a function of $L(\mathbf{w})$ and $R(\mathbf{w})$, dropping constant terms if necessary. Conclude that maximizing this posterior is equivalent to minimizing the regularized error term given by $L(\mathbf{w}) + \lambda R(\mathbf{w})$ for a $\lambda$ expressed in terms of the problem's constants.

(b) Notice that the form of the posterior is the same as the form of the ridge regression loss
\[
L(\mathbf{w}) = (\mathbf{y} - \mathbf{X}\mathbf{w})^\top (\mathbf{y} - \mathbf{X}\mathbf{w}) + \lambda \mathbf{w}^\top \mathbf{w}.
\]
Compute the gradient of the loss above with respect to $\mathbf{w}$. Simplify as much as you can for full credit. Make sure to give your answer in vector form.

(c) Suppose that $\lambda > 0$. Knowing that $L$ is a convex function of its arguments, conclude that a global optimizer of $L(\mathbf{w})$ is
\begin{equation}
\mathbf{w} = (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{y}.
\end{equation}
For this part of the problem, assume that the data has been centered, that is, pre-processed such that $\frac{1}{n} \sum_{i=1}^{n} x_{ij} = 0$ for every feature $j$.

(d) What might happen if the number of weights in $\mathbf{w}$ is greater than the number of data points $N$? How does the regularization help ensure that the inverse in the solution above can be computed?
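As a sanity check on the closed-form solution in part (c), here is a minimal NumPy sketch (not part of any required submission); the function and variable names are illustrative assumptions, not drawn from any provided starter code:

\begin{verbatim}
import numpy as np

def ridge_closed_form(X, y, lam):
    """Closed-form ridge estimate w = (X^T X + lam*I)^{-1} X^T y."""
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)   # X^T X + lam*I
    b = X.T @ y                     # X^T y
    # Solving the linear system is numerically safer than explicit inversion.
    return np.linalg.solve(A, b)

# Illustrative usage on random, centered data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X = X - X.mean(axis=0)              # center the features, as assumed in part (c)
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)
print(ridge_closed_form(X, y, lam=1.0))
\end{verbatim}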
Solution
2. Modeling Changes in Congress
The objective of this problem is to learn about linear regression with basis functions by modeling the number of Republicans in the Senate. The file data/year-sunspots-republicans.csv contains the data you will use for this problem. It has three columns. The first one is an integer that indicates the year. The second is the number of sunspots. The third is the number of Republicans in the Senate. The data file looks like this:
\begin{verbatim}
Year,Sunspot_Count,Republican_Count
1960,112.3,36
1962,37.6,34
1964,10.2,32
1966,47.0,36
1968,105.9,43
1970,104.5,44
\end{verbatim}
and you can see plots of the data in Figures 1 and 2.
Figure 1: Number of Republicans in the Senate. The horizontal axis is the year, and the vertical axis is the number of Republicans.
Figure 2: Number of sunspots by year. The horizontal axis is the year, and the vertical axis is the number of sunspots.
Data Source: http://www.realclimate.org/data/senators_sunspots.txt
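For concreteness, a minimal sketch of loading this file with NumPy, assuming the header and column order shown in the sample above:

\begin{verbatim}
import numpy as np

# Columns: year, sunspot count, Republican count (header row skipped).
data = np.loadtxt("data/year-sunspots-republicans.csv",
                  delimiter=",", skiprows=1)
years, sunspots, republicans = data[:, 0], data[:, 1], data[:, 2]
\end{verbatim}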
Problem 2 (Modeling Changes in Republicans and Sunspots, 15pts) Implement basis function regression with ordinary least squares for years vs. number of Republicans in the Senate. Some sample Python code is provided in linreg.py, which implements linear regression. Plot the data and regression lines for the simple linear case, and for each of the following sets of basis functions (only use (b) for years, skip for sunspots):
(a) $\phi_j(x) = x^j$ for $j = 1, \ldots, 5$
(b) $\phi_j(x) = \exp\left(\frac{-(x - \mu_j)^2}{25}\right)$ for $\mu_j = 1960, 1965, 1970, 1975, \ldots, 2010$
(c) $\phi_j(x) = \cos(x / j)$ for $j = 1, \ldots, 5$
(d) $\phi_j(x) = \cos(x / j)$ for $j = 1, \ldots, 25$
In addition to the plots, provide one or two sentences for each with numerical support, explaining whether you think it is fitting well, overfitting or underfitting. If it does not fit well, provide a sentence explaining why. A good fit should capture the most important trends in the data.
Next, do the same for the number of sunspots vs. number of Republicans, using data only from before 1985. Which bases provide the best fit? Given the quality of the fit, would you believe that the number of sunspots controls the number of Republicans in the Senate?
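As a starting point, here is a minimal sketch of building a design matrix for basis set (a) and fitting it by ordinary least squares. It assumes the arrays years and republicans from the loading sketch above; the helper names are illustrative and are not part of the provided linreg.py:

\begin{verbatim}
import numpy as np

def design_matrix_poly(x, degree=5):
    """Basis (a): columns [1, x, x^2, ..., x^degree] (bias column included)."""
    return np.vander(x, degree + 1, increasing=True)

def fit_ols(Phi, y):
    """Ordinary least squares; lstsq is numerically safer than normal equations."""
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

# Illustrative usage: fit years -> Republican count with basis (a).
# Shifting the years (x = years - 1960) keeps the large powers manageable.
x = years - 1960.0
Phi = design_matrix_poly(x, degree=5)
w = fit_ols(Phi, republicans)
predictions = Phi @ w
\end{verbatim}

The other basis sets can be handled the same way by swapping in a different design-matrix function.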
Solution
Problem 3 (BIC, 15pts) Adapted from Biophysics: Searching for Principles by William Bialek.
Consider data $\mathcal{D} = \{(x_i, y_i)\}$ where we know that $y_i = f(x_i; \mathbf{a}) + \epsilon_i$, where $f(x_i; \mathbf{a})$ is a function with parameters $\mathbf{a}$, and $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ is an additive noise term. We assume that $f$ is a polynomial with coefficients $\mathbf{a}$. Consider the class of all polynomials of degree $K$. Each of these polynomials can be viewed as a generative model of our data, and we seek to choose a model that both explains our current data and is able to generalize to new data. This problem explores the use of the Bayesian Information Criterion (BIC) for model selection. Define the $\chi^2$ (chi-squared) error term for each model of the class as:
\[
\chi^2 = \frac{1}{\sigma^2} \sum_{i=1}^{N} \left( y_i - \sum_{j=0}^{K} a_j x_i^j \right)^2.
\]
Using this notation, a formulation of the BIC can be written as:
\[
-\ln P(\{(x_i, y_i)\} \mid \text{model class}) \approx \frac{N}{2} \ln(2\pi\sigma^2) - \sum_{i=1}^{N} \ln p(x_i) + \frac{1}{2} \chi^2_{\min} + \frac{K+1}{2} \ln N,
\]
where $\chi^2_{\min}(K)$ denotes the minimum value of the $\chi^2$ over the set of polynomial models with $K$ parameters. Finally, assume that $x_i \sim \mathrm{Unif}(-5, 5)$ and that each $a_j \sim \mathrm{Unif}(-1, 1)$. Let $K_{\text{true}} = 10$.

(a) Write code that generates $N$ data points in the following way:
1. Generate a polynomial $f(x) = \sum_{j=0}^{K_{\text{true}}} a_j x^j$.
2. Sample $N$ points $x_i$.
3. Compute $y_i = f(x_i) + \epsilon_i$, where $\epsilon_i$ is sampled from $\mathcal{N}(0, \sigma^2)$ with $\sigma = \frac{\max_i f(x_i) - \min_i f(x_i)}{10}$.

(b) For a set of $y_i$ generated above and a given $K$, write a function that minimizes $\chi^2$ for a polynomial of degree $K$ by solving for $\mathbf{a}$ using numpy's polyfit. Check for $N = 20$ that $\chi^2_{\min}(K)$ is a decreasing function of $K$.
(c) For $N = 20$ samples, run 500 trials. Each trial involves generating a new polynomial and then sampling 20 data points $\{(x_i, y_i)\}$ from that polynomial. For each trial, we can calculate the optimal $K$ by minimizing the BIC. Compute the mean and variance of the optimal $K$ over the 500 trials.

(d) For $N$ ranging from 3 to $3 \cdot 10^4$ on a log scale (you can use \texttt{3*np.logspace(0, 4, 40)} as your values of $N$), compute the mean and variance of the optimal $K$ over 500 trials for each $N$. Plot your results, where the x-axis is the number of samples ($N$) on a log scale, and the y-axis is the mean value of the optimal $K$ with error bars indicating the variance over the 500 trials. Verify that minimizing the BIC controls the complexity of the fit, selecting a nontrivial optimal $K$. You should observe that the optimal $K$ is smaller than $K_{\text{true}}$ for small data sets, and approaches $K_{\text{true}}$ as you analyze larger data sets.
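A minimal sketch of one trial under the setup above, assuming the data-generation recipe of part (a); the helper names are illustrative, \texttt{np.polyfit} is used as suggested in part (b), and BIC terms that do not depend on $K$ are dropped when selecting $K$:

\begin{verbatim}
import numpy as np

K_TRUE = 10

def generate_data(N, rng):
    """Part (a): random degree-10 polynomial, x_i ~ Unif(-5,5), Gaussian noise."""
    a = rng.uniform(-1, 1, size=K_TRUE + 1)   # coefficients a_0, ..., a_10
    x = rng.uniform(-5, 5, size=N)
    f = np.polyval(a[::-1], x)                # polyval expects highest degree first
    sigma = (f.max() - f.min()) / 10.0
    y = f + rng.normal(0, sigma, size=N)
    return x, y, sigma

def chi2_min(x, y, K, sigma):
    """Part (b): fit a degree-K polynomial with np.polyfit and return chi^2."""
    coeffs = np.polyfit(x, y, deg=K)
    resid = y - np.polyval(coeffs, x)
    return np.sum(resid ** 2) / sigma ** 2

def bic(x, y, K, sigma):
    """K-dependent terms of the BIC expression above (constants in K dropped)."""
    N = len(x)
    return 0.5 * chi2_min(x, y, K, sigma) + 0.5 * (K + 1) * np.log(N)

# One trial: pick the degree K (up to some maximum) that minimizes the BIC.
rng = np.random.default_rng(0)
x, y, sigma = generate_data(N=20, rng=rng)
ks = np.arange(0, 16)
best_k = ks[np.argmin([bic(x, y, k, sigma) for k in ks])]
print("optimal K for this trial:", best_k)
\end{verbatim}

Parts (c) and (d) then repeat this trial 500 times per value of $N$ and aggregate the selected $K$ values.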
Solution
Problem 4 (Calibration, 1pt) Approximately how long did this homework take you to complete?
Answer: