STAT 435
Homework # 6
Online Submission Via Canvas
Instructions: You may discuss the homework problems in small groups, but you
must write up the final solutions and code yourself. Please turn in your code for the
problems that involve coding. However, for the problems that involve coding, you
must also provide written answers: you will receive no credit if you submit code without written answers. You might want to use Rmarkdown to prepare your assignment.
1. For this problem, you will analyze a data set of your choice, not taken from
the ISLR package. I suggest choosing a data set that has p ≈ n or even p ≫ n,
since you will apply methods from Chapter 6 on this data.
(a) Describe the data in words. Where did you get it from, and what is the
data about? You will perform supervised learning on this data, so you
must identify a response, Y, and features, X1, ..., Xp. What are the values
of n and p? Describe the response and the features (e.g. what are they
measuring; are they quantitative or qualitative?). Plot some summary
statistics of the data.
(b) Split the data into a training set and a test set. What are the values of n
and p on the training set?
(c) Fit a linear model using least squares on the training set, and report the
test error obtained.
(d) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
(e) Fit a lasso model on the training set, with λ chosen by cross-validation.
Report the test error obtained, along with the number of non-zero coefficient estimates.
(f) Fit a principal components regression model on the training set, with M
chosen by cross-validation. Report the test error obtained, along with the
value of M selected by cross-validation.
(g) Fit a partial least squares model on the training set, with M chosen by
cross-validation. Report the test error obtained, along with the value of
M selected by cross-validation.
(h) Comment on the results obtained. How accurate is the best model you
obtained, in terms of test error? Is there much difference among the test
errors resulting from these approaches? Which model do you prefer?
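The workflow shared by parts (b) through (g) (split the data, choose a tuning parameter by cross-validation on the training set only, refit, then report test error) can be sketched as follows. This is a minimal illustration and not a prescribed solution: it uses only the Python standard library, synthetic data, and ridge regression as the example method. All function names (`train_test_split`, `ridge_fit`, `cv_choose_lambda`, and so on), the 70/30 split, the 5 folds, and the λ grid are assumptions made for the example. The features are centred but not standardized, which may differ from what a packaged ridge routine does by default.

```python
import random

def train_test_split(n, test_frac=0.3, seed=0):
    """Randomly partition range(n) into (train_idx, test_idx)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(round(test_frac * n))
    return idx[n_test:], idx[:n_test]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(m):
        piv = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, m):
            f = M[r][c] / M[c][c]
            for k in range(c, m + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * m
    for r in reversed(range(m)):
        s = sum(M[r][k] * x[k] for k in range(r + 1, m))
        x[r] = (M[r][m] - s) / M[r][r]
    return x

def ridge_fit(X, y, lam):
    """Ridge with an unpenalized intercept: centre X and y,
    then solve (X'X + lam * I) beta = X'y. Needs lam > 0 when p >= n."""
    n, p = len(X), len(X[0])
    xbar = [sum(row[j] for row in X) / n for j in range(p)]
    ybar = sum(y) / n
    Xc = [[row[j] - xbar[j] for j in range(p)] for row in X]
    yc = [yi - ybar for yi in y]
    A = [[sum(r[j] * r[k] for r in Xc) + (lam if j == k else 0.0)
          for k in range(p)] for j in range(p)]
    b = [sum(r[j] * yi for r, yi in zip(Xc, yc)) for j in range(p)]
    beta = solve(A, b)
    intercept = ybar - sum(bj * xj for bj, xj in zip(beta, xbar))
    return intercept, beta

def predict(model, X):
    b0, beta = model
    return [b0 + sum(bj * xj for bj, xj in zip(beta, row)) for row in X]

def mse(y, yhat):
    return sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y)

def cv_choose_lambda(X, y, lambdas, k=5, seed=0):
    """Return the lambda with the smallest k-fold CV error, plus all CV errors."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    cv_err = []
    for lam in lambdas:
        errs = []
        for fold in folds:
            held = set(fold)
            tr = [i for i in idx if i not in held]
            model = ridge_fit([X[i] for i in tr], [y[i] for i in tr], lam)
            errs.append(mse([y[i] for i in fold],
                            predict(model, [X[i] for i in fold])))
        cv_err.append(sum(errs) / k)
    best = min(range(len(lambdas)), key=cv_err.__getitem__)
    return lambdas[best], cv_err

# Synthetic data, for illustration only (not a data set suitable for part (a)).
rng = random.Random(1)
n, p = 60, 5
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [1.0 + 2.0 * row[0] - 1.0 * row[1] + rng.gauss(0, 0.5) for row in X]

train, test = train_test_split(n)
Xtr, ytr = [X[i] for i in train], [y[i] for i in train]
Xte, yte = [X[i] for i in test], [y[i] for i in test]

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
best_lam, cv_err = cv_choose_lambda(Xtr, ytr, lambdas)   # uses training data only
model = ridge_fit(Xtr, ytr, best_lam)                    # refit on the full training set
test_mse = mse(yte, predict(model, Xte))
print("chosen lambda:", best_lam, "test MSE:", test_mse)
```

The same skeleton applies to the other methods: swap `ridge_fit` for the lasso, PCR, or PLS fit and replace the λ grid with a grid over M where appropriate. The key design point is that the test set is never touched while the tuning parameter is being chosen.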
2. Define the basis functions b1(X) = I(−1 < X ≤ 1) − (2X − 1)I(1 < X ≤ 3),
b2(X) = (X + 1)I(3 < X ≤ 5) − I(5 < X ≤ 6). We fit the linear regression
model
Y = β0 + β1 b1(X) + β2 b2(X) + ε,
and obtain coefficient estimates β̂0 = 2, β̂1 = −1, β̂2 = 2. Sketch the estimated
curve between X = −3 and X = 8. Note the intercepts, slopes, and other
relevant information.
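One way to check a hand-drawn sketch is to evaluate the fitted curve numerically. The snippet below is a small sketch of that idea: it encodes the two basis functions exactly as defined in the problem and evaluates the fitted curve at a few points between X = −3 and X = 8. The function names `b1`, `b2`, and `f_hat`, and the choice of evaluation points, are assumptions made for the example.

```python
def indicator(cond):
    """I(cond): 1.0 if the condition holds, else 0.0."""
    return 1.0 if cond else 0.0

def b1(x):
    # b1(X) = I(-1 < X <= 1) - (2X - 1) I(1 < X <= 3)
    return indicator(-1 < x <= 1) - (2 * x - 1) * indicator(1 < x <= 3)

def b2(x):
    # b2(X) = (X + 1) I(3 < X <= 5) - I(5 < X <= 6)
    return (x + 1) * indicator(3 < x <= 5) - indicator(5 < x <= 6)

def f_hat(x, beta0=2.0, beta1=-1.0, beta2=2.0):
    """Fitted curve using the coefficient estimates given in the problem."""
    return beta0 + beta1 * b1(x) + beta2 * b2(x)

for x in [-3, -1, 0, 1, 2, 3, 4, 5, 5.5, 6, 7, 8]:
    print(x, f_hat(x))
```

Because each indicator is open on the left and closed on the right, the boundary points X = −1, 1, 3, 5, and 6 belong to the interval on their left; evaluating just either side of each boundary shows where the curve jumps.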