DS 303 Homework 4
Instructions: Homework is to be submitted on Canvas by the deadline stated above. Please
clearly print your name and student ID number on your HW.
Show your work (including calculations) to receive full credit. Please make your submission as readable as possible: this means no raw R output or code (unless it is asked for specifically or needed for clarity).
Code should be submitted with your homework as a separate file (for example, a
.R file, text file, Word file, or .Rmd are all acceptable). You should mark sections of the
code that correspond to different homework problems using comments (e.g. ##### Problem 1
#####).
Problem 1: Best subset selection
The data for this problem comes from a study by Stamey et al. (1989). They examined the
relationship between the level of prostate-specific antigen and a number of clinical measures in men
who were about to receive a radical prostatectomy. The variables are log cancer volume (lcavol),
log prostate weight (lweight), age, log of the amount of benign prostatic hyperplasia (lbph),
seminal vesicle invasion (svi), log of capsular penetration (lcp), Gleason score (gleason), and
percent of Gleason scores 4 or 5 (pgg45). The last column (train) indicates which observations were
used in the training set and which were used in the test set.
Read in the prostate data set using the following code:
prostate = read.table('.../prostate.data', header = TRUE)
In place of '...', specify the path where you saved the dataset.
Our response of interest here is the log prostate-specific antigen (lpsa). We will use this data set
to practice 3 common subset selection approaches.
a. Approach 1: Perform best subset selection on the entire data set with lpsa as the response.
For each model size, you will obtain a 'best' model (size here is just the number of predictors
in the model): M1 is the best model with 1 predictor (size 1), M2 is the best model with 2
predictors (size 2), and so on. Create a table of the AIC, BIC, adjusted R², and Mallows' Cp
for each model size. Report the model with the smallest AIC, smallest BIC, largest adjusted
R², and smallest Mallows' Cp. Do they lead to different results? Using your own judgement,
choose a final model.
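If you would like a starting point for part (a), here is a minimal sketch using regsubsets() from the leaps package. It assumes the data frame is called prostate (as read in above) and drops the train indicator into a new data frame named prostate.sub; AIC is computed from the RSS, up to an additive constant. Adjust names to match your own workspace.
library(leaps)

# drop the train/test indicator before fitting
prostate.sub = prostate[, !(names(prostate) %in% "train")]

# best subset selection over all 8 predictors, with lpsa as the response
regfit.full = regsubsets(lpsa ~ ., data = prostate.sub, nvmax = 8)
reg.summary = summary(regfit.full)

# regsubsets() reports BIC, adjusted R^2, and Mallows' Cp directly;
# AIC can be computed (up to an additive constant) from the RSS
n = nrow(prostate.sub)
p = 1:8
aic = n * log(reg.summary$rss / n) + 2 * (p + 1)

results = data.frame(size = p, AIC = aic, BIC = reg.summary$bic,
                     adjR2 = reg.summary$adjr2, Cp = reg.summary$cp)
results

which.min(aic)                 # size with the smallest AIC
which.min(reg.summary$bic)     # size with the smallest BIC
which.max(reg.summary$adjr2)   # size with the largest adjusted R^2
which.min(reg.summary$cp)      # size with the smallest Cp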
b. Approach 2: The dataset has already been split into a training and test set. Construct your
training and test set based on this split. You may use the following code for convenience:
train = subset(prostate,train==TRUE)[,1:9]
test = subset(prostate,train==FALSE)[,1:9]
For each model size, you will obtain a 'best' model. Fit each of those models on the training
set. Then evaluate each model's performance on the test set by computing its test MSE.
Choose a final model based on prediction accuracy. Fit that model to the full dataset and
report your final model here.
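As a hint for part (b), the sketch below computes the test MSE for each model size by hand, since regsubsets() has no built-in predict() method. It assumes the train and test data frames created by the subset() calls above.
library(leaps)

regfit.train = regsubsets(lpsa ~ ., data = train, nvmax = 8)

# design matrix for the test set, used to form predictions by hand
test.mat = model.matrix(lpsa ~ ., data = test)

test.mse = rep(NA, 8)
for (i in 1:8) {
  coefi = coef(regfit.train, id = i)          # coefficients of the best size-i model
  pred = test.mat[, names(coefi)] %*% coefi   # predictions on the test set
  test.mse[i] = mean((test$lpsa - pred)^2)
}
test.mse
which.min(test.mse)   # model size with the smallest test MSE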
c. Approach 3: This approach is used to select the optimal size, not which predictors will end
up in our model. Split the dataset into k folds (you decide what k should be). We will perform
best subset selection within each of the k training sets. Here are more detailed instructions:
i. For each fold k = 1, ..., K:
1. Perform best subset selection using all the data except for those in fold k (training
set). For each model size, you will obtain a ‘best’ model.
2. For each ‘best’ model, evaluate the test MSE on the data in fold k (test set).
3. Store the test MSE for each model.
Once you have completed this for all k folds, take the average of your test MSEs for
each model size. In other words, for the k models of size 1, you will compute their
k-fold cross-validated error; for the k models of size 2, you will compute their k-fold
cross-validated error; and so on. Report your 8 CV errors here.
ii. Choose the model size that gives you the smallest CV error. Now perform best subset
selection on the full data set again in order to obtain this final model. Report that
model here. (For example, suppose cross-validation selected a 5-predictor model. I
would perform best subset selection on the full data set again in order to obtain the
5-predictor model.)
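A possible template for part (c) is sketched below. It assumes k = 10 folds, a random fold assignment, and the prostate.sub data frame from the part (a) sketch; any choice of k and any consistent fold assignment is fine.
library(leaps)

k = 10
set.seed(1)
n = nrow(prostate.sub)
folds = sample(rep(1:k, length.out = n))   # randomly assign each observation to a fold
cv.errors = matrix(NA, k, 8)               # rows = folds, columns = model sizes

for (j in 1:k) {
  # best subset selection on all the data except fold j
  fit = regsubsets(lpsa ~ ., data = prostate.sub[folds != j, ], nvmax = 8)
  test.mat = model.matrix(lpsa ~ ., data = prostate.sub[folds == j, ])
  for (i in 1:8) {
    coefi = coef(fit, id = i)
    pred = test.mat[, names(coefi)] %*% coefi
    cv.errors[j, i] = mean((prostate.sub$lpsa[folds == j] - pred)^2)
  }
}

cv.mean = apply(cv.errors, 2, mean)   # the 8 cross-validated errors
cv.mean
which.min(cv.mean)                    # model size with the smallest CV error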
Problem 2: Cross-validation
a. Explain how k-fold cross-validation is implemented.
b. What are the advantages and disadvantages of k-fold cross-validation relative to:
i. The validation set approach?
ii. LOOCV?
c. For the following questions, we will perform cross-validation on a simulated data set. Generate
a simulated data set such that Y = X − 2X² + ε, where ε ∼ N(0, 1/2). Fill in the following code:
set.seed(1)
x = rnorm(100)
error = ??
y = ??
d. Set a random seed, and then compute the LOOCV errors that result from fitting the following
4 models using least squares:
M1: a linear model with X
M2: a polynomial regression model with degree 2
M3: a polynomial regression model with degree 3
M4: a polynomial regression model with degree 4
You may find it helpful to use the data.frame() function to create a single data set containing
both X and Y .
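As a hint for part (d), cv.glm() from the boot package performs LOOCV by default when no K is supplied. The sketch below assumes your simulated x and y from part (c) are stored in a data frame called sim.data (a placeholder name).
library(boot)

set.seed(2)                 # any seed of your choosing
loocv.err = rep(NA, 4)
for (d in 1:4) {
  # degree-d polynomial regression fit by least squares (gaussian glm)
  glm.fit = glm(y ~ poly(x, d), data = sim.data)
  loocv.err[d] = cv.glm(sim.data, glm.fit)$delta[1]
}
loocv.err   # LOOCV errors for M1 through M4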
e. Repeat the above step using another random seed, and report your results. Are your results
the same as what you got in (d)? Why?
f. Which of the models in (d) had the smallest LOOCV error? Is this what you expected?
Explain your answer.
g. Comment on the statistical significance of the coefficient estimates that result from fitting
each of the models in (d) using least squares. Do these results agree with the conclusions
drawn based on the cross-validation results?
Problem 3: Forward and backward selection
We will use the College data set in the ISLR2 library to predict the number of applications (Apps)
each university received. Randomly split the data set so that 90% of the data belong to the training
set and the remaining 10% belong to the test set. Implement forward and backward selection on
the training set only. For each approach, report the best model based on AIC. From these 2 models,
pick a final model based on their performance on the test set. Report both models' test MSEs and
summarize your final model.
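One way to set this problem up is with step(), which searches by AIC. The sketch below assumes a 90/10 random split and uses college.train and college.test as placeholder names for the two pieces; regsubsets() with method = "forward" or "backward" is an equally valid route.
library(ISLR2)

set.seed(1)
n = nrow(College)
train.id = sample(1:n, size = round(0.9 * n))   # 90% of rows for training
college.train = College[train.id, ]
college.test = College[-train.id, ]

# forward selection: start from the intercept-only model, search up to the full model
null.fit = lm(Apps ~ 1, data = college.train)
full.fit = lm(Apps ~ ., data = college.train)
fwd = step(null.fit, scope = formula(full.fit), direction = "forward", trace = 0)

# backward selection: start from the full model
bwd = step(full.fit, direction = "backward", trace = 0)

# compare the two selected models on the test set
test.mse = function(fit) mean((college.test$Apps - predict(fit, newdata = college.test))^2)
c(forward = test.mse(fwd), backward = test.mse(bwd))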