DS 303 Homework 2
Instructions: Homework is to be submitted on Canvas by the deadline stated above. Please
clearly print your name and student ID number on your HW.
Show your work (including calculations) to receive full credit. Please work hard to make your
submission as readable as you possibly can - this means no raw R output or code (unless
it is asked for specifically or needed for clarity).
Code should be submitted with your homework as a separate file (for example, a
.R file, text file, Word file, or .Rmd are all acceptable). You should mark sections of the
code that correspond to different homework problems using comments (e.g. ##### Problem 1
#####).
Problem 1: Multiple Testing Problem
Design and implement a simulation study to illustrate the multiple testing problem. Generate 1000
observations for 200 predictors (X1, X2, . . . , X200). Then generate 1000 Y observations such that
Y has a relationship with only 5 of the 200 predictors. Explicitly:
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \beta_4 X_{4i} + \beta_5 X_{5i} + \epsilon_i$ $(i = 1, \dots, n)$, $\epsilon_i \sim N(0, \sigma^2)$.
Decide on the values of the parameters and report them (do not forget σ). Fit a multiple linear
regression model on all 200 predictors and report the number of individual t-tests that are significant
at α = 0.05. Use this example to explain (in plain language, no statistics terminology), why we
cannot depend on individual t-tests to tell us whether or not there is a relationship between at least
one of the predictors and the response Y . Discuss the implications of the multiple testing problem
on real applications outside the context of supervised learning. What tools are available to us to
resolve this issue? Please make sure to submit your R code to receive full credit.
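One possible skeleton for such a simulation is sketched below. Every numeric choice in it (the seed, the coefficient values, σ = 2, standard normal predictors) is an assumption made for illustration and is not prescribed by the assignment. The multiplicity corrections at the end use `p.adjust()` from base R, which is only one of the available tools.

```r
# Sketch of a multiple-testing simulation; all parameter values below are
# illustrative assumptions, not values prescribed by the assignment.
set.seed(303)

n <- 1000   # observations
p <- 200    # predictors
sigma <- 2  # assumed error standard deviation

# Predictors: independent standard normal draws (an assumption).
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
colnames(X) <- paste0("X", 1:p)

# Only the first 5 predictors have nonzero coefficients.
beta0 <- 1
beta <- c(2, -1.5, 1, 0.8, -0.5, rep(0, p - 5))
y <- as.vector(beta0 + X %*% beta + rnorm(n, mean = 0, sd = sigma))

dat <- data.frame(y = y, X)
fit <- lm(y ~ ., data = dat)

# p-values of the 200 individual t-tests, dropping the intercept row.
pvals <- summary(fit)$coefficients[-1, "Pr(>|t|)"]

n_signif <- sum(pvals < 0.05)       # all t-tests significant at 0.05
n_false  <- sum(pvals[6:p] < 0.05)  # significant among the 195 null predictors

# Two standard corrections available in base R via p.adjust().
n_bonf <- sum(p.adjust(pvals, method = "bonferroni") < 0.05)
n_bh   <- sum(p.adjust(pvals, method = "BH") < 0.05)

cat("Significant at 0.05 (unadjusted):", n_signif, "\n")
cat("Of these, predictors with a true zero coefficient:", n_false, "\n")
cat("Significant after Bonferroni:", n_bonf, "\n")
cat("Significant after Benjamini-Hochberg:", n_bh, "\n")
```

Rerunning the sketch with different seeds shows how much the count of significant null predictors varies from one simulated data set to the next.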
Problem 2: Review of regression concepts
Evaluate if the following statements are true or false and justify your answer.
a. When asked to state the true population regression model, a fellow student writes it as follows:
$E(Y_i) = \beta_0 + \beta_1 x_i + \epsilon_i$ $(i = 1, \dots, n)$.
b. The RSS (defined as $\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$) must increase every time we add a predictor to the
model.
c. For a given dataset, the training MSE will always be smaller than the test MSE.
DS 303: Homework 2 1 Fall 2021
d. The expected test MSE is defined as: $E(y_0 - \hat{f}(x_0))^2$. Here $y_0$ is from our training set and
$\hat{f}()$ is the model we built from our training set. We evaluate $\hat{f}(x_0)$ on the $x_0$ values from our
test set.
e. The bias-variance decomposition tells us that sometimes reducing the complexity of our model
(for example, removing a predictor), can actually improve our expected test MSE.
f. When carrying out a hypothesis test, we (the user) set the type I error we’re willing to accept.
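Several of these statements can be probed empirically before writing a justification. The sketch below is one way to do so on simulated data. The variable names, sample sizes, and the data-generating model are all assumptions for illustration. It computes the RSS of two nested models and the training and test MSE of each.

```r
# Illustrative check of how RSS and MSE behave; the data are simulated and all
# names and parameter values are assumptions made for this sketch.
set.seed(1)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)               # pure noise, unrelated to y
y <- 1 + 2 * x1 + rnorm(n)
dat <- data.frame(y, x1, x2)

train <- dat[1:100, ]
test  <- dat[101:200, ]

fit_small <- lm(y ~ x1, data = train)
fit_big   <- lm(y ~ x1 + x2, data = train)

# Residual sum of squares on the data used to fit the model.
rss <- function(fit) sum(residuals(fit)^2)

# Mean squared error of a fitted model on an arbitrary data frame.
mse <- function(fit, newdata) mean((newdata$y - predict(fit, newdata))^2)

rss_small <- rss(fit_small)
rss_big   <- rss(fit_big)

cat("RSS, y ~ x1:      ", rss_small, "\n")
cat("RSS, y ~ x1 + x2: ", rss_big, "\n")
cat("Train MSE (small, big):", mse(fit_small, train), mse(fit_big, train), "\n")
cat("Test MSE  (small, big):", mse(fit_small, test), mse(fit_big, test), "\n")
```

A single run is only one draw. Changing the seed or the train/test split shows which comparisons hold every time and which hold only on some data sets.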
Problem 3: Statistical Inference
For this problem, we will use the Carseats data set which is part of the ISLR2 package. To access
the data set, load the ISLR2 package into your R session:
library(ISLR2) #you will need to do this every time you open a new R session.
To get a snapshot of the data, run head(Carseats). To find out more about the data set, we can
type ?Carseats.
We will now try to predict carseat unit sales (in thousands) using the other variables in this data
set.
a. Fit a multiple linear regression model to predict carseat unit sales (in thousands) using all
other variables as your predictors. What are the least-squares estimates and their standard
errors? Summarize your output in a table.
b. Assume that our random errors ($\epsilon_i$) are normally distributed. Carry out the F-test at α =
0.05. Write out the null/alternative hypothesis, test statistic, null distribution, p-value, and
conclusion.
c. Choose one regression coefficient and test whether it is zero or not at α = 0.05. Write out
the null/alternative hypothesis, test statistic, null distribution, p-value, and conclusion.
d. Obtain an estimate for $\sigma^2$.
e. Interpret the $R^2$ from the fitted model.
f. Interpret the regression coefficients associated with Shelving Location.
g. Use the model to predict carseat unit sales when the price charged by the competitor is average
(you’ll need to find what the average competitor price is), community income level is at its median,
advertising is 15, population is 500, price for car seats at each site is 50, shelving location is
good, average age of local population is 30, education level is 10, and the store is in an urban
location within the US. What is your prediction for Y given these predictors? Construct an
appropriate interval to quantify the uncertainty surrounding this prediction. Set α = 0.01.
h. Use the model to predict carseat unit sales when the price charged by the competitor is average
(you’ll need to find what the average competitor price is), community income level is at its median,
advertising is 15, population is 500, price for car seats at each site is 50, shelving location is
good, average age of local population is 30, education level is 10, and the store is in an urban
location within the US. What is your estimate for f(X) given these predictors? Construct an
appropriate interval to quantify the uncertainty surrounding this estimation. Set α = 0.01.
i. Compare your results in (g) and (h). What do you observe? Explain why. Your explanation
should include a discussion of reducible and irreducible error.
j. Obtain the predicted carseat unit sales for all the same predictor values as in part (g), but
set the price for car seats at each site to be 450. What is your prediction for Y ? Does this
make sense? Discuss how this reveals the limitations of our model.
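The mechanics behind parts (a), (d), (g), (h), and (j) can be sketched on stand-in data so that the example runs without the ISLR2 package. The data frame `stores`, its columns, and all numeric values below are hypothetical and only mimic a numeric response with one numeric predictor and one three-level factor. For the homework itself, the same calls would presumably be applied to a model fitted on `Carseats`.

```r
# Sketch of the inference workflow on simulated stand-in data. The data frame
# `stores` and its columns are hypothetical, not part of the Carseats data.
set.seed(42)
n <- 300
stores <- data.frame(
  price = runif(n, 50, 150),
  shelf = factor(sample(c("Bad", "Good", "Medium"), n, replace = TRUE))
)
shelf_effect <- c(Bad = 0, Good = 4, Medium = 2)
stores$sales <- 12 - 0.05 * stores$price +
  unname(shelf_effect[as.character(stores$shelf)]) + rnorm(n, sd = 1)

fit <- lm(sales ~ price + shelf, data = stores)

# Least-squares estimates and their standard errors, as asked in part (a).
coef_table <- summary(fit)$coefficients[, c("Estimate", "Std. Error")]
print(round(coef_table, 4))

# Estimate of sigma^2: RSS divided by the residual degrees of freedom.
sigma2_hat <- sum(residuals(fit)^2) / df.residual(fit)

# One new observation; factor levels must match those used in fitting.
new_store <- data.frame(
  price = mean(stores$price),
  shelf = factor("Good", levels = levels(stores$shelf))
)

# alpha = 0.01 corresponds to level = 0.99 in predict().
pred_int <- predict(fit, newdata = new_store,
                    interval = "prediction", level = 0.99)  # interval for Y
conf_int <- predict(fit, newdata = new_store,
                    interval = "confidence", level = 0.99)  # interval for f(X)
print(pred_int)
print(conf_int)

# Prediction far outside the observed price range, as in part (j).
far_store <- transform(new_store, price = 450)
far_pred <- predict(fit, newdata = far_store)
print(far_pred)
```

The `interval` argument of `predict()` is what distinguishes an interval for a new response from an interval for the mean response. Comparing the two printed intervals is a natural starting point for part (i).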