Homework 1
E6690: Statistical Learning for Bio & Info Systems
P1. Let
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \qquad\text{and}\qquad S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2 .$$
Show that:
(a) (2pt) $\sum_{i=1}^{n} X_i^2 = (n-1)S^2 + n\bar{X}^2$
(b) (2pt) If $X_1, X_2, \ldots, X_n$ are independent and identically distributed (i.i.d.), then $S^2$ is an unbiased estimator of $\sigma^2$, i.e., $E S^2 = \sigma^2$.
In the following, in addition to the above, assume that the $X_i$'s have a normal/Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$.
(c) (3pt) Show (prove) that $\bar{X}$ is independent of $X_i - \bar{X}$, $i = 1, 2, \ldots, n$.
(Hint: Both $\bar{X}$ and $X_i - \bar{X}$ are normal.)
(d) (3pt) Show (prove) that the sample mean, $\bar{X}$, is independent of the sample variance, $S^2$.
P2. (10pt) Show that in the case of simple linear regression between Y and X, the $R^2$ statistic is equal to the square of the correlation coefficient between X and Y ($r^2$). For simplicity, you may assume that $\bar{y} = \bar{x} = 0$.
Recall that
$$R^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \qquad\text{and}\qquad r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}} .$$
P3. (20pt; each bullet 2pt) Create some simulated data and fit simple linear regression models to it. Make
sure to use set.seed(1) prior to starting part (a) to ensure consistent results.
(a) Using the rnorm() function, create a vector, x, containing 100 observations drawn from a $\mathcal{N}(0, 1)$ distribution. This represents a feature, X.
(b) Using the rnorm() function, create a vector, eps, containing 100 observations drawn from a $\mathcal{N}(0, 0.25)$ distribution.
(c) Using x and eps, generate a vector y according to the model
$$Y = -1 + 0.5X + \epsilon .$$
What is the length of the vector y? What are the values of $\beta_0$ and $\beta_1$ in this linear model?
(d) Create a scatterplot displaying the relationship between x and y. Comment on what you observe.
(e) Fit a least squares linear model to predict y using x. Comment on the model obtained. How do $\hat{\beta}_0$ and $\hat{\beta}_1$ compare to $\beta_0$ and $\beta_1$?
(f) Display the least squares line on the scatterplot obtained in (d). Draw the population regression line on
the plot, in a different color. Use the legend() command to create an appropriate legend.
(g) Now fit a polynomial regression model that predicts y using x and $x^2$. Is there evidence that the quadratic term improves the model fit? Explain your answer.
(h) Repeat (a)-(f) after modifying the data generation process in such a way that there is less noise in the
data. The model in (c) should remain the same. You can do this by decreasing the variance of the normal
distribution used to generate the error term in (b). Describe your results.
(i) Repeat (a)-(f) after modifying the data generation process in such a way that there is more noise in the data. The model in (c) should remain the same. You can do this by increasing the variance of the normal distribution used to generate the error term in (b). Describe your results.
(j) What are the confidence intervals for $\beta_0$ and $\beta_1$ based on the original data set, the noisier data set, and the less noisy data set? Comment on your results. (You could use the confint() function.)
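A minimal R sketch for parts (a)-(f), assuming $\mathcal{N}(0, 0.25)$ means variance 0.25 (so rnorm() is called with sd = 0.5); the colors and legend placement are illustrative choices only:

set.seed(1)
x   <- rnorm(100)                      # (a) feature X ~ N(0, 1)
eps <- rnorm(100, mean = 0, sd = 0.5)  # (b) noise with variance 0.25 (assumed reading)
y   <- -1 + 0.5 * x + eps              # (c) population model Y = -1 + 0.5 X + eps
plot(x, y)                             # (d) scatterplot of the simulated data
fit <- lm(y ~ x)                       # (e) least squares fit
summary(fit)
abline(fit, col = "red")               # (f) fitted line ...
abline(a = -1, b = 0.5, col = "blue")  #     ... and population regression line
legend("topleft", legend = c("least squares", "population"),
       col = c("red", "blue"), lty = 1)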
P4. (10pt) Using R and the Advertising data set, find 92% confidence intervals for $\beta_0$ and $\beta_1$ for three single-feature linear regressions of Sales versus Newspaper, TV, and Radio, respectively. Then, create a scatterplot for each of them with the 92% confidence interval lines, i.e., draw the lines that correspond to the ends of the confidence intervals for $(\beta_0, \beta_1)$. The answer should include the R code and graphs.
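One possible sketch, shown for the TV regression only; the file name and the column capitalization (Sales, TV, Radio, Newspaper) are assumptions and may need adjusting to the local copy of the data:

adv <- read.csv("Advertising.csv")                 # assumed file name/location
fit.tv <- lm(Sales ~ TV, data = adv)
ci <- confint(fit.tv, level = 0.92)                # 92% CIs: row 1 = beta0, row 2 = beta1
plot(adv$TV, adv$Sales)
abline(a = ci[1, 1], b = ci[2, 1], col = "blue")   # line from the lower CI endpoints
abline(a = ci[1, 2], b = ci[2, 2], col = "red")    # line from the upper CI endpoints
# repeat with Radio and Newspaper in place of TV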
P5. Consider the Auto data set:
(a) (5pt) Produce a scatterplot matrix which includes all of the pairs of variables in the data set.
(b) (5pt) Compute the matrix of correlations between the variables using the function cor(). You will need
to exclude the name variable, which is qualitative.
(c) (5pt) Use the lm() function to perform a multiple linear regression with mpg as the response and all other
variables except name as the predictors. Use the summary() function to print the results. Comment on
the output. For instance:
i. Is there a relationship between the predictors and the response?
ii. Which predictors appear to have a statistically significant relationship to the response?
iii. What does the coefficient for the year variable suggest?
(d) (5pt) Try a few different transformations of the variables, such as $\log(X)$, $\sqrt{X}$, $X^2$. Comment on your findings.
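A starting sketch in R, assuming the Auto data set comes from the ISLR package (it could equally be read in from Auto.csv):

library(ISLR)                                  # provides the Auto data set (assumed source)
pairs(Auto)                                    # (a) scatterplot matrix of all variables
cor(subset(Auto, select = -name))              # (b) correlations, excluding the qualitative name
fit <- lm(mpg ~ . - name, data = Auto)         # (c) mpg on all predictors except name
summary(fit)
summary(lm(mpg ~ log(horsepower), data = Auto))               # (d) example transformations
summary(lm(mpg ~ horsepower + I(horsepower^2), data = Auto))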
P6. (10pt) A data set has $n = 20$,
$$\sum_{i=1}^{20} x_i = 8.552, \quad \sum_{i=1}^{20} y_i = 398.2, \quad \sum_{i=1}^{20} x_i^2 = 5.196, \quad \sum_{i=1}^{20} y_i^2 = 9356, \quad\text{and}\quad \sum_{i=1}^{20} x_i y_i = 216.6 .$$
Calculate $\hat{\beta}_0$, $\hat{\beta}_1$ and $\hat{\sigma}^2$. What is the fitted value when $x = 0.5$? Compute $R^2$.
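A sketch of how these quantities can be computed in R from the given sums, using the standard simple linear regression formulas (variable names are arbitrary):

n   <- 20
sx  <- 8.552; sy <- 398.2
sxx <- 5.196; syy <- 9356; sxy <- 216.6
xbar <- sx / n; ybar <- sy / n
Sxx  <- sxx - n * xbar^2               # centered sums of squares and products
Syy  <- syy - n * ybar^2
Sxy  <- sxy - n * xbar * ybar
b1   <- Sxy / Sxx                      # slope estimate
b0   <- ybar - b1 * xbar               # intercept estimate
rss  <- Syy - b1 * Sxy                 # residual sum of squares
sig2 <- rss / (n - 2)                  # estimate of sigma^2
yhat <- b0 + b1 * 0.5                  # fitted value at x = 0.5
r2   <- 1 - rss / Syy                  # R^2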
P7. (10pt) The multiple linear regression model
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6$$
is fitted to a data set of $n = 45$ observations. The total sum of squares is TSS = 11.62, and the residual sum of squares is RSS = 8.95. What is the p-value for the null hypothesis
$$H_0 : \beta_1 = \beta_2 = \beta_3 = \beta_4 = \beta_5 = \beta_6 = 0 \; ?$$
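The p-value comes from the usual overall F-test; a sketch of the computation in R, with $p = 6$ predictors here:

n <- 45; p <- 6
tss <- 11.62; rss <- 8.95
f <- ((tss - rss) / p) / (rss / (n - p - 1))          # F-statistic for H0: all slopes are zero
pf(f, df1 = p, df2 = n - p - 1, lower.tail = FALSE)   # p-value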
Extra Credit
Under normality assumptions we can compute the distributions of many quantities explicitly.
E1. (5pt) Chi-squared distribution. Let $X_1, X_2, \ldots, X_n$ be independent standard normal random variables and recall that a chi-squared random variable with $n$ degrees of freedom is defined as $\chi^2_n = X_1^2 + X_2^2 + \cdots + X_n^2$.
Prove that the density of $\chi^2_n$ is given by
$$g_n(x) = \frac{1}{\Gamma(n/2)\, 2^{n/2}}\, x^{n/2 - 1} e^{-x/2} ,$$
where $\Gamma(x)$ is the gamma function. (Hint: Prove it first for $n = 1, 2$, and then use mathematical induction.)
E2. (5pt) Let $X_1, X_2, \ldots, X_n$ be independent normal random variables $\mathcal{N}(\mu, \sigma^2)$. Prove that
$$\frac{(n-1) S^2}{\sigma^2} \overset{d}{=} \chi^2_{n-1} ,$$
where $\overset{d}{=}$ stands for equality in distribution.
(Hint: Derive the moment generating function of $\chi^2_n$ and use problems P1.(a) and P1.(d).)
E3. (5pt) Student's t distribution. Let $t_n$ be a Student's t variable, defined as
$$t_n = \frac{Z}{\sqrt{\chi^2_n / n}} ,$$
where $Z \sim \mathcal{N}(0, 1)$ is independent of $\chi^2_n$. Prove that $t_n$ has the density
$$f_n(t) = \frac{\Gamma((n+1)/2)}{\sqrt{\pi n}\, \Gamma(n/2)} \cdot \frac{1}{(1 + t^2/n)^{(n+1)/2}} ,$$
where $\Gamma(x)$ is the gamma function. Show that for large values of $n$, $f_n(t)$ is approximately the standard normal density, $f_n(t) \approx e^{-t^2/2} / \sqrt{2\pi}$. (Hint: First show that the conditional density (distribution) of $t_n$ given $\chi^2_n = x$ is normal with mean 0 and variance $n/x$. Then, use problem E1 to integrate this conditional density against the density of $\chi^2_n$.)
E4. (5pt) F (Fisher) distribution. Let $U$ and $V$ be two independent chi-squared random variables with degrees of freedom $n_1$ and $n_2$, and define the random variable $F \equiv F(n_1, n_2)$ as
$$F = \frac{U/n_1}{V/n_2} .$$
Show that the density of $F$ is given by
$$f_{n_1, n_2}(w) = \frac{(n_1/n_2)^{n_1/2}\, \Gamma[(n_1+n_2)/2]\, w^{(n_1/2)-1}}{\Gamma[n_1/2]\, \Gamma[n_2/2]\, [1 + (n_1 w / n_2)]^{(n_1+n_2)/2}} .$$
(Hint: Compute first the distribution of $F$ given $V$.)