Instructions: This assignment consists of 4 problems. If you cannot make it to class, please leave the assignment under the door at Whitehead Hall 306E and email the course instructor. If possible, please type up your assignments, preferably using LaTeX.

Problem 1: (10pts) Let $y = X\beta + \epsilon$ be a linear model where $X$ is of size $n \times p$ and the error terms $\epsilon$ are independent and normally distributed with mean $0$ and variance $\sigma^2$. Suppose furthermore that the columns of $X$ can be partitioned as $X = [W \;\; Z]$, where $W$ is of size $n \times q$ and of full column rank and $Z$ is of size $n \times (p - q)$ and of full column rank, for some $q$ satisfying $1 \le q \le p$, and that $W^T Z = 0$. We now partition $\beta$ as $\beta = \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix}$, where $\beta_1$ is of size $q \times 1$ and $\beta_2$ is of size $(p - q) \times 1$. Let $\hat{\beta} = \begin{bmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{bmatrix}$ be the least squares estimate of $\beta$.
(a) Show that $\hat{\beta}_1 = (W^T W)^{-1} W^T y$ and $\hat{\beta}_2 = (Z^T Z)^{-1} Z^T y$.
(b) Show that $\hat{\beta}_1$ and $\hat{\beta}_2$ are independent.
(c) Let $a$ be a $q \times 1$ vector and $b$ be a $(p - q) \times 1$ vector. Let $(l_1, u_1)$ and $(l_2, u_2)$ be the individual 95% confidence intervals for $a^T \beta_1$ and $b^T \beta_2$ based on $\hat{\beta}_1$ and $\hat{\beta}_2$, respectively. Is the confidence interval $(l_1, u_1)$ independent of the confidence interval $(l_2, u_2)$? Justify your answer.

Problem 2: (10pts) Let $W$ be an $n \times p$ matrix and $X$ be an $n \times q$ matrix such that $\mathcal{C}(W) \subseteq \mathcal{C}(X)$. Denote by $P_X$ and $P_W$ the symmetric idempotent matrices projecting onto $\mathcal{C}(X)$ and $\mathcal{C}(W)$, respectively. Show that $P_X - P_W$ is the symmetric orthogonal projection onto $\mathcal{C}((I - P_W)X)$. You can do it by arguing as follows.
• First show that $P_X - P_W$ is idempotent. Hint: $P_X P_W z = P_W z$ for all $z$; in addition, $P_W P_X = (P_X P_W)^T$.
• Next, show that for any $z$, $(P_X - P_W)z \in \mathcal{C}((I - P_W)X)$. Hint: Since $\mathcal{C}(X)$ and $\mathcal{N}(X^T)$ are orthogonal complements, any vector $z \in \mathbb{R}^n$ can be written as $z = Xv + w$ for some vector $v$ and some $w \in \mathcal{N}(X^T)$; what is the relationship between $\mathcal{N}(W^T)$ and $\mathcal{N}(X^T)$?
• Finally, show that if $z \in \mathcal{C}((I - P_W)X)^{\perp}$ then $(P_X - P_W)z = 0$.

Problem 3: (20pts) The kidiq.dta dataset is available from the URL http://www.stat.columbia.edu/~gelman/arm/examples/child.iq/kidiq.dta accompanying the book “Data Analysis using Regression and Multilevel/Hierarchical Models” by Gelman and Hill. The dataset contains observations from a sample of 434 children. The variables include the child’s cognitive test score at age 3 or 4, whether the mother finished high school (coded as 1) or not (coded as 0), the mother’s IQ, the age of the mother at the child’s birth, and whether the mother worked during the first three years of the child’s life. More specifically, the variable mom_work takes on the value
• mom_work = 1 if mother did not work in first three years of child’s life
• mom_work = 2 if mother worked in second or third year of child’s life
• mom_work = 3 if mother worked part-time in first year of child’s life
• mom_work = 4 if mother worked full-time in first year of child’s life
After downloading the kidiq.dta file, you can read the data into R using the following snippet of code:

    library("foreign")
    iq.data <- read.dta("kidiq.dta")

Using this dataset, answer the following questions.
(a) Perform a regression with kid_score as the response variable and the remaining variables, except mom_hs, as predictor variables.
(b) Provide a quick discussion of the coefficients for the predictor variables. What do they say?
(c) Using the model in part (a), test the hypothesis that the predictor variables mom_work and mom_age are associated with the response variable. When do you recommend mothers should give birth? What are your assumptions for making this recommendation?
(d) What happens when you add mom_hs as a predictor variable to the model in part (a)? Has your conclusion about the timing of birth changed?
(e) Using the model in part (d), perform some diagnostics, e.g., check the constant variance assumption and the normality of errors. Look for outliers, influential points, and points with high leverage.
(f) Consider augmenting the model in part (d) to include interactions between, say, mom_hs and mom_age, or between, say, mom_work and mom_age. Write down the “formula” for the resulting model and discuss how it differs from the “formula” for the model in part (d). Test the hypothesis that the interaction term in the augmented model is not significant.

Problem 4: (20pts) The file at http://www.amstat.org/publications/jse/v16n3/kuiper.xls contains a dataset collected from the Kelley Blue Book for several hundred 2005 used GM cars. Do something with this data. This is meant to be an open-ended question. For some ideas of the kind of analysis one can attempt, see the article http://www.amstat.org/publications/jse/v16n3/datasets.kuiper.html.
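
As in Problem 3, the Problem 4 data can be read into R once the file has been downloaded. The following is only a minimal sketch and not part of the original assignment: it assumes the spreadsheet has been saved as kuiper.xls in the working directory and that the readxl package is installed. The helper name read_kuiper and the object name car.data are hypothetical, chosen for illustration.

```r
# Hypothetical helper: read the Problem 4 spreadsheet into a data frame.
# Assumes the readxl package is installed and the file was saved locally.
read_kuiper <- function(path = "kuiper.xls") {
  as.data.frame(readxl::read_excel(path))
}

# Only attempt to read the data if the file is in the working directory.
if (file.exists("kuiper.xls")) {
  car.data <- read_kuiper("kuiper.xls")
  str(car.data)             # variable names and types
  print(summary(car.data))  # quick summaries of each column
}
```

Inspecting the output of str() before fitting anything is a reasonable first step, since the choice of response and predictor variables is left open in this problem.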