Regression Assignment
Assignment Overview
You will perform 3 regression analyses in this assignment. For each analysis you will
• Fit the simple and the multivariable model.
• Test for a potential confounder Z.
• Interpret ALL regression coefficients (including the intercept) of the multivariable model.
Remember: to test whether Z confounds the relationship between X and Y, the relationship between
X and Y must first be significant.
1. Multiple Linear Regression: Q ∼ Q + Z
• One quantitative predictor and a potential binary confounder.
2. Logistic Regression: logit(B) ∼ Q + Z
• One quantitative predictor and a potential binary confounder.
• Your binary response variable Y must be coded as 1 (event) and 0 (non-event).
3. Any of the above analyses (or a new model) with a categorical (more than 2 levels) predictor variable
e.g.: Q ∼ X + C.
• You do NOT need to test for a potential confounder here.
• Review detailed instructions in the example below. You have to write out the mathematical model.
Instructions
1. Identify variables under consideration
• Determine a third variable Z that you want to test as a potential confounder.
2. Write out the null, alternative, and confounder hypothesis statements.
• Null - that there is no relationship between response and explanatory variables
• Alternative - that there is a relationship between response and explanatory variables.
• Confounder - that there is a relationship between response and explanatory variables after
controlling for the confounding variable.
3. Fit the simple bivariate model
• Model the response variable on the explanatory variable y ~ x
• Write a simple sentence on whether or not there is a relationship.
• For models where you must test for a confounder, this relationship must be significant.
4. Fit the multivariable model.
• Model the response variable on the explanatory variable and the third variable. y ~ x + z
• Determine if Z is a confounder by looking at the p-value for the explanatory variable.
– If it is still significant, the third variable is not a confounding variable.
– If it is no longer significant, the third variable is a confounding variable. This means that the
third variable is explaining the relationship between the explanatory variable and the response
variable.
• Assess model fit by examining the residuals.
5. Interpret all regression coefficients.
6. Write a conclusion.
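A minimal R sketch of this workflow follows. The names mydata, y, x, and z are placeholders for your own data frame and variables, not objects from the example data:
simple.mod <- lm(y ~ x, data = mydata)   # Step 3: simple bivariate model
summary(simple.mod)                      # is the coefficient for x significant?

multi.mod <- lm(y ~ x + z, data = mydata)   # Step 4: add the potential confounder z
summary(multi.mod)                           # is x still significant after adjusting for z?

hist(resid(multi.mod))       # residuals roughly symmetric and centered at zero?
plot(multi.mod, which = 1)   # residuals vs fitted: look for constant spread, no pattern

# For the logistic analysis, replace lm() with glm(..., family = 'binomial')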
Multiple Linear Regression
1. Identify variables
If you have a Likert “Strongly Agree” to “Strongly Disagree” variable with at least 5 levels, you can treat
it as a pseudo-quantitative variable for this assignment.
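If that Likert item is stored as a factor, one way to get a numeric score is sketched below. The variable name likert_item is hypothetical, and this assumes the factor levels are already ordered from one end of the scale to the other:
# Convert an ordered Likert factor to a 1-5 numeric score (level order becomes the score)
addhealth$likert_num <- as.numeric(addhealth$likert_item)
table(addhealth$likert_item, addhealth$likert_num)   # verify the mapping before using it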
• Quantitative outcome: Income (variable income).
• Quantitative predictor: Time you wake up in the morning (variable wakeup).
• Binary confounder: Gender (variable female_c).
2. State hypotheses
• Null: There is no relationship between the time you wake up and your personal earnings.
• Alternative: There is a relationship between the time you wake up and your personal earnings.
• Confounder: There is still a relationship between the time you wake up and your personal earnings,
after controlling for gender.
3. Fit the simple model
Is there a relationship between income and time a person wakes up?
lm.mod1 <- lm(income ~ wakeup, data=addhealth)
summary(lm.mod1)
##
## Call:
## lm(formula = income ~ wakeup, data = addhealth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31297 -15622 -5622 9245 209865
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 43548.4 1126.2 38.667 < 2e-16 ***
## wakeup -487.7 151.2 -3.225 0.00127 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24670 on 3812 degrees of freedom
## (2690 observations deleted due to missingness)
## Multiple R-squared: 0.002721, Adjusted R-squared: 0.002459
## F-statistic: 10.4 on 1 and 3812 DF, p-value: 0.001271
The estimate of the regression coefficient for wakeup is significant (b1 = -488, p = 0.001). There is reason to
believe that the time you wake up is associated with your income.
4. Fit the multivariable model
Fit the same multiple linear regression model and include the potential confounding variable. Determine if
the third variable is a confounder.
lm.mod2 <- lm(income ~ wakeup + female_c, data=addhealth)
summary(lm.mod2)
##
## Call:
## lm(formula = income ~ wakeup + female_c, data = addhealth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36047 -15141 -5252 8678 205610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48669.4 1206.9 40.325 < 2e-16 ***
## wakeup -611.3 149.4 -4.092 4.37e-05 ***
## female_cFemale -8527.1 789.3 -10.803 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24300 on 3810 degrees of freedom
## (2691 observations deleted due to missingness)
## Multiple R-squared: 0.03236, Adjusted R-squared: 0.03185
## F-statistic: 63.7 on 2 and 3810 DF, p-value: < 2.2e-16
The relationship between income and wake up time is still significant after controlling for gender. Gender is
not a confounder.
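Step 4 also asks you to assess model fit by examining the residuals. One way to do that for this model (a sketch, not the only acceptable approach):
# Histogram of residuals: look for rough symmetry around zero
hist(resid(lm.mod2), main = "Residuals: income ~ wakeup + female_c")

# Residuals vs fitted values: look for constant spread and no obvious pattern
plot(lm.mod2, which = 1)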
Optional: the stargazer package prints regression models side by side as columns, which is great for comparisons and looks like a journal-article table. Your R code chunk header must look like this: ```{r,
results='asis'}, and be sure to use the correct output format: type='html' or type='latex'.
Vignette found at: https://www.jakeruss.com/cheatsheets/stargazer/
library(stargazer)
stargazer(lm.mod1, lm.mod2, type='latex', ci=TRUE, single.row=TRUE, digits=0,
omit.stat="rsq", column.labels=c("SLR", "MLR"), model.numbers=FALSE)
Table 1: Dependent variable: income

                        SLR                            MLR
wakeup                  −488∗∗∗ (−784, −191)           −611∗∗∗ (−904, −319)
female_cFemale                                         −8,527∗∗∗ (−10,074, −6,980)
Constant                43,548∗∗∗ (41,341, 45,756)     48,669∗∗∗ (46,304, 51,035)
Observations            3,814                          3,813
Adjusted R2             0                              0
Residual Std. Error     24,666 (df = 3812)             24,297 (df = 3810)
F Statistic             10∗∗∗ (df = 1; 3812)           64∗∗∗ (df = 2; 3810)
Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01
5. Interpret the regression coefficients of the multivariable model.
• b0: For a male (gender=0) who wakes up at midnight (wakeup=0), their predicted average income is
$48,669.4 (95% CI $46,303.9, $51,035)
• b1: Holding gender constant, for every hour later a person wakes up, their predicted average income
drops by $611 (95% CI $319, $904).
• b2: Controlling for the time someone wakes up in the morning, the predicted average income for females
is $8,527 (95% CI $6,980, $10,074) lower than for males.
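The confidence intervals quoted above come from the stargazer table; you can also pull them straight from the fitted model, which should agree with the table up to rounding:
# 95% confidence intervals for the multivariable model coefficients
round(confint(lm.mod2), 1)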
6. Template Conclusion
• Use the numerical results from the multivariable model to fill in the values in the conclusion below.
• Look at the Adjusted R² to see how much of the variance in the response you are accounting for with
the predictor.
Replace the bold words with your variables, the highlighted words with data from your analysis, and
choose between conclusion options.
After adjusting for the potential confounding factor of third variable, explanatory variable
(b1 = parameter estimate, CI confidence interval range, p = significance value) was
significantly/not significantly and positively/negatively associated with response variable.
Approximately R-squared*100% of the variance of response can be accounted for by explanatory after
controlling for third variable. Based on these analyses, third variable is not/is a confounding factor
because the association between explanatory and response is still/is no longer significant after accounting
for third variable.
So the conclusion for this analysis reads:
After adjusting for the potential confounding factor of gender, wake up time (b1 = -611,
95% CI: (-904, -318), p<.0001) was significantly and negatively associated with income.
Approximately 3.2% of the variance of income can be accounted for by wake up time after
controlling for gender. Based on these analyses, gender is not a confounding factor because the
association between wake up time and income is still significant after accounting for gender.
Here is a second example, from a different analysis, after wordsmithing/editing for sentence flow:
After adjusting for the potential confounding factor of gender, an adolescent’s weight (Beta =
1.34, 95% CI -0.53, 3.21, p = .1558) was not significantly associated with the number of cigarettes
smoked in the past 30 days. Approximately 0.78% of the variance in cigarettes smoked can
be accounted for by weight after controlling for gender. Based on these analyses, gender is a
confounding factor because the association between weight and cigarettes smoked is no longer
significant after accounting for gender.
Logistic Regression
Your outcome variable must be coded as 1 (event) and 0 (non-event). Recoding this way ensures you are
predicting the presence of your categorical variable and not the absence of it.
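If your outcome is currently stored as a labeled factor rather than 0/1, a minimal recoding sketch follows. The variable poverty_label and its level name are hypothetical; adapt them to your codebook:
# Recode so that 1 = event (below poverty) and 0 = non-event
addhealth$poverty <- ifelse(addhealth$poverty_label == "Below poverty level", 1, 0)

# Check the recode against the original variable before modeling
table(addhealth$poverty_label, addhealth$poverty, useNA = "ifany")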
1. Identify variables
• Binary outcome: Poverty (variable poverty). This is an indicator if reported personal income is below
$10,210.
• Binary predictor: Ever smoked a cigarette (variable eversmoke_c)
• Binary confounder: Gender (variable female_c)
2. State hypotheses
• Null hypothesis: There is no relationship between the probability of living below the poverty level and
whether a person has ever smoked a cigarette.
• Alternative hypothesis: There is a relationship between the probability of living below the poverty level
and whether a person has ever smoked a cigarette.
• Confounding hypothesis: There is still a relationship between the probability of living below the poverty
level and whether a person has ever smoked a cigarette after controlling for gender.
3. Fit the simple model
Fit the logistic regression model (a.k.a. generalized linear model) of the response variable on the explanatory
variable. Decide whether to reject the null hypothesis in favor of the alternative.
log.mod.1 <- glm(poverty~eversmoke_c, data=addhealth, family='binomial')
summary(log.mod.1)
##
## Call:
## glm(formula = poverty ~ eversmoke_c, family = "binomial", data = addhealth)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7064 -0.7064 -0.7064 -0.6210 1.8659
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.54800 0.06444 -24.021 < 2e-16 ***
## eversmoke_cSmoked at least once 0.28711 0.07737 3.711 0.000207 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4907.4 on 4834 degrees of freedom
## Residual deviance: 4893.3 on 4833 degrees of freedom
## (1669 observations deleted due to missingness)
## AIC: 4897.3
##
## Number of Fisher Scoring iterations: 4
The b1 estimate of the regression coefficient for eversmoke_c is significant (p = 0.0002). There is
reason to believe that smoking status is associated with the probability of living below the poverty level.
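Note that the 0.287 estimate is on the log-odds scale; exponentiating it gives the unadjusted odds ratio for the simple model (a quick optional check):
# Odds ratio and 95% CI for ever smoking in the simple model
exp(coef(log.mod.1))     # exp(0.287) is roughly 1.33
exp(confint(log.mod.1))  # profile-likelihood intervals on the odds-ratio scale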
4. Fit the multivariable model
Fit the same logistic regression model and include the potential confounding variable. This is only done if
there is a significant relationship between the explanatory and response variable. Determine if
the third variable is a confounder.
log.mod.2 <- glm(poverty~eversmoke_c + female_c, data=addhealth, family='binomial')
summary(log.mod.2)
##
## Call:
## glm(formula = poverty ~ eversmoke_c + female_c, family = "binomial",
## data = addhealth)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8335 -0.7048 -0.5652 -0.4716 2.1221
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.14045 0.08662 -24.71 < 2e-16 ***
## eversmoke_cSmoked at least once 0.38725 0.07886 4.91 9.09e-07 ***
## female_cFemale 0.87450 0.07690 11.37 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4906.9 on 4833 degrees of freedom
## Residual deviance: 4755.4 on 4831 degrees of freedom
## (1670 observations deleted due to missingness)
## AIC: 4761.4
##
## Number of Fisher Scoring iterations: 4
The regression coefficient estimate for eversmoke_c is still significant (p < .0001) after controlling
for gender. Thus gender is not a confounder.
5. Interpret the Odds Ratio estimates
Below I create a table containing the odds ratio estimates and 95% CI for those estimates using the
multivariable model.
# For your assignment - replace the saved model object `log.mod.2` with whatever YOU named this model.
library(knitr)                     # provides kable() for printing tables
or.ci <- exp(confint(log.mod.2))   # 95% CIs, exponentiated to the odds-ratio scale
kable(
  data.frame(
    OR  = exp(coef(log.mod.2)),
    LCL = or.ci[, 1],
    UCL = or.ci[, 2]
  ),
  digits = 2, align = 'ccc')
OR LCL UCL
(Intercept) 0.12 0.10 0.14
eversmoke_cSmoked at least once 1.47 1.26 1.72
female_cFemale 2.40 2.06 2.79
• After controlling for gender, smokers have 1.5 (1.3, 1.7) times the odds of reporting income below the
federal poverty level compared to non-smokers.
• After controlling for smoking status, females have 2.4 (2.1, 2.8) times the odds of reporting annual
earned wages below the federal poverty level compared to males.
6. Template Conclusion
Replace the bold words with your variables, the highlighted words with data from your analysis, and
choose between conclusion options.
After adjusting for the potential confounding factor of third variable, explanatory variable
(OR odds ratio estimate, CI confidence interval range, p = significance value) was significantly/not
significantly and positively/negatively associated with the likelihood of response variable. In this
analysis, the odds ratio tells us that those who are [describe what dummy code 1 of your explanatory
variable means here] are odds ratio estimate times more (if OR greater than 1)/less (if OR less than 1)
likely to [describe what dummy code 1 of your response variable means here]. Based on these analyses,
third variable is not/is a confounding factor because the association between explanatory and response
is still/is no longer significant after accounting for third variable.
So the conclusion for this analysis reads:
After adjusting for the potential confounding factor of gender, smoking status (1.47, CI 1.26-1.72,
p <.0001) was significantly and positively associated with the likelihood of earning under the poverty
level. In this analysis, the odds ratio tells us that those who have ever smoked are 1.47 times more likely
to earn income below the federal poverty level. Based on these analyses, gender is not a confounding
factor because the association between smoking and poverty status is still significant after accounting for
gender.
Categorical predictors
For any of the regression models above, or a new model if you choose, add a categorical variable with more
than 2 levels.
1. Identify variables
• Outcome: BMI (variable BMI). This is a quantitative measure.
• Predictor: Income (variable income). This is a quantitative measure.
• Predictor: general health (variable genhealth). This is a categorical measure with levels: Excellent
(reference), Very good, Good, Fair and Poor.
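R treats the first level of a factor as the reference group, so it is worth confirming (or explicitly setting) that Excellent comes first before fitting; a sketch, assuming genhealth is stored as a factor:
# The first level listed is the reference group
levels(addhealth$genhealth)

# Explicitly set "Excellent" as the reference if it is not already first
addhealth$genhealth <- relevel(addhealth$genhealth, ref = "Excellent")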
2. Write the mathematical model.
Define what each x is, and write the mathematical model. State what group is the reference group.
• Let x1 be income
• Let x2 = 1 when genhealth='Very good', and 0 otherwise,
• let x3 = 1 when genhealth='Good', and 0 otherwise,
• let x4 = 1 when genhealth='Fair', and 0 otherwise,
• let x5 = 1 when genhealth='Poor', and 0 otherwise.
The reference group for genhealth is Excellent.
The mathematical model would look like:
Y ∼ β0 + β1x1 + β2x2 + β3x3 + β4x4 + β5x5
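If you want to see exactly how R builds the dummy variables x2 through x5 from genhealth, you can inspect the design matrix it will use (an optional check, not a required step):
# One 0/1 column per non-reference level of genhealth, plus the intercept and income
head(model.matrix(~ income + genhealth, data = addhealth))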
3. Fit the multivariable model.
Print out the coefficients and 95% CI’s.
gh.model <- lm(BMI~income + genhealth, data=addhealth)
summary(gh.model)
##
## Call:
## lm(formula = BMI ~ income + genhealth, data = addhealth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.837 -4.802 -1.091 3.441 39.132
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.652e+01 3.298e-01 80.409 < 2e-16 ***
## income -4.735e-06 4.686e-06 -1.010 0.312
## genhealthVery good 1.602e+00 3.100e-01 5.166 2.52e-07 ***
## genhealthGood 4.758e+00 3.245e-01 14.664 < 2e-16 ***
## genhealthFair 6.917e+00 5.039e-01 13.726 < 2e-16 ***
## genhealthPoor 9.350e+00 1.392e+00 6.717 2.13e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.967 on 3771 degrees of freedom
## (2727 observations deleted due to missingness)
## Multiple R-squared: 0.09217, Adjusted R-squared: 0.09096
## F-statistic: 76.57 on 5 and 3771 DF, p-value: < 2.2e-16
round(confint(gh.model),1) %>% kable()
2.5 % 97.5 %
(Intercept) 25.9 27.2
income 0.0 0.0
genhealthVery good 1.0 2.2
genhealthGood 4.1 5.4
genhealthFair 5.9 7.9
genhealthPoor 6.6 12.1
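Because the income coefficient is so small on the per-dollar scale, rounding to one decimal place displays its interval as (0.0, 0.0). To see the actual bounds, print that row unrounded:
# Unrounded 95% CI for income (per-dollar scale)
confint(gh.model)["income", ]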
4. Interpret the regression coefficients.
• b0: The predicted BMI for individuals with no income and excellent health is 26.5 (25.9, 27.2).
• b1: After controlling for general health, for every additional $1 a person makes annually, their predicted
BMI decreases by 0.0000047. This is not a significant relationship. A more meaningful interpretation
uses a larger unit: for every additional $1,000,000 in annual income, predicted BMI decreases by about
4.7 (equivalently, about 0.005 per $1,000); see the sketch after this list for a rescaled version of the model.
• b2: Those reporting very good health have 1.6 (0.99, 2.2, p<.0001) higher BMI compared to those
reporting excellent health.
• b3: Those reporting good health have 4.8 (4.1, 5.4, p<.0001) higher BMI compared to those reporting
excellent health.
• b4: Those reporting fair health have 6.9 (5.9, 7.9, p<.0001) higher BMI compared to those reporting
excellent health.
• b5: Those reporting poor health have 9.4 (6.6, 12.1, p<.0001) higher BMI compared to those reporting
excellent health.
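As mentioned in the interpretation of b1, a per-dollar coefficient is hard to read. One option (a sketch, not a required step) is to rescale income inside the formula so the coefficient is per $10,000 of annual income:
# Same model, with income measured in units of $10,000
gh.model.10k <- lm(BMI ~ I(income / 10000) + genhealth, data = addhealth)
summary(gh.model.10k)   # the income coefficient is now 10,000 times the per-dollar estimate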
5. Conclusion
After controlling for general health, income is not significantly associated with BMI. General health is
significantly associated with BMI; average BMI increases as reported general health declines.
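If you want a single test of whether general health as a whole belongs in the model (rather than judging each dummy variable separately), drop1() gives the multi-degree-of-freedom F test for removing each term (an optional check):
# F tests for dropping income and for dropping the 4-df genhealth term
drop1(gh.model, test = "F")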