Starting from:

$30

Generalised Linear Models Computer Lab 3

MATH38172 Generalised Linear Models
Computer Lab 3
Fitting GLMs with various response distributions
Recall that the assumptions of a GLM are that
Yi
indep ∼ F(θi
, φ/wi) F an exponential dispersion family
µi = E(Yi) µ the expected response
g(µi) = ηi g the link function
ηi = β1x1 + . . . + βpxp η the linear predictor
We have already seen how to fit two types of GLM using the glm command: Poisson response models with a
log link (using family=poisson), and Binomial response models with a logit link (using family=binomial).
More generally, we can fit models with other response distribution families, e.g.
• glm(y~x, family=Gamma) for a Gamma GLM
• glm(y~x, family=inverse.gaussian) for an inverse Gaussian response
The gamma and inverse-Gaussian distributions are commonly used to model positive continuous responses.
They may be appropriate if the variance increases with the mean, since V (µ) = µ
2
for the Gamma distribution
and V (µ) = µ
3
for the inverse-Gaussian.
By default, R uses the canonical link function, or a scalar multiple, e.g.:
Family Canonical link Equation R link
Binomial proportion logit g(µ) = log µ
1−µ
Poisson log g(µ) = log µ
Gamma reciprocal g(µ) = −1/µ g(µ) = 1/µ
Inverse-Gaussian inverse-square g(µ) = −1/(2µ
2
) g(µ) = 1/µ2
Note that, for a gamma response, using R’s link rather than the true canonical link corresponds to multiplying
the βj by −1.
Different link functions can be specified using the link argument as follows:
• glm(y~x, family=binomial(link="probit"))
• glm(y~x, family=Gamma(link="log"))
• glm(y~x, family=Gamma(link="identity"))
For full details of options, see the R help file ?family
For Gamma responses, the canonical link has the problem that µ = 1/η is negative for some values of η. As
the expectation of a gamma distribution must be positive, this leads to constraints on the parameters and
explanatory variables. However, the log link does not suffer from this problem, as µ = g
−1
(η) = e
η
is always
positive. As a result the log link is a common choice for the Gamma distribution.
1
Model comparison
Hypothesis tests
Recall that if we wish to compare two GLMs with the same response distribution but different (nested) linear
predictors, e.g.
Model A: η = β1x1 + · · · + βpA xpA
Model B: η = β1x1 + · · · + βpA xpA + · · · + βpB xpB
then this can be done by testing the null hypothesis H0 : Model A is correct versus H1 : Model B is correct.
Recall that for some exponential families (e.g. Binomial or Poisson) the dispersion parameter is known
(e.g. φ = 1). As we have seen in lectures, in this case, the fit of Model A and Model B can be compared using
a likelihood ratio test:
Reject H0 if L = 2{`B − `A} =
1
φ
[D(µˆ A, y) − D(µˆ B, y)] > χ2
α;pB−pA
.
This can easily be done in R using the anova command.
Example For the Small Business Administration data (SBA.csv), compare the binary-response logistic
regression models with the following two linear predictors:
Model A: η ∼ Portion
Model B: η ∼ Portion + Recession + RealEstate
SBA <- read.csv("SBA.csv")
fitA <- glm(Default~Portion, family=binomial, data=SBA)
fitB <- update(fitA, ~.+ Recession +RealEstate)
Note the use of the update command to include additional variables in the linear predictor of Model A. Then
conduct the test:
anova(fitA,fitB,test="Chisq")
## Analysis of Deviance Table
##
## Model 1: Default ~ Portion
## Model 2: Default ~ Portion + Recession + RealEstate
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 2100 2340.7
## 2 2098 2222.9 2 117.79 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We see that the additional terms in the linear predictor significantly improve the model; the p-value of the
test is < 2.2 × 10−16, so we would reject H0 at any reasonable significance level.
It can sometimes be useful to directly access the deviance of a given model, and/or its residual degrees of
freedom n − p. This can be done via
deviance(fitA); df.residual(fitA)
## [1] 2340.664
## [1] 2100
2
For other families (e.g. Gamma/inverse-Gaussian), φ is unknown. As we have seen in lectures, in this case,
the fit of Model A and Model B can be compared using analysis of deviance, i.e.
Reject H0 if F =
[D(µˆ A, y) − D(µˆ B, y)]/(pB − pA)
φˆB
> Fα;(pB−pA),(n−pB)
,
where φˆB is an estimate of φ computed using Model B, usually the Pearson estimate. Again this can be done
easily in R using the anova command in R.
Example For the house price data (Houses.csv), using a Gamma response GLM with price as the response
and identity link, compare the following two linear predictors:
Model A η ∼ size
Model B η ∼ size + new + size:new
Houses <- read.csv("Houses.csv")
fitA <- glm(price~size, family=Gamma(link="identity"), data=Houses)
fitB <- glm(price~size+new+new:size, family=Gamma(link="identity"), data=Houses)
anova(fitA,fitB,test="F")
## Analysis of Deviance Table
##
## Model 1: price ~ size
## Model 2: price ~ size + new + new:size
## Resid. Df Resid. Dev Df Deviance F Pr(>F)
## 1 98 11.283
## 2 96 10.563 2 0.72071 3.2698 0.04229 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Thus the inclusion of the main effect of new and the interaction term new:size significantly improves the
model at the 5% significance level.
Exercises 1
1. For the SBA data:
a. Assess whether inclusion of the Portion:RealEstate and Portion:Recession interactions significantly
improves Model B.
b. Interpret the resulting model.
c. Try fitting a model that includes the main effects and two-factor interactions between Portion,
RealEstate, and Recession. What do you notice about the parameter estimates?
d. Why does this happen? [Hint: tabulate the combinations of values of the variables Recession and
RealEstate that are present in the data using table(SBA$Recession, SBA$RealEstate)]
2. Suppose that n = 300 patients are treated, with m = 100 patients allocated to each of three treatments.
Each patient either recovers or not. The sample recovery rates are 87%, 83%, and 78%, for Treatments
A, B, and C respectively. Using R, assess if there is significant evidence at the 5% significance level
that the probability of recovery depends on which treatment is applied.
3. The file nambeware.csv contains data about items produced by Nambe Mills, a tableware manufacturer.
The following variables are included: Type (item type), Diam (item diameter), Time (item polishing and
grinding time) and Price (item price).
a. Plot item price versus diameter and explain why a Gamma GLM may be suitable.
b. Using Gamma GLMs with a log link, assess whether there is significant evidence that the item
price depends on the item type, adjusting for item diameter and polishing and grinding time.
3

More products