SCHOOL OF MATHEMATICS AND STATISTICS
MATH3821 Statistical Modelling and Computing
Assignment One
Number of exercises: 5 (one per page)
INSTRUCTIONS: This assignment is to be done by a group of at most 5 students. The same mark
will be given to each student within the group, unless I have good reasons to believe that somebody did
not contribute appropriately. It is strongly advised that you use the RStudio software and its File/New
File/R Markdown.../PDF capability to produce a PDF file that you will submit on Moodle (see instructions
on Moodle close to due date). The computing language you will be using is called RMarkdown (see the
first few lessons starting here https://rmarkdown.rstudio.com/lesson-1.html for a quick introduction). For
typesetting mathematical formulae, you will need to have a distribution of the LaTeX software installed on
your computer (e.g., TeX Live or MiKTeX). For Microsoft Windows and Unix users, you might consider using
the install_tinytex() function from the R package tinytex. Another function (Microsoft Windows only)
is install.MikTeX() from the installr R package. macOS users should consider installing the MacTeX
software directly; this is not an R package (see https://www.tug.org/mactex/).
Only one student in each group should submit the PDF file, with the names of the other students in the group
clearly indicated in the document.
We declare that this assessment item is our own work, except where acknowledged, and has not been submitted
for academic credit elsewhere. We acknowledge that the assessor of this item may, for the purpose of assessing
this item reproduce this assessment item and provide a copy to another member of the University; and/or
communicate a copy of this assessment item to a plagiarism checking service (which may then retain a copy
of the assessment item on its database for the purpose of future plagiarism checking). We certify that we
have read and understood the University Rules in respect of Student Academic Misconduct.
Name Student No Signature Date
Question One
Recall the Simple Linear Regression (SLR) model $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, where $\varepsilon_i \sim N(0, \sigma^2)$.
(a) Show that the SLR model can be expressed in the following form:
$Y_i = \alpha + \beta_1 (x_i - \bar{x}) + \varepsilon_i$.
(b) Provide an interpretation for the parameter α.
(c) Find a closed-form formula for the least squares parameter estimates $\hat{\alpha}$ and $\hat{\beta}_1$.
(d) Find the variances of the estimates $\hat{\alpha}$ and $\hat{\beta}_1$ and the covariance between them, $\mathrm{Cov}(\hat{\alpha}, \hat{\beta}_1)$.
(e) Using the method of gradient descent with γ = 0.00001, find the estimates for the following simulated
data:
set.seed(1234567)
x = runif(1000)
eps = rnorm(1000)
y = 5 + 10*x + eps
Start the algorithm at the initial value $(\alpha^{[0]}, \beta_1^{[0]}) = (0, 0)$. Use the convergence criterion that the $L_2$
norm of the score is less than 0.00001. Show that the results are comparable to the closed-form
solutions found in (c). Report the number of iterations required. Make sure that you provide all the
workings/derivations that are needed to implement the above algorithm.
(f) Plot the data and include the fitted regression line.
(g) Using the Newton-Raphson method, find the estimates for the same simulated data. Use the same initial
value $(\alpha^{[0]}, \beta_1^{[0]}) = (0, 0)$ and convergence criterion as in (e). Did it take more or fewer iterations than part
(e)? Why? Make sure that you provide all the workings/derivations that are needed to implement the
above algorithm.
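As an illustration of the iteration described in part (e), the following is a minimal sketch only, not a model answer. It assumes that the objective being minimised is the residual sum of squares of the centred model, so that the "score" is taken to be its gradient; a likelihood-based score would differ by a factor involving $\sigma^2$. The names xc, grad_rss, theta, tol and the iteration cap are hypothetical choices made for this sketch.

```r
# Simulated data from part (e)
set.seed(1234567)
x <- runif(1000)
eps <- rnorm(1000)
y <- 5 + 10 * x + eps

xc <- x - mean(x)        # centred predictor (x_i - xbar)
gamma <- 0.00001         # step size given in part (e)
tol <- 0.00001           # threshold on the L2 norm of the gradient
theta <- c(alpha = 0, beta1 = 0)   # initial value (0, 0)

# ASSUMPTION: objective is the residual sum of squares
# S(alpha, beta1) = sum((y - alpha - beta1 * xc)^2); this is its gradient.
grad_rss <- function(theta) {
  r <- y - theta[1] - theta[2] * xc
  c(-2 * sum(r), -2 * sum(r * xc))
}

iter <- 0
g <- grad_rss(theta)
# The cap of 1e6 iterations is a safeguard added for this sketch
while (sqrt(sum(g^2)) >= tol && iter < 1e6) {
  theta <- theta - gamma * g   # gradient descent update
  g <- grad_rss(theta)
  iter <- iter + 1
}

print(theta)              # gradient descent estimates
print(iter)               # number of iterations used
print(coef(lm(y ~ xc)))   # least squares fit for comparison
```

Because the predictor is centred, the two components of the gradient decouple, which is one reason the centred parametrisation is convenient here.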
Question Two
Consider n independent binary random variables $Y_1, \ldots, Y_n$ with
$P(Y_i = 1) = \pi_i$ and $P(Y_i = 0) = 1 - \pi_i$.
The probability function of $Y_i$ is:
$\pi_i^{Y_i} (1 - \pi_i)^{1 - Y_i}$,
where $Y_i = 0$ or 1.
(a) Show that this probability function belongs to the exponential family of distributions.
(b) Show that the natural parameter is
$\log\left(\frac{\pi_i}{1 - \pi_i}\right)$.
(c) Show that $E(Y_i) = \pi_i$ using the cumulant generator $c(\theta)$ in the definition of the exponential family.
(d) Suppose the link function is
$g(\pi) = \log\left(\frac{\pi}{1 - \pi}\right) = x^T \beta$.
Show that this is equivalent to modelling the probability π as
$\pi = \frac{e^{x^T \beta}}{1 + e^{x^T \beta}}$.
(e) Sketch the graph of π against x for the particular case $x^T \beta = \beta_1 + \beta_2 x$, where $\beta_1$ and $\beta_2$ are constants.
How would you interpret this graph if x is the dose of an insecticide and π is the probability of an
insect dying?
(f) Does the following probability density function
$f(y; \theta) = \frac{1}{\phi} \exp\left[ \frac{y - \theta}{\phi} - \exp\left( \frac{y - \theta}{\phi} \right) \right]$,
where $\phi > 0$ is regarded as a nuisance parameter, belong to the exponential family?
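The inverse-link relationship in part (d) can be checked numerically and plotted for part (e). The sketch below uses hypothetical coefficient values (b1 = -3, b2 = 1.5) and a hypothetical dose grid chosen purely for illustration; nothing in the question fixes these numbers.

```r
# ASSUMPTION: illustrative coefficients and dose range, not taken from the question
b1 <- -3
b2 <- 1.5
dose <- seq(0, 4, length.out = 101)

eta <- b1 + b2 * dose                   # linear predictor x^T beta
p_manual <- exp(eta) / (1 + exp(eta))   # inverse logit written out as in part (d)
p_builtin <- plogis(eta)                # R's built-in logistic distribution function

plot(dose, p_manual, type = "l",
     xlab = "x (dose)", ylab = "pi",
     main = "Logistic curve for hypothetical b1 and b2")
abline(h = 0.5, lty = 2)         # probability one half
abline(v = -b1 / b2, lty = 2)    # dose at which the linear predictor is zero
```

With a positive slope the curve is an increasing S-shape; the dashed vertical line marks the dose at which the modelled probability equals one half.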
Question Three
The Titanic was a British luxury passenger liner that sank when it struck an iceberg about 640 km south of
Newfoundland on April 14–15, 1912, on its maiden voyage to New York City from Southampton, England.
The data in the file titanic.txt (available in the assignment section on Moodle) classify the people on
board the ship according to their Sex, Age, and Class (first, second, or third).
(a) Read the file titanic.txt (see Moodle) into a variable called titanic. Display the first six lines of
titanic and then provide a summary of the variables in the dataset using summary.
(b) Compute the number of men and women on the Titanic. Calculate the survival rates for each sex.
Conduct a test of whether the survival rates for men and women are the same against the
alternative that they are different. What are the hypotheses, test statistic, p-value and conclusion from
the test?
(c) Fit a logistic regression model with response Survived and predictor Age, and provide an interpretation
for the fitted coefficient for Age using the odds ratio with a factor change and a standardized factor
change in the variable Age.
(d) Plot the graph of Survived versus Age. Then add both a fitted logistic curve and a loess smoother
to the graph. Explain what the differences are between these two fits. Fit again, but this time, add a
quadratic term in Age. Does the fitted curve now match the smoother more accurately? Provide all
plots in a single graph, with correctly defined labels, titles and a legend.
(e) Use the method of scoring algorithm to compute an estimate of the parameters of the logistic regression
model with Survived as the response and Age and a quadratic term in Age as explanatory variables, and
provide your R code for it. You must also present the calculations that you used to come up with your
algorithm.
(f) Check that, using the code in (e), you obtain estimates of the coefficients numerically close to the ones
given by the glm() function.
(g) Write and provide R code to compute an estimate of the variance-covariance matrix of the
corresponding estimators (using the first approach presented in the slide entitled “Estimation of the
variance” in Chapter 2).
(h) Check the numerical closeness of the result obtained using your code from (g) to the one you get when
using the vcov() function.
(i) Fit the logistic regression model with terms for an intercept, Age, $\mathrm{Age}^2$, Sex, and PClass. Obtain tests
on the basis of the deviance for adding each of the terms to a mean function that already includes the
other terms (in the order given above), and summarize the results of each of the tests via a p-value and
a one-sentence summary of the results.
(j) Provide a plot that interprets the relationship between Age, Sex and their Survival rates. Make sure
that you include titles with a legend.
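The scoring iteration and the checks in parts (e) to (h) can be sketched on simulated data. Everything below is a hypothetical stand-in: the covariate z, the coefficients beta_true and the sample size are invented for illustration, and titanic.txt is not read. The sketch also assumes that the variance estimate is the inverse of the Fisher information at the final estimate, which may or may not be the "first approach" on the Chapter 2 slide.

```r
# ASSUMPTION: simulated stand-in data; z plays the role of a numeric predictor
set.seed(1)
n <- 500
z <- rnorm(n)
X <- cbind(1, z, z^2)                  # intercept, linear and quadratic terms
beta_true <- c(-0.5, 1, -0.3)          # hypothetical coefficients
y <- rbinom(n, 1, as.vector(plogis(X %*% beta_true)))

beta <- rep(0, ncol(X))                # starting value
for (iter in 1:50) {
  p <- as.vector(plogis(X %*% beta))   # fitted probabilities
  U <- t(X) %*% (y - p)                # score vector
  Info <- t(X) %*% (X * (p * (1 - p))) # Fisher information X^T W X
  step <- as.vector(solve(Info, U))
  beta <- beta + step                  # scoring update
  if (sqrt(sum(step^2)) < 1e-10) break
}

# ASSUMPTION: variance estimate taken as the inverse information at the estimate
p <- as.vector(plogis(X %*% beta))
V <- solve(t(X) %*% (X * (p * (1 - p))))

fit <- glm(y ~ z + I(z^2), family = binomial)
print(cbind(scoring = beta, glm = coef(fit)))
print(max(abs(V - vcov(fit))))         # largest discrepancy from vcov()
```

For the logit link, which is canonical, the observed and expected information coincide, so Fisher scoring and Newton-Raphson give the same updates in this sketch.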
Question Four
In this question we will examine binomial response data. Consider the single response Y with Y ∼
binomial(n, π).
(a) Find the Wald statistic $(\hat{\pi} - \pi) I(\pi) (\hat{\pi} - \pi)$, where $\hat{\pi}$ is the maximum likelihood estimator of π and $I(\pi)$
is the information.
(b) Verify that the Wald statistic is the same as the score statistic $U^T I(\pi)^{-1} U$ in this case.
(c) Find the deviance
$2\left[\log L(\hat{\pi}; y) - \log L(\pi; y)\right]$.
(d) For large samples, both the Wald/score statistic and the deviance approximately have the $\chi^2(1)$
distribution. For n = 10 and y = 3, use both statistics to assess the adequacy of the models:
(i) π = 0.1;
(ii) π = 0.3;
(iii) π = 0.5.
Do the two statistics lead to the same conclusions?
(e) Give the three parts of the GLM for the binomial regression model with a fixed number of trials:
• state the law of Y ;
• prove it is a member of the exponential family;
• give the parameters (notably the mean $\mu_i$) and the canonical link function.
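The two statistics in part (d) can be evaluated numerically once their forms are known. The sketch below assumes the standard binomial results $\hat{\pi} = y/n$ and $I(\pi) = n / (\pi(1 - \pi))$, which parts (a) to (c) ask you to derive; the function names wald_stat and deviance_stat are hypothetical and the code is an aid for checking hand calculations, not a substitute for the derivations.

```r
# ASSUMPTION: pi_hat = y / n and I(pi) = n / (pi * (1 - pi)) for a binomial response
wald_stat <- function(y, n, pi0) {
  pi_hat <- y / n
  (pi_hat - pi0)^2 * n / (pi0 * (1 - pi0))
}

# Deviance as twice the difference in binomial log-likelihoods
deviance_stat <- function(y, n, pi0) {
  pi_hat <- y / n
  2 * (dbinom(y, n, pi_hat, log = TRUE) - dbinom(y, n, pi0, log = TRUE))
}

n <- 10
y <- 3
pi0 <- c(0.1, 0.3, 0.5)   # hypothesised values from part (d)

res <- data.frame(
  pi0 = pi0,
  wald = wald_stat(y, n, pi0),
  deviance = deviance_stat(y, n, pi0)
)
# Upper-tail chi-squared(1) probabilities for each statistic
res$p_wald <- pchisq(res$wald, df = 1, lower.tail = FALSE)
res$p_deviance <- pchisq(res$deviance, df = 1, lower.tail = FALSE)
print(res)
```

The binomial coefficient cancels in the log-likelihood difference, so using dbinom() with log = TRUE gives the same deviance as the kernel of the likelihood alone.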