Assignment #3 STA355H1S
Instructions: Solutions to problems 1 and 2 are to be submitted on Quercus (PDF files
only) – the deadline is 11:59pm on March 20. You are strongly encouraged to do problems
3 through 6 but these are not to be submitted for grading.
Problems to hand in:
1. Suppose that X1, · · · , Xn are independent Gamma random variables with pdf
$$f(x;\lambda,\alpha) = \frac{\lambda^{\alpha} x^{\alpha-1}\exp(-\lambda x)}{\Gamma(\alpha)} \qquad \text{for } x > 0$$
where λ > 0 and α > 0 are unknown parameters. Given X1 = x1, · · · , Xn = xn, the
likelihood function is
$$L(\lambda,\alpha) = \frac{\lambda^{n\alpha}\left(\prod_{i=1}^n x_i^{\alpha-1}\right)\exp\left(-\lambda\sum_{i=1}^n x_i\right)}{[\Gamma(\alpha)]^n}$$
(a) Assume the following prior distribution for (λ, α):
$$\pi(\lambda,\alpha) = \frac{1}{10000}\exp(-\lambda/100)\exp(-\alpha/100) \qquad \text{for } \lambda,\alpha > 0$$
Given X1 = x1, · · · , Xn = xn, show that the posterior density of α is
$$\pi(\alpha|x_1,\cdots,x_n) = K(x_1,\cdots,x_n)\,\frac{\Gamma(n\alpha+1)}{[\Gamma(\alpha)]^n}\,\exp\left(\alpha\sum_{i=1}^n \ln(x_i) - \frac{\alpha}{100}\right)\left(\frac{1}{100} + \sum_{i=1}^n x_i\right)^{-(n\alpha+1)}$$
(b) Data on intervals (in hours) between failures of air conditioning units on ten Boeing
aircraft are given (and analyzed) in Example T of Cox & Snell¹
(1981). These data (199
observations) are provided in a file on Quercus. Using the prior for (λ, α) in part (a), compute
the posterior distribution for α.
Note: To compute the posterior density, you will need to compute the normalizing constant
– on Quercus, I will give some suggestions on how to do this. A simple estimate of α is
$\hat{\alpha} = \bar{x}^2/s^2$ where $\bar{x}$ and $s^2$ are the sample mean and variance, respectively. We would expect
the posterior to be concentrated around this estimate.
¹Cox, D.R. and Snell, E.J. (1981) Applied Statistics: Principles and Examples. Chapman and Hall, New York.
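To make the note above concrete, one possible approach (a minimal R sketch, not the Quercus suggestions; it assumes the 199 observations are in a vector x) is to evaluate the unnormalized log-posterior on a grid of α values and normalize numerically:

  ## unnormalized posterior of alpha from part (a), computed on the log
  ## scale to avoid overflow in Gamma(n*alpha + 1)
  posterior.alpha <- function(alpha, x) {
    n <- length(x)
    logpost <- lgamma(n*alpha + 1) - n*lgamma(alpha) +
               alpha*sum(log(x)) - alpha/100 -
               (n*alpha + 1)*log(1/100 + sum(x))
    exp(logpost - max(logpost))   # rescale before exponentiating
  }
  alpha <- seq(0.5, 3, length=1000)               # grid centred near xbar^2/s^2
  post <- posterior.alpha(alpha, x)
  post <- post/(sum(post)*(alpha[2] - alpha[1]))  # Riemann-sum normalization
  plot(alpha, post, type="l")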
(c) [Bonus] We may be interested in “testing” whether an Exponential model is an appropriate model for the data. One approach to doing this is to put a prior probability θ on the
Exponential model (i.e. a Gamma model with α = 1) and prior probability 1−θ on the more
general Gamma model. This leads to a prior distribution on (λ, α) having a point mass θ
(which is a hyperparameter) at α = 1 so that
$$\pi(\lambda,1) = \frac{\theta}{100}\exp(-\lambda/100) \qquad \text{for } \lambda > 0$$
and
$$\pi(\lambda,\alpha) = \frac{1-\theta}{10000}\exp(-\lambda/100)\exp(-\alpha/100) \qquad \text{otherwise}$$
(This type of prior is an example of a “spike-and-slab” prior used in Bayesian model selection.) We then have
$$P(\alpha = 1|x_1,\cdots,x_n) = \frac{\int_0^\infty \pi(\lambda,1)\,L(\lambda,1)\,d\lambda}{\int_0^\infty \pi(\lambda,1)\,L(\lambda,1)\,d\lambda + \int_0^\infty\int_0^\infty \pi(\lambda,\alpha)\,L(\lambda,\alpha)\,d\lambda\,d\alpha}.$$
For θ = 0.1, 0.2, 0.3, · · · , 0.9, evaluate P(α = 1|x1, · · · , xn). (This is easier than it looks as
much of the work has been done in part (b).)
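Since the λ-integrals have the same closed form as in part (b), a minimal R sketch is possible (hedged: the variable names and the α-grid are assumptions, and x again holds the data):

  ## P(alpha = 1 | x) for a given theta, with a common rescaling constant
  ## handled on the log scale
  spike.prob <- function(theta, x, alpha=seq(0.5, 3, length=2000)) {
    n <- length(x); S <- sum(x)
    ## log of the lambda-integral under alpha = 1, with theta factored out
    log.num <- -log(100) + lgamma(n + 1) - (n + 1)*log(1/100 + S)
    ## log-integrand of the slab term, with (1 - theta)/10000 factored out
    logg <- lgamma(n*alpha + 1) - n*lgamma(alpha) +
            alpha*sum(log(x)) - alpha/100 - (n*alpha + 1)*log(1/100 + S)
    c0 <- max(logg, log.num)                          # common rescaling constant
    slab <- sum(exp(logg - c0))*(alpha[2] - alpha[1])/10000
    num <- theta*exp(log.num - c0)
    num/(num + (1 - theta)*slab)
  }
  round(sapply(seq(0.1, 0.9, by=0.1), spike.prob, x=x), 3)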
2. Suppose that F is a continuous distribution with density f. The mode of the distribution
here is defined to be the global maximizer of the density function. (In other applications, it
is useful to think of modes as local maxima of the density function.)
In some applications (for example, when the data are “contaminated” with outliers), the
centre of the distribution is better described by the mode than by the mean or median;
however, unlike the mean and median, the mode turns out to be a difficult parameter to
estimate. There is a document on Quercus that discusses a few of the various methods for
estimating the mode.
The τ-shorth is the shortest interval that contains at least a fraction τ of the observations X1, · · · , Xn; it will have the form $[X_{(a)}, X_{(b)}]$ where b − a + 1 ≥ τn. A number of mode estimators are based on the observations lying in $[X_{(a)}, X_{(b)}]$; for example, we can take the sample mean or the sample median of these observations. (Note that taking the sample mean of the observations in $[X_{(a)}, X_{(b)}]$ is like a trimmed mean, although the trimming here is typically asymmetrical.) In this problem, we will consider estimating the mode using the midpoint of the interval, that is, $\hat{\mu} = (X_{(a)} + X_{(b)})/2$. This estimator is called a Venter estimator. An R function to compute this estimator for a given value of τ is available on Quercus in a file venter.txt.
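For reference, here is a minimal sketch of what such a function might look like (the actual venter.txt on Quercus may differ in details):

  ## Venter estimator: midpoint of the shortest interval holding a
  ## fraction tau of the sorted observations
  venter <- function(x, tau=0.5) {
    x <- sort(x)
    n <- length(x)
    k <- ceiling(tau*n)              # each candidate interval holds k points
    widths <- x[k:n] - x[1:(n - k + 1)]
    a <- which.min(widths)           # left endpoint of the shortest interval
    (x[a] + x[a + k - 1])/2          # midpoint of [X_(a), X_(b)]
  }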
(a) For the Hidalgo stamp thickness data considered in Assignment #2, compute Venter
estimates for various values of τ . How small does τ need to be in order that the estimate
“makes sense”? (Recall from Assignment #2 that the density seemed to have a number of
local maxima but one clear global maximum.)
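For instance, a minimal usage sketch (assuming the stamp thicknesses are in a vector thick and a venter() function like the sketch above is loaded):

  taus <- c(0.5, 0.4, 0.3, 0.2, 0.1, 0.05)
  sapply(taus, function(tau) venter(thick, tau))  # one estimate per tau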
(b) Suppose that the underlying density f is asymmetric and unimodal. The choice of τ for
the Venter estimator involves a bias-variance tradeoff: If τ is small then we should expect the
estimator to have a small bias but larger variance with the bias increasing and the variance
decreasing as τ increases.
Suppose that X1, · · · , Xn are independent Gamma random variables with density
$$f(x;\alpha) = \frac{x^{\alpha-1}\exp(-x)}{\Gamma(\alpha)}$$
where we will assume that α > 1; in this case, the mode is α − 1.
Using Monte Carlo simulation, estimate the MSE of the Venter estimator for τ = 0.5 and
τ = 0.1, sample sizes n = 100 and n = 1000, and shape parameters α = 2 and α = 10. (8
simulations in total.) For α = 2 and α = 10, which Venter estimator seems to be better (on
the basis of MSE)?
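A minimal sketch of this simulation (assuming a venter() function like the one sketched above is loaded; N, the number of Monte Carlo replications, is an arbitrary choice):

  ## estimated MSE of the Venter estimator for one (n, alpha, tau) setting
  mse.venter <- function(n, alpha, tau, N=1000) {
    est <- replicate(N, venter(rgamma(n, shape=alpha), tau))
    mean((est - (alpha - 1))^2)      # the true mode of Gamma(alpha) is alpha - 1
  }
  settings <- expand.grid(n=c(100, 1000), alpha=c(2, 10), tau=c(0.5, 0.1))
  settings$mse <- mapply(mse.venter, settings$n, settings$alpha, settings$tau)
  settings                           # the 8 settings and their estimated MSEs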
(c) The density of the Venter estimator $\hat{\mu}$ under the Gamma model in part (b) can be estimated using the fact that $\hat{\mu}/(X_1 + \cdots + X_n)$ and $X_1 + \cdots + X_n$ are independent. (This fact follows from Basu's Theorem, which you may encounter in STA452/453.)
We can exploit this independence as follows: Define T = X1 + · · · + Xn; T has a Gamma
distribution with shape parameter nα. Then
$$P(\hat{\mu} \le x) = P\left(\frac{\hat{\mu}}{T} \le \frac{x}{T}\right) = \int_0^\infty P\left(\frac{\hat{\mu}}{T} \le \frac{x}{T}\,\middle|\,T = t\right)\frac{t^{n\alpha-1}\exp(-t)}{\Gamma(n\alpha)}\,dt$$
$$= \int_0^\infty P\left(\frac{\hat{\mu}}{T} \le \frac{x}{t}\,\middle|\,T = t\right)\frac{t^{n\alpha-1}\exp(-t)}{\Gamma(n\alpha)}\,dt = \int_0^\infty P\left(\frac{\hat{\mu}}{T} \le \frac{x}{t}\right)\frac{t^{n\alpha-1}\exp(-t)}{\Gamma(n\alpha)}\,dt,$$
where the last equality uses the independence of $\hat{\mu}/T$ and $T$.
Thus given $(\hat{\mu}_1, T_1),\cdots,(\hat{\mu}_N, T_N)$, we can estimate
$$P\left(\frac{\hat{\mu}}{T} \le \frac{x}{t}\right) \qquad \text{by} \qquad \frac{1}{N}\sum_{i=1}^N I\left(\frac{\hat{\mu}_i}{T_i} \le \frac{x}{t}\right) = \frac{1}{N}\sum_{i=1}^N I\left(t \le \frac{T_i}{\hat{\mu}_i}\,x\right)$$
and so we can estimate $P(\hat{\mu} \le x)$ by
$$\frac{1}{N}\sum_{i=1}^N \int_0^{xT_i/\hat{\mu}_i} \frac{t^{n\alpha-1}\exp(-t)}{\Gamma(n\alpha)}\,dt$$
and differentiating, we obtain the density estimate
$$\hat{f}(x) = \frac{1}{N}\sum_{i=1}^N \frac{T_i}{\hat{\mu}_i}\left(\frac{T_i}{\hat{\mu}_i}\,x\right)^{n\alpha-1}\frac{\exp(-xT_i/\hat{\mu}_i)}{\Gamma(n\alpha)}$$
Using the simulation data in part (b) for n = 100, α = 2 and τ = 0.5, compute $\hat{f}(x)$.
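A minimal sketch of this computation (assuming muhat and Tvec hold the N simulated values of $\hat{\mu}_i$ and $T_i$ from part (b)):

  ## density estimate fhat(x): an average of rescaled Gamma(n*alpha) densities
  fhat <- function(x, muhat, Tvec, n, alpha) {
    r <- Tvec/muhat
    mean(r*dgamma(x*r, shape=n*alpha))
  }
  xgrid <- seq(0.6, 1.6, length=200)   # grid around the true mode alpha - 1 = 1
  plot(xgrid, sapply(xgrid, fhat, muhat=muhat, Tvec=Tvec, n=100, alpha=2),
       type="l", xlab="x", ylab="density estimate")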
Supplemental problems (not to hand in):
3. Suppose that X1, · · · , Xn are independent random variables with density or mass function f(x; θ) and suppose that we estimate θ using the maximum likelihood estimator $\hat{\theta}$; we estimate its standard error using the observed Fisher information estimator
$$\widehat{\text{se}}(\hat{\theta}) = \left\{-\sum_{i=1}^n \ell''(X_i;\hat{\theta})\right\}^{-1/2}$$
where $\ell'(x;\theta)$ and $\ell''(x;\theta)$ are the first two partial derivatives of ln f(x; θ) with respect to θ.
Alternatively, we could use the jackknife to estimate the standard error of $\hat{\theta}$; if our model is correct then we would expect (hope) that the two estimates are similar. In order to investigate this, we need to be able to get a good approximation to the "leave-one-out" estimators $\{\hat{\theta}_{-i}\}$.
(a) Show that $\hat{\theta}_{-i}$ satisfies the equation
$$\ell'(X_i;\hat{\theta}_{-i}) = \sum_{j=1}^n \ell'(X_j;\hat{\theta}_{-i}).$$
(b) Expand the right-hand side in (a) in a Taylor series around $\hat{\theta}$ to show that
$$\hat{\theta}_{-i} - \hat{\theta} \approx \frac{\ell'(X_i;\hat{\theta}_{-i})}{\sum_{j=1}^n \ell''(X_j;\hat{\theta})} \approx \frac{\ell'(X_i;\hat{\theta})}{\sum_{j=1}^n \ell''(X_j;\hat{\theta})}$$
and so
$$\hat{\theta}_\bullet = \frac{1}{n}\sum_{i=1}^n \hat{\theta}_{-i} \approx \hat{\theta}.$$
(You should try to think about the magnitude of the approximation error but a rigorous
proof is not required.)
(c) Use the results of part (b) to derive an approximation for the jackknife estimator of the
standard error. Comment on the differences between the two estimators; in particular, why
is there a difference? (Hint: What type of model – parametric or non-parametric – are we
assuming for the two standard error estimators?)
(d) For the air conditioning data considered in Assignment #1, compute the two standard
error estimates for the parameter λ in the Exponential model (f(x; λ) = λ exp(−λx) for
x ≥ 0). Do these two estimates tell you anything about how well the Exponential model fits
the data?
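A minimal sketch for part (d) (assuming the data are in a vector x; here $\hat{\lambda} = 1/\bar{x}$ and $\ell''(x;\lambda) = -1/\lambda^2$, so the Fisher information standard error reduces to $\hat{\lambda}/\sqrt{n}$):

  n <- length(x)
  lam <- 1/mean(x)                      # MLE of lambda
  se.fisher <- lam/sqrt(n)              # {-sum l''(x_i; lam)}^(-1/2) = lam/sqrt(n)
  loo <- (n - 1)/(sum(x) - x)           # leave-one-out MLEs 1/xbar_{-i}
  se.jack <- sqrt((n - 1)/n*sum((loo - mean(loo))^2))   # jackknife SE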
4. Suppose that X1, · · · , Xn are independent continuous random variables with density
f(x; θ) where θ is real-valued. We are often not able to observe the Xi's exactly; rather, we observe only which region Bk (k = 1, · · · , m) they belong to. An example of this is interval censoring in
survival analysis where we are unable to observe an exact failure time but know that the
failure occurs in some finite time interval. Intuitively, we should be able to estimate θ more
efficiently with the actual values of {Xi}; in this problem, we will show that this is true (at
least) for MLEs.
Assume that B1, · · · , Bm are disjoint sets such that $P(X_i \in \cup_{k=1}^m B_k) = 1$. Define independent discrete random variables Y1, · · · , Yn where Yi = k if Xi ∈ Bk; the probability mass function of Yi is
$$p(k;\theta) = P_\theta(X_i \in B_k) = \int_{B_k} f(x;\theta)\,dx \qquad \text{for } k = 1,\cdots,m.$$
Also define
$$I_X(\theta) = \text{Var}_\theta\left[\frac{\partial}{\partial\theta}\ln f(X_i;\theta)\right] = \int_{-\infty}^\infty \left[\frac{\partial}{\partial\theta}\ln f(x;\theta)\right]^2 f(x;\theta)\,dx$$
$$I_Y(\theta) = \text{Var}_\theta\left[\frac{\partial}{\partial\theta}\ln p(Y_i;\theta)\right] = \sum_{k=1}^m \left[\frac{\partial}{\partial\theta}\ln p(k;\theta)\right]^2 p(k;\theta).$$
Under the standard MLE regularity conditions, the MLE of θ based on X1, · · · , Xn will have variance approximately $1/\{nI_X(\theta)\}$ while the MLE based on Y1, · · · , Yn will have variance approximately $1/\{nI_Y(\theta)\}$.
(a) Assume the usual regularity conditions for f(x; θ), in particular, that f(x; θ) can be
differentiated with respect to θ inside integral signs with impunity! Show that $I_X(\theta) \ge I_Y(\theta)$ and indicate under what conditions there will be strict inequality.
Hints: (i) f(x; θ)/p(k; θ) is a density function on Bk.
(ii) For any function g,
$$\int_{-\infty}^\infty g(x)f(x;\theta)\,dx = \sum_{k=1}^m p(k;\theta)\int_{B_k} g(x)\,\frac{f(x;\theta)}{p(k;\theta)}\,dx.$$
(iii) For any random variable U, $E(U^2) \ge [E(U)]^2$ with strict inequality unless U is constant.
(b) Under what conditions on B1, · · · , Bm will $I_X(\theta) \approx I_Y(\theta)$?
5. In seismology, the Gutenberg-Richter law states that, in a given region, the number of
earthquakes N greater than a certain magnitude m satisfies the relationship
log10(N) = a − b × m
for some constants a and b; the parameter b is called the b-value and characterizes the seismic
activity in a region. The Gutenberg-Richter law can be used to predict the probability of
large earthquakes, although this is a very crude instrument. On Quercus, there is a file containing magnitudes for 433 earthquakes in California of magnitude 5.0 and greater (rounded to the nearest tenth) from 1932–1992.
(a) If we have earthquakes of (exact) magnitudes M1, · · · , Mn greater than some known m0,
the Gutenberg-Richter law suggests that M1, · · · , Mn can be modeled as independent random
variables with a shifted Exponential density
$$f(x;\beta) = \beta\exp(-\beta(x - m_0)) \qquad \text{for } x \ge m_0$$
where β = b × ln(10) and m0 is assumed known. However, if the magnitudes are rounded
to the nearest δ then they are effectively discrete random variables taking values xk =
m0 + δ/2 + kδ for k = 0, 1, 2, · · · with probability mass function
$$p(x_k;\beta) = P(m_0 + k\delta \le M < m_0 + (k+1)\delta) = \exp(-\beta k\delta) - \exp(-\beta(k+1)\delta)$$
$$= \exp(-\beta(x_k - m_0 - \delta/2))\,\{1 - \exp(-\beta\delta)\} \qquad \text{for } k = 0, 1, 2, \cdots.$$
If X1, · · · , Xn are the rounded magnitudes, find the MLE of β. (There is a closed-form
expression for the MLE in terms of the sample mean of X1, · · · , Xn.)
(b) Compute the MLE of β for the earthquake data (using m0 = 4.95 and δ = 0.1) as well
as estimates of its standard error using (i) the Fisher information and (ii) the jackknife. Use
these to construct approximate 95% confidence intervals for β. How similar are the intervals?
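A minimal sketch of the MLE computation (it assumes the magnitudes are in a vector x and maximizes the log-likelihood numerically; the search interval is an arbitrary choice, and the result should agree with the closed-form expression from part (a)):

  ## log-likelihood of the rounded magnitudes from part (a)
  loglik <- function(beta, x, m0=4.95, delta=0.1)
    sum(-beta*(x - m0 - delta/2) + log(1 - exp(-beta*delta)))
  betahat <- optimize(loglik, interval=c(0.01, 20), x=x, maximum=TRUE)$maximum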
6. In genetics, the Hardy-Weinberg equilibrium model characterizes the distributions of
genotype frequencies in populations that are not evolving and is a fundamental model of
population genetics. In particular, for a genetic locus with two alleles A and a, the frequencies
of the genotypes AA, Aa, and aa are
$$P_\theta(\text{genotype} = AA) = \theta^2, \quad P_\theta(\text{genotype} = Aa) = 2\theta(1-\theta), \quad P_\theta(\text{genotype} = aa) = (1-\theta)^2$$
where θ, the frequency of the allele A in the population, is unknown.
In a sample of n individuals, suppose we observe XAA = x1, XAa = x2, and Xaa = x3
individuals with genotypes AA, Aa, and aa, respectively. Then the likelihood function is
$$L(\theta) = \{\theta^2\}^{x_1}\,\{2\theta(1-\theta)\}^{x_2}\,\{(1-\theta)^2\}^{x_3}.$$
(The likelihood function follows from the fact that we can write
$$P_\theta(\text{genotype} = g) = \{\theta^2\}^{I(g=AA)}\,\{2\theta(1-\theta)\}^{I(g=Aa)}\,\{(1-\theta)^2\}^{I(g=aa)};$$
multiplying these together over the n individuals in the sample gives the likelihood function.)
(a) Find the MLE of θ and give an estimator of its standard error using the observed Fisher
information.
(b) A useful family of prior distributions for θ is the Beta family:
$$\pi(\theta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\theta^{\alpha-1}(1-\theta)^{\beta-1} \qquad \text{for } 0 \le \theta \le 1$$
where α > 0 and β > 0 are hyperparameters. What is the posterior distribution of θ given
XAA = x1, XAa = x2, and Xaa = x3?