
Homework Set Three
ECE 271A
Department of Electrical and Computer Engineering

1. In this problem we will consider the issue of linear regression and the connections between maximum
likelihood and least squares solutions. Consider a problem where we have two random variables Z and
X, such that
z = f(x, θ) + ε,        (1)

where f is a polynomial with parameter vector θ,

f(x, θ) = Σ_{k=0}^{K} θ_k x^k,        (2)

and ε is a Gaussian random variable of zero mean and variance σ^2. Our goal is to compute the best
estimate of the function given an iid sample D = {(D_x, D_z)} = {(x_1, z_1), ..., (x_n, z_n)}.
a) Formulate the problem as one of least squares, i.e. define z = (z_1, ..., z_n)^T,

Φ = [ 1  x_1  ...  x_1^K ]
    [ .   .          .   ]
    [ .   .          .   ]
    [ 1  x_n  ...  x_n^K ]

and find the value of θ that minimizes

||z − Φθ||^2.
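As a quick numerical sanity check of the solution derived in a), a minimal MATLAB sketch, assuming hypothetical vectors x and z of n training points and a hypothetical polynomial order K (none of these are specified numerically in the problem):

```matlab
% Sanity-check sketch (hypothetical x, z, K): build Phi and solve the least
% squares problem numerically.
n   = numel(x);
Phi = zeros(n, K + 1);
for k = 0:K
    Phi(:, k + 1) = x(:).^k;        % columns 1, x, x^2, ..., x^K
end
theta = Phi \ z(:);                 % least squares minimizer of ||z - Phi*theta||^2
```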
b) Formulate the problem as one of ML estimation, i.e. write down the likelihood function P_{Z|X}(z|x; θ),
and compute the ML estimate, i.e. the value of θ that maximizes P_{Z|X}(D_z|D_x; θ). Show that this is
equivalent to a).
c) (The advantage of the statistical formulation is that it makes the assumptions explicit. We will now
challenge some of these.) Assume that instead of a fixed variance σ^2 we now have a variance that
depends on the sample point, i.e.

z_i = f(x_i, θ) + ε_i,        (3)

where ε_i ∼ N(0, σ_i^2). This means that our sample is independent but no longer identically distributed.
It also means that we have different degrees of confidence in the different measurements (z_i, x_i). Redo
b) under these conditions.
d) Consider the weighted least squares problem where the goal is to minimize

(z − Φθ)^T W (z − Φθ),

where W is a symmetric matrix. Compute the optimal θ for this situation. What is the equivalent
maximum likelihood problem? Rewrite the model (1), making explicit all the assumptions that lead to
the new problem. What is the statistical interpretation of W?
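As a check on the answer to d), one possible MATLAB sketch, assuming Phi and z are built as in a) and W is a given symmetric positive definite weight matrix; the one-liner below uses the standard weighted normal equations, which the derivation requested here should recover:

```matlab
% Sanity-check sketch for weighted least squares (assumes W symmetric positive definite).
theta_w = (Phi' * W * Phi) \ (Phi' * W * z);   % minimizer of (z - Phi*theta)' * W * (z - Phi*theta)
```

MATLAB's lscov(Phi, z, V) should return the same estimate when V = inv(W) is passed as the noise covariance matrix.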
e) The L2 norm is known to be prone to large estimation error if there are outliers in the training sample.
These are training examples (z_i, x_i) for which, due to measurement errors or other extraneous causes,
|z_i − f(x_i, θ)| is much larger than for the remaining examples (the inliers). In fact, it is known that a
single outlier can completely derail the least squares solution, a highly undesirable behavior. It is also
well known that other norms lead to much more robust estimators. One such distance metric is the
L1 norm

L_1 = Σ_i |z_i − Σ_k θ_k x_i^k|.

In the maximum likelihood framework, what is the statistical assumption that leads to the L1 norm?
Once again, rewrite the model (1), making explicit all the assumptions that lead to the new problem.
Can you justify why this alternative formulation is more robust? In particular, provide a justification
for i) why the L1 norm is more robust to outliers, and ii) why the associated statistical model (1) copes
better with them.
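A small numerical illustration of this robustness (not part of the required answer), fitting a degree-1 polynomial to made-up data with one gross outlier; the L1 fit is obtained by direct minimization of the L1 cost with fminsearch:

```matlab
% Illustration only: L2 vs. L1 fit with a single outlier (all data made up).
x  = (0:0.5:5)';
z  = 2 + 3 * x + 0.1 * randn(size(x));   % inliers around the line z = 2 + 3x
z(end) = 40;                             % one gross outlier
Phi = [ones(size(x)) x];
theta_L2 = Phi \ z;                          % least squares (L2) fit
L1cost   = @(t) sum(abs(z - Phi * t));       % L1 objective
theta_L1 = fminsearch(L1cost, theta_L2);     % L1 fit, initialized at the L2 solution
% theta_L1 typically stays close to [2; 3], while theta_L2 is pulled toward the outlier.
```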
2.
a) Problem 3.5.17 in DHS.
b) What is the ML estimate for θ in this problem? What is the MAP estimate for θ in this problem?
Do you see any advantage in favoring one of these estimates over the other? How does that relate
to the uniform prior that was assumed for θ?
3. Consider problem 3 of the previous assignment, i.e. a random variable X such that P_X(k) = π_k, k ∈
{1, ..., N}, n independent observations from X, and a random vector C = (C_1, ..., C_N)^T where C_k is the
number of times that the observed value is k (i.e. C is the histogram of the sample of observations).
We have seen that C has a multinomial distribution

P_{C_1,...,C_N}(c_1, ..., c_N) = (n! / Π_{k=1}^{N} c_k!) Π_{j=1}^{N} π_j^{c_j}.
In this problem we are going to compute MAP estimates for this model. Notice that the parameters
are probabilities and, therefore, not every prior will be acceptable here (we need π_j ≥ 0 and Σ_j π_j = 1
for the prior to be valid). One distribution over vectors π = (π_1, ..., π_N)^T that satisfies this constraint is
the Dirichlet distribution
P_{Π_1,...,Π_N}(π_1, ..., π_N) = (Γ(Σ_{j=1}^{N} u_j) / Π_{j=1}^{N} Γ(u_j)) Π_{j=1}^{N} π_j^{u_j − 1},
where the u_j are a set of hyperparameters (parameters of the prior) and

Γ(x) = ∫_0^∞ e^{−t} t^{x−1} dt

is the Gamma function.
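As an aside, the Dirichlet density is easy to evaluate numerically; a minimal MATLAB sketch, using gammaln to avoid overflow, with made-up hyperparameters u and a made-up point p on the probability simplex:

```matlab
% Sketch: Dirichlet log-density at a probability vector p (made-up values).
u = [2 2 5];                     % hypothetical hyperparameters u_j
p = [0.2 0.3 0.5];               % a point with p_j >= 0 and sum(p) = 1
logP = gammaln(sum(u)) - sum(gammaln(u)) + sum((u - 1) .* log(p));
P = exp(logP);                   % value of the Dirichlet density at p
```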
a) Derive the MAP estimator for the parameters π_i, i = 1, ..., N, using the Dirichlet prior.
b) Compare this estimator with the ML estimator derived in the previous assignment. What is the use
of this prior equivalent to, in terms of the ML solution?
c) What is the effect of the prior as the number of samples n increases? Does this make intuitive sense?
d) In this problem and problem 2 we have seen two ways of avoiding the computational complexity
of computing a fully Bayesian solution: i) to rely on a non-informative prior, and ii) to rely on an
informative prior and compute the MAP solution. Qualitatively, what do you have to say about the
results obtained with the two solutions? What does this tell you about the robustness of the Bayesian
framework?
4. (computer) (Note: This week’s computer problem requires more computation time than the
previous ones. For that reason, I will give you more time to finish it up, i.e. it will be repeated in
homework 4. Do not, however, take this as a reason to not start working on it right away. You may
simply not have time to finish if you leave it to the days prior to the deadline.)
This week we will continue trying to classify our cheetah example. Once again we use the decomposition into 8 × 8 image blocks, compute the DCT of each block, and zig-zag scan. We also
continue to assume that the class-conditional densities are multivariate Gaussians of 64 dimensions.
The goal is to understand the benefits of a Bayesian solution. For this, using the training data in
TrainingSamplesDCT new 8.mat we created 4 datasets of the sizes given in the table below. They are
available in the file TrainingSamplesDCT subsets 8.mat:
Dataset    cheetah examples    grass examples
D1         75                  300
D2         125                 500
D3         175                 700
D4         225                 900
We start by setting up the Bayesian model. To simplify things a bit we are going to cheat a little.
With respect to the class-conditional,

P_{x|μ,Σ}(x|μ, Σ) = G(x, μ, Σ),

we assume that we know the covariance matrix (like Bayes might) but simply replace it by the sample
covariance of the training set, D, that we are working with (and hope he doesn’t notice)[1]. That is, we
use
Σ = (1/N) Σ_{i=1}^{N} ( x_i − (1/N) Σ_{j=1}^{N} x_j ) ( x_i − (1/N) Σ_{j=1}^{N} x_j )^T.
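In MATLAB, this sample covariance can be computed directly; a minimal sketch, assuming X is an N x 64 matrix whose rows are the training vectors x_i of one class (a hypothetical variable name):

```matlab
% Sketch: sample covariance as defined above (X is N x 64, rows are the x_i).
N     = size(X, 1);
xbar  = mean(X, 1);                        % sample mean, (1/N) * sum_i x_i
Xc    = X - repmat(xbar, N, 1);            % centered data
Sigma = (Xc' * Xc) / N;                    % (1/N) * sum_i (x_i - xbar)(x_i - xbar)'
% Equivalently, cov(X, 1) normalizes by N rather than N - 1.
```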
We are, however, going to assume an unknown mean and a Gaussian prior of mean μ_0 and covariance Σ_0,

P_μ(μ) = G(μ, μ_0, Σ_0).
Regarding the mean μ_0, we assume that it is zero for all coefficients other than the first (DC), while for
the DC coefficient we consider two different strategies:
• strategy 1: μ_0 is smaller for the (darker) cheetah class (μ_0 = 1) and larger for the (lighter) grass
class (μ_0 = 3);

• strategy 2: μ_0 is equal to half the range of amplitudes of the DCT coefficient for both classes
(μ_0 = 2).

For the covariance Σ_0 we assume a diagonal matrix with (Σ_0)_ii = α w_i. The mean μ_0 (for the two
strategies) and the weights w_i are given in the files Prior 1.mat (strategy 1) and Prior 2.mat (strategy
2).

[1] There are appropriate ways to set up covariance priors but things would get a little more complex, so we avoid
them.
a) Consider the training set D1 and strategy 1 (we will use this strategy until d)). For each class,
compute the covariance Σ of the class-conditional, and the posterior mean μ_1 and covariance Σ_1 of

P_{μ|T}(μ|D1) = G(μ, μ_1, Σ_1).

Next, compute the parameters of the predictive distribution

P_{x|T}(x|D1)

for each of the classes. Then, using ML estimates for the class priors, plug everything into the Bayesian
decision rule, classify the cheetah image, and measure the probability of error. All of the parameters above
are functions of α. Repeat the procedure for the values of α given in the file Alpha.mat. Plot the curve of
the probability of error as a function of α. Can you explain the results?
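For reference, a minimal sketch of how the posterior and predictive parameters for one class could be computed, assuming the standard conjugate results for a Gaussian likelihood with known covariance and a Gaussian prior on the mean (see DHS, Section 3.4); X, mu0, w and alpha are hypothetical variable names for the class training data and the values loaded from the prior and alpha files:

```matlab
% Sketch (standard conjugate-Gaussian results, known covariance): posterior
% mean/covariance of mu and parameters of the predictive distribution.
n      = size(X, 1);
xbar   = mean(X, 1)';                        % sample mean (column vector)
Sigma  = cov(X, 1);                          % sample covariance, normalized by n
Sigma0 = diag(alpha * w);                    % prior covariance, (Sigma0)_ii = alpha * w_i
A      = Sigma0 + Sigma / n;
mu1    = Sigma0 * (A \ xbar) + (Sigma / n) * (A \ mu0(:));   % posterior mean
Sigma1 = Sigma0 * (A \ (Sigma / n));                         % posterior covariance
SigmaPred = Sigma + Sigma1;        % predictive: P_{x|T}(x|D) = G(x, mu1, Sigma + Sigma1)
```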
b) For D1, compute the probability of error of an ML procedure identical to the one we used last
week. Compare with the results of a). Can you explain? See “what to hand in” below.
c) Repeat a) with the MAP estimate of μ, i.e. using

P_{x|T}(x|D1) = P_{x|μ}(x|μ_MAP),

where

μ_MAP = arg max_μ P_{μ|T}(μ|D1).

Compare the curve with those obtained above. Can you explain the results? See “what to hand in”
below.
d) Repeat a) to c) for each of the datasets Di, i = 2,..., 4. Can you explain the results? See “what to
hand in” below.
e) Repeat a) to d) under strategy 2 for the selection of the prior parameters. Comment on the differences
between the results obtained with the two strategies. See “what to hand in” below.
Some Tips:

1. The priors for the mu's are different for background and foreground. In each file, the mu's are stored
in 'mu0_BG' and 'mu0_FG' for background and foreground respectively.

2. When displaying the curves of PE vs. α, it is better to use 'log' mode on the α axis. There are
two MATLAB commands you can use for this, assuming the x-axis represents α: 1) set(gca,
'XScale', 'log'); or 2) semilogx(...). Please refer to the MATLAB help for details.

3. Remember to use a fast Gaussian evaluation inside the loop (see the sketch below).
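Regarding tip 3, a minimal sketch of a vectorized Gaussian log-likelihood evaluation; X (an M x 64 matrix of feature vectors), mu and Sigma are hypothetical variable names, and the point is to precompute the Cholesky factor and log-determinant once per class rather than inside the loop over image blocks:

```matlab
% Sketch: log of G(x, mu, Sigma) for all rows of X at once.
% Precompute once per class, outside the loop over blocks:
d      = numel(mu);
R      = chol(Sigma);                        % Sigma = R' * R, R upper triangular
logDet = 2 * sum(log(diag(R)));              % log|Sigma| without under/overflow
const  = -0.5 * (d * log(2 * pi) + logDet);
% Evaluate for the M x d matrix X:
Xc   = X - repmat(mu(:)', size(X, 1), 1);    % center the data
Z    = Xc / R;                               % Z = Xc * inv(R)
logG = const - 0.5 * sum(Z .* Z, 2);         % M x 1 vector of log-likelihoods
```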
What to hand in: To minimize the number of pages in your report I would advise handing in,
for each dataset and each strategy, a single plot with the curves of classification error as a function of
α for: 1) the solution based on the predictive distribution, 2) the MAP solution, and 3) the ML solution.
The latter will obviously be a constant (horizontal) line. Try to explain 1) the relative behavior of these
three curves, 2) how that behavior changes from dataset to dataset, and 3) how all of the above change
when strategy 1 is replaced by strategy 2.