COMS 4771 HW1
You are allowed to work in groups of (at max) three students. Only one submission per group
is required by the due date. Name and UNI of all group members must be clearly specified on the
homework. You must cite all the references you used to do this homework. You must show your
work to receive full credit.
1 [Maximum Likelihood Estimation] Here we shall examine some properties of Maximum
Likelihood Estimation (MLE).
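Throughout, it may help to recall the general recipe: given i.i.d. samples x1, . . . , xn from a density p(x|θ), the maximum likelihood estimate of θ is θ_ML := arg max_θ Π_{i=1}^n p(x_i|θ) = arg max_θ Σ_{i=1}^n log p(x_i|θ).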
(i) Consider the density p(x|θ) ∝ x² e^(−x/θ) if x ≥ 0, and 0 otherwise, for some θ > 0. Suppose that n samples x1, . . . , xn are drawn i.i.d. from p(x|θ). What is the MLE of θ given the samples?
(ii) Consider the density p(x|θ) ∝ 1 if −θ ≤ x ≤ θ, and 0 otherwise, for some θ > 0. Suppose that n samples x1, . . . , xn are drawn i.i.d. from p(x|θ). What is the MLE of θ given the samples?
(iii) Recall the Gaussian density: p(x|µ, σ²) := (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)), for some mean parameter µ ∈ R and variance parameter σ² > 0. Suppose that n samples x1, . . . , xn are drawn i.i.d. from p(x|µ, σ²). Show that if µ is unknown, then the MLE σ²_ML is not an unbiased estimator of the variance σ² for all sample sizes n. What simple modification can we make to the estimate to make it unbiased?
(iv) Show that for the MLE θ_ML of a parameter θ ∈ R^d and any known differentiable function g : R^d → R^k, the MLE of g(θ) is g(θ_ML). From this result, infer the MLE for the standard deviation (σ) in the same setting as in Part (iii).
2 [Universality of Decision Trees]
(i) Show that any binary classifier g : {0, 1}^D → {0, 1} can be implemented as a decision tree classifier. That is, for any classifier g there exists a decision tree classifier T with k nodes n1, . . . , nk (each ni with a corresponding threshold ti), such that g(x) = T(x) for all x ∈ {0, 1}^D.
(ii) What is the best possible bound one can give on the maximum height of such a decision
tree T (from part (i))? For what function g is the bound tight?
3 [Designing the optimal predictor for continuous output spaces] We studied in class that the “Bayes Classifier” f(x) := arg max_y P[Y = y | X = x] is optimal in the sense that it minimizes generalization error over the underlying distribution, that is, among all classifiers g it maximizes E_{x,y}[1[g(x) = y]].
But what can we say when the output space Y is continuous?
Consider predictors of the kind g : X → R that predict a real-valued output for a given input
x ∈ X. One intuitive way to define the quality of such a predictor g is as
Q(g) := E_{x,y}[(g(x) − y)²].
Observe that one would want a predictor g with the lowest Q(g).
Show that if one defines the predictor as f(x) := E[Y |X = x], then Q(f) ≤ Q(g) for any g,
thereby showing that f is the optimal predictor with respect to Q for continuous output spaces.
4 [Analyzing iterative optimization] In this problem, we will analyze the Richardson iteration (from HW0) for finding β ∈ R^d that (approximately) minimizes ‖Aβ − b‖₂² for a given matrix A ∈ R^(n×d) and vector b ∈ R^n.
Recall the Richardson iteration, given as follows:
– Initially, β^(0) = (0, . . . , 0) ∈ R^d is the zero vector in R^d.
– For k = 1, 2, . . . , N:
  - Compute β^(k) := β^(k−1) + η A^T (b − A β^(k−1)).
Above, η > 0 is a fixed positive number called the step size, and N is the total number of iterations. Define M := A^T A and v := A^T b.
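For concreteness, the recursion above translates directly into code; the following is a minimal Python/NumPy sketch (the names A, b, eta, and N are placeholders, and the function only illustrates the update rule stated above):

import numpy as np

def richardson(A, b, eta, N):
    # beta^(0) is the zero vector in R^d
    beta = np.zeros(A.shape[1])
    for k in range(N):
        # beta^(k) := beta^(k-1) + eta * A^T (b - A beta^(k-1))
        beta = beta + eta * A.T @ (b - A @ beta)
    return beta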
(i) Show that the matrix M is symmetric positive semi-definite.
Throughout, assume that the eigenvalues of M, denoted by λ1, . . . , λd, satisfy λi < 1/η
for all i = 1, . . . , d.
(ii) Prove (e.g., using mathematical induction) that, for any positive integer N,
β^(N) = η Σ_{k=0}^{N−1} (I − ηM)^k v.
(Here, for a square matrix B, we have B⁰ = I, B¹ = B, B² = BB, B³ = BBB, and so on.)
(iii) What are the eigenvalues of η Σ_{k=0}^{N−1} (I − ηM)^k? Give your answer in terms of λ1, . . . , λd, η, and N.
(iv) Let β̂ be any vector in the range of M satisfying Mβ̂ = v. Prove that
‖β^(N) − β̂‖₂² ≤ e^(−2ηλ_min N) ‖β̂‖₂²,
where λ_min is the smallest non-zero eigenvalue of M.
Hint: You may use the fact that 1 + x ≤ e^x for any x ∈ R.
This implies that as the number of iterations N increases, the difference between our estimate β^(N) and β̂ decreases exponentially!
5 [A comparative study of classification performance of handwritten digits] Download the
datafile hw1data.mat from the class website. This datafile contains 10,000 images (each of
size 28x28 pixels = 784 dimensions) of handwritten digits along with the associated labels.
Each handwritten digit belongs to one of the 10 possible categories {0, 1, . . . , 9}. There are
two variables in this datafile: (i) Variable X is a 10,000x784 data matrix, where each row is
a sample image of a handwritten digit. (ii) Variable Y is the 10,000x1 label vector where the
i-th entry indicates the label of the i-th sample image in X.
Special note for those who are not using Matlab: Python users can use scipy to read in the mat file, R users can use the R.matlab package, and Julia users can use the MAT.jl package.
To visualize this data (in Matlab): say you want to see the actual handwritten character image
of the 77th data sample. You may run the following code (after the data has been loaded):
figure;
imagesc(1-reshape(X(77,:),[28 28])’);
colormap gray;
To see the associated label value:
Y(77)
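The same steps for Python users, as a sketch (assuming numpy, scipy, and matplotlib are available; note that Python indexing is 0-based, so the 77th sample is row 76):

import scipy.io
import matplotlib.pyplot as plt

data = scipy.io.loadmat('hw1data.mat')
X, Y = data['X'], data['Y']
plt.imshow(X[76, :].reshape(28, 28), cmap='gray_r')   # 'gray_r' inverts the colors, like the 1-x trick above
plt.show()
print(Y[76])   # the associated label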
(i) Create a probabilistic classifier (as discussed in class) to solve the handwritten digit
classification problem. The class conditional densities of your probabilistic classifier
should be modeled by a Multivariate Gaussian distribution. It may help to recall that the
MLE for the parameters of a Multivariate Gaussian are:
µ_ML := (1/n) Σ_{i=1}^n x_i
Σ_ML := (1/n) Σ_{i=1}^n (x_i − µ_ML)(x_i − µ_ML)^T
You must submit your code via Courseworks to receive full credit.
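As an illustration of the two estimates above in Python/NumPy (only a sketch: Xc here denotes the training images of a single class, selected for example by a boolean mask on Y; any equivalent computation is fine):

import numpy as np
import scipy.io

data = scipy.io.loadmat('hw1data.mat')
X, Y = data['X'].astype(float), data['Y'].ravel()
Xc = X[Y == 0]                                  # training images of class 0, as an example
mu_ml = Xc.mean(axis=0)                         # (1/n) * sum_i x_i
centered = Xc - mu_ml
sigma_ml = (centered.T @ centered) / len(Xc)    # (1/n) * sum_i (x_i - mu)(x_i - mu)^T
# Note: with 784-dimensional pixel data this covariance may be singular, so some
# regularization (e.g., adding a small constant to the diagonal) may be needed
# before inverting it or evaluating the Gaussian density.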
(ii) Create a k-Nearest Neighbor classifier (with Euclidean distance as the metric) to solve
the handwritten digit classification problem.
You must submit your code via Courseworks to receive full credit.
(iii) Which classifier (the one developed in Part (i) or the one developed in Part (ii)) is better?
You must justify your answer with appropriate performance graphs demonstrating the
superiority of one classifier over the other. Example things to consider: you should
evaluate how the classifier behaves on a holdout ‘test’ sample for various splits of the
data; how does the training sample size affect the classification performance?
(iv) As discussed in class, there are several metrics one can use in Nearest Neighbor classification. Do a similar analysis to justify which of the three metrics (L1, L2, or L∞) is better for the handwritten digit classification problem.
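As a sketch of the three candidate metrics in Python/NumPy (x and z below are placeholder vectors standing in for a test image and a training image):

import numpy as np

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.0, 2.5, 1.0])
d1   = np.linalg.norm(x - z, ord=1)        # L1 (sum of absolute differences)
d2   = np.linalg.norm(x - z, ord=2)        # L2 (Euclidean distance)
dinf = np.linalg.norm(x - z, ord=np.inf)   # L-infinity (maximum absolute difference)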
6 [Understanding model complexity and overfitting] Here we will empirically study the
tradeoff between model complexity and generalizability.
(i) Build a decision tree classifier for the digit dataset that we already introduced in Question
5. In building your decision tree, you may use any reasonable uncertainty measure to
determine the feature and threshold to split at in each cell. Make sure the depth of the
tree is adjustable with hyperparameter K.
You must submit your code via Courseworks to receive full credit.
(ii) Ensure that there is a random split between training and test data. Plot the training error
and test error as a function of K.
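For instance, if you use scikit-learn's DecisionTreeClassifier as a stand-in for your own tree (an assumption for illustration only; your own implementation from Part (i) works the same way), the depth sweep for this part might look roughly like the following sketch:

import scipy.io
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = scipy.io.loadmat('hw1data.mat')
X, Y = data['X'], data['Y'].ravel()
X_tr, X_te, y_tr, y_te = train_test_split(X, Y, test_size=0.2)   # random train/test split
depths = range(1, 21)
train_err, test_err = [], []
for K in depths:
    tree = DecisionTreeClassifier(max_depth=K).fit(X_tr, y_tr)
    train_err.append(1 - tree.score(X_tr, y_tr))
    test_err.append(1 - tree.score(X_te, y_te))
plt.plot(depths, train_err, label='training error')
plt.plot(depths, test_err, label='test error')
plt.xlabel('tree depth K')
plt.ylabel('error')
plt.legend()
plt.show()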
(iii) Do the trends change for different random splits of training and test data?
(iv) How do you explain the difference in the behavior of training and testing error as a
function of K?
(v) Based on your analysis, what is a good setting of K if you were to deploy your decision
tree classifier to classify handwritten digits?