$30
CSE512 – Machine Learning
Homework 2
This homework contains two questions. The first question is about parameter estimation. In the second question, you will implement logistic regression using stochastic gradient descent (SGD). The maximum number
of points is 100 points. For this homework, you should review some material on parameter estimation,
logistic regression, stochastic gradient decent, feature normalization, and performance measurement.
1 Question 1 – Parameter estimation (45 points)
This question uses a discrete probability distribution known as the Poisson distribution. A discrete random
variable X follows a Poisson distribution with parameter λ if
p(X = k|λ) = λ
k
k!
e
−λ
k ∈ {0, 1, 2, . . . }
The Poisson distribution is a useful discrete distribution which can be used to model the number of occurrences of something per unit time. For example, it can be used to model the number of customers arriving at
a bank per hour, or the number of calls handled by a telephone operator per minute.
1.1 Question 1.1 – MLE (15 points)
Assume the wait time for calling an Uber car is Poisson distributed (i.i.d) with parameter λ. You used Uber
seven times and record the wait times in each trip.
Trip 1 2 3 4 5 6 7
Wait time 4 12 3 5 6 9 17
Let X = (X1, . . . , Xn) be a random vector where Xi
is the number of delay minutes for your i
th trip:
1. (5 points) Give the log-likelihood function of X given λ.
2. (5 points) Compute the MLE for λ in the general case.
3. (5 point) Compute the MLE for λ using the observed X.
1.2 Question 1.2 – MAP (15 points)
Now let’s be Bayesian and put a prior over the parameter λ. You talk to your party-goer friends, who tell
you about the expected delays that they experience. You plot the distribution of the expected delays and your
extensive experience in statistics tells you that the plot resembles a Gamma distribution pdf. So you believe
a good prior distribution for λ may be a Gamma distribution. The Gamma distribution has pdf:
p(λ|α, β) = β
α
Γ(α)
λ
α−1
e
−βλ, λ > 0. (1)
Γ(α) is the Gamma function, which is the normalization factor that is introduced to have a proper probability
density function (i.e., sum to 1). Do not worry if you don’t know the explicit format of Γ(α); it is not
important for this question. Also, if λ ∼ Γ(α, β), then it has mean α/β and the mode is (α − 1)/β for
α > 1. Assume the prior distribution of λ is Γ(λ|α, β).
1. (8 points) Compute the posterior distribution over λ. Hint:
λ
Pn
i=1 Xi+α−1
e
−λ(n+β)
1
looks like a Gamma distribution! Is the rest of the expression constant with respect to λ? Working out
a messy integral can lead to the answer but it is unnecessary.
2. (7 points) Derive an analytic expression for the maximum a posterior (MAP) estimate of λ.
1.3 Question 1.3 – Estimator Bias (15 points)
In class, we saw that the maximum likelihood estimator for the variance of a Gaussian distribution:
σˆ
2
MLE =
1
N
X
N
i=1
(xi − µˆ)
2
is biased, and that an unbiased estimator of variance is:
σˆ
2
unbiased =
1
N − 1
X
N
i=1
(xi − µˆ)
2
For the Gaussian distribution, these estimators give similar results for large enough N, and it is unclear
whether one estimator is preferable to the other. In this problem, we will explore an example in which the
maximum likelihood estimate is dramatically superior to any unbiased estimator.
Suppose we are not quite interested in estimating λ. We are more interested in estimating a quantity that
is a nonlinear function of λ, namely η = e
−2λ
. Suppose we want to estimate η from a single observation
X ∼ P oisson(λ).
1. (5pts) Let ηˆ = e
−2X. Show that ηˆ is the maximum likelihood estimate of η.
2. (5pts) Show that the bias of ηˆ is e
−(1−1/e2
)λ − e
−2λ
. The following Taylor expansion may be useful:
e
x =
X∞
k=0
x
k
k!
3. (5pts) It turns out that (−1)X is the only unbiased estimate of η. Prove that it is indeed unbiased and
briefly explain why this is a bad estimator to use.
2 Question 2 – Logistic Regression (55 points)
In this question, you will implement Logistic Regression using Stochastic Gradient Descent (SGD). Suppose
the training data is {(X1
, Y 1
), · · · ,(Xn
, Y n
)}, where Xi
is a column vector of d dimensions and Y
i
is the
target label. For a column vector X, let X¯ denotes [X; 1], the vector obtained by appending 1 to the end
of X. θ is the set of parameters θ1, θ2, . . . , θk−1. Logistic regression for k classes assumes the following
probability function:
P(Y = i|X; θ) = exp(θ
T
i X¯)
1 + Pk−1
j=1 exp(θ
T
j X¯)
for i = 1, · · · , k − 1, (2)
P(Y = 0|X; θ) = 1
1 + Pk−1
j=1 exp(θ
T
j X¯)
. (3)
Logistic regression minimizes the negative average conditional log likelihood:
L(θ) = −
1
n
Xn
i=1
log(P(Y
i
|X¯i
; θ)). (4)
2
Here, log(·) is the natural logarithm. To minimize this loss function, we can use gradient descent:
θc = θc − η
∂L
∂θc
, for c = 1, . . . , k − 1 (5)
where ∂L
∂θc
= −
1
n
Xn
i=1
∂ log(P(Y
i
|X¯i
; θ))
∂θc
(6)
This gradient is computed by enumerating over all training data. This gradient can be approximated using a
batch of training data. Suppose B is a subset of {1, 2, · · · , n}
∂L
∂θc
≈ −
1
card(B)
X
i∈B
∂ log(P(Y
i
|X¯i
; θ))
∂θc
(7)
This leads to the following stochastic gradient descent algorithm:
Algorithm 1 Stochastic gradient descent for Logistic Regression
1: Inputs: {(Xi
, Y i
)}
n
i=1 (for data), m (for batch size), ηstart (initial learning rate), ηend (final learning
rate), max epoch (maximum number of epoches), (learning rate reduction criterion)
2: (i1, ..., in) = permute(1, ..., n)
3: Divide (i1, ..., in) into batches of size m or m + 1 . n might not be divisible by m,
4: . so some batches might have size m + 1
5: Randomly initialize θ
6: η ← ηstart
7: for epoch = 1, 2, ..., max epoch do
8: θ
old ← θ
9: for each batch B do
10: Update θ using Eqs. (5) and (7)
11: end for
12: if L(θ
old) − L(θ) < × L(θ
old) then // not much progress, try smaller learning rate
13: η ← η/10
14: end if
15: if η < ηend then Break
16: end if
17: end for
18: Outputs: θ.
3
2.1 Question 2.1 – Derivation (10 points)
Prove that:
∂ log(P(Y
i
|X¯i
; θ))
∂θc
= (δ(c = Y
i
) − P(c|X¯i
; θ))X¯i
. (8)
where δ(c = Y
i
) is the indicator function which takes a value of 1 if the class c equals the ground truth
label Y
i
, and 0 otherwise. Use Equation (8) to derive the gradient of the loss with respect to the parameters
θ1, θ2, . . . , θk−1.
2.2 Question 2.2 – Implementation (30 points)
Your task is to implement Logistic Regression with k classes using SGD.
2.2.1 (10 points)
Write a Python function with the signature:
[W, b] = logreg f it(X, y, m, eta start, eta end, epsilon, max epoch = 1000)
where
Inputs:
• X: a two dimensional Numpy array of size n × d, where n is the number of data points, and d the
dimension of the feature vectors.
• y: a Numpy vector of length n. y[i] is a categorical label corresponding to the data point X[i, :],
y[i] ∈ {0, 1, . . . , k − 1}. You can assume that the number of classes k is the maximum entry of y
plus 1.
• m, eta start, eta end, epsilon, max epoch are scalar parameters for the batch size, the starting learning rate, the ending learning rate, the learning rate reduction criterion, and the maximum number of
epochs as described in the algorithm above.
Outputs:
• W: a (k − 1)×d Numpy matrix, W[i, :] is the learned weight vector θi+1.
• b: a Numpy vector of length k − 1 for the biases.
2.2.2 (10 points)
Write a Python function with the following signature.
yˆ = logreg predict class(W, b, X)
This function takes the learned parameters W and b of a logistic regression classifier and test data X and
output the predicted categorical label yˆ for the test data X. See Question 2.2.1 for the expected formats for
W, b, X. The output yˆ should be a Numpy vector of length n, where n is the number of rows of data matrix
X. yˆ[i] is the predicted label for data point X[i, :].
4
2.2.3 (10 points)
Write a Python function with the following signature.
P = logreg predict prob(W, b, X)
This function takes the learned parameters W and b of a logistic regression classifier and test data X and
output the predicted probabilities. P is a Numpy matrix of size n × k, where P[i, j] is the probability that
X[i, :] belongs to class j.
2.2.4 Hints
Hint 1: You might want to implement Question 2.2.3 first. Questions 2.2.1 and 2.2.2 should make use of
2.2.3. You have to implement these functions yourself. You cannot use existing implementation in Python,
e.g., sklearn.linear model.LogisticRegression
Hint 2: To compute the probability values, you should use Eqs (2) and (3). To compute a probability
value of the form p =
exp(ai)
Pk−1
j=0 exp(aj )
, you can also use the equivalent formula: p =
exp(ai−a)
Pk−1
j=0 exp(aj−a)
, where
a = max{a0, . . . , ak−1}. In theory, the two formulations are the same, but the latter will help you avoid
numerical problems in the calculation. Note also that, exp(0) = 1.
Hint 3: Useful numpy functions that you can use for your implementation: np.dot, np.exp, np.log, np.sum,
np.mean.
2.3 Question 2.3 – Running and reporting results (15 points)
In this question, you will train a Logistic Regression model using the functions you implemented in Question 2, in order to predict whether a patient will recover from COVID-19. The COVID-19 dataset was compiled using the data provided by the Stony Brook University Hospital. This dataset contains real data from
patients that were diagnosed with COVID-19. For each patient, it includes a number of features, including
age, gender, blood pressure and other lab results collected at the admission time. All these are the features
can be used in order to forecast if a patient is likely to recover or not. This is a two-class classification
problem, and we will refer to the Death Outcome as the positive class.
Step 1
The first step is to load your training and test data that are provided in the files train.csv and test.csv
correspondingly. (Tip: To load the data from a csv file, you can use numpy.genfromtxt or csv.reader).
Inspect your data carefully: The first row of each csv file is the csv header. Each following row corresponds
to a patient. The columns correspond to the different measurements per patient. The last column indicates if
a patient died or not.
Load all the columns of train.csv except the last one to your feature matrix Xtrain and the last column to
your labels vector ytrain. Both Xtrain and ytrain should be numpy arrays. Convert Xtrain to float using
Xtrain = Xtrain.astype(np.float64) and ytrain to int using ytrain = ytrain.astype(int). Do the same for
your test data.
Step 2
In general, features can have different types of values and ranges, which can significantly affect the performance of the classifier. As a result, the first step after loading the data is to pre-process them and scale the
features in a similar range, using normalization techniques.
5
To normalize the features, you will use Z-score normalization (or Standardization). This method scales the
features by removing the mean µ and dividing by their standard deviation σ:
Xnorm =
X − µ
σ
You can use sklearn.preprocessing.StandardScaler for this step. In that case, you should use f it transform()
for Xtrain to fit and scale your training data, and then transform() for Xtest to scale the test data based
on the training mean and standard deviation. This is important, as the test data are seen as unknown during
training, so you should not normalize them using knowledge from them.
Step 3
Train a Logistic Regression classifier using your normalized training data Xtrain and ytrain, and the function
logreg f it() that you implemented in Question 2.2.1. Use m = 256, eta start = 0.01, eta end = 0.00001,
epsilon = 0.0001 and max epoch = 1000.
2.3.1 Question
(a) Plot the loss function, as given in Eq. (3), against the number of epochs during training. Comment on
what you see.
(b) Use the learned parameters (weights W and biases b) to predict the labels yˆtrain for the training set
Xtrain and yˆtest for the test set Xtest. For each set, calculate and report the following performance
metrics:
• The accuracy, i.e. the number of correctly predicted labels over the total number of samples.
(You can use sklearn.metrics.accuracy score).
• The confusion matrix (You can use sklearn.metrics.confusion matrix).
• The accuracy computed based on the diagonal of the confusion matrix.
(c) Use the learned parameters (weights W and biases b) to estimate the probabilities of the positive class
(y = 1) for the test set. Calculate and report the following performance metrics for the test set:
• The average precision (you can use sklearn.metrics.average precision score).
• Plot the precision recall curve (you can use sklearn.metrics.precision recall curve).
2.3.2 Question
(a) Remove the feature normalization and train the classifier again. Plot the loss function against the
number of epochs during training as in Question 2.3.1(a). Do you see any difference with scaling
and not scaling the data (e.g. in terms of the speed of convergence, the number of epochs, the final
performance)?
(b) With feature normalization, try different values for the hyperparameters eta start, eta end, max epoch
and m. Comment on what you see when increasing or decreasing these parameters. For example, what
impact has a higher learning rate on the training, the number of epochs or the final performance?
(c) Choose your own values for eta start, eta end, max epoch and m and train the classifier again. Plot
the loss function against the number of epochs during training as in Question 2.3.1(a). Why did you
choose these values?
6
3 What to submit
You will need to submit both your code and your answers to questions on Blackboard. Put the answer file
and your python code in a folder named: SUBID FirstName LastName (e.g., 10947XXXX Barack Obama).
Zip this folder and submit the zip file on Blackboard. Your submission must be a zip file,
i.e, SUBID FirstName LastName.zip.
The answer file should be named: hw2-answers.pdf. You can use Latex if you wish, but it is not compulsory.
The first page of the hw2-answers.pdf should be the filled cover page at the end of this homework. The
remaining of the answer file should contain:
1. Answers to Question 1
2. Answers to Questions 2.1 and 2.3
Your Python code must be named hw2.py. It should contain the requested functions, exactly as described in
Question 2. For automated testing, it should be possible for us to import your functions using ‘from hw2
import ...’. You can submit other python/data files if necessary. Your code hw2.py can also include other
functions if necessary.
Make sure you follow the instructions carefully. Your will lose points if:
1. You do not attach the cover page. Your answer file is not named hw2-answers.pdf
2. Your functions do not have the exact signatures as instructed
3. Your functions cannot be imported for automatic testing
4. Your functions crash or fail to run
4 Cheating warnings
Don’t cheat. You must do the homework yourself, otherwise you won’t learn. You cannot ask and discuss
with students from previous years. You cannot look up the solution online.
7
Cover page for answers.pdf
CSE512 Spring 2021 - Machine Learning - Homework 2
Your Name:
Solar ID:
NetID email address:
Names of people whom you discussed the homework with: