CS322 Assignment 2
Written Questions:
1. 4.1
2. 4.2
3. 4.3
4. Congrats! You’re the mayor of Northfield! As mayor, you are interested in mitigating damage from potential flooding of the Cannon River. It is of the utmost importance that you begin preparations for flood control early (e.g., ordering sandbags) if you believe that there is going to be a flood this year. To help you make this prediction, you do the same thing you do every year: reach out to two teams of hydrologists from Carleton (team A and team B) to get their opinions. For simplicity, assume that the Cannon flooding in a given year is a binary outcome, and that these same teams have made predictions for the past 20 years. From your records, you know that over the past 20 years, the Cannon has flooded 4 times. This time, team A predicts that the Cannon will not flood, and points out that they have achieved 80% accuracy in predictions over the past 20 years. On the other hand, team B predicts that the Cannon will flood, but, according to your records, they have only achieved 60% prediction accuracy over the past 20 years. Which team’s prediction should you base your decision on? If you can’t decide, what additional information (if any) would you like to know from the teams? Ideally, your answer should include discussions of accuracy, precision, recall, constant prediction, etc. (A small worked sketch of these metrics follows this question.)
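For concreteness, here is a minimal sketch of these metrics applied to a constant predictor. The individual yearly outcomes below are made up for illustration; only the 4-floods-in-20-years base rate comes from the question.

```python
import numpy as np

# Hypothetical 20-year record: 16 non-flood years (0) and 4 flood years (1).
y_true = np.array([0] * 16 + [1] * 4)

# A "constant predictor" that always says "no flood".
y_pred = np.zeros_like(y_true)

accuracy = (y_true == y_pred).mean()              # 16/20 = 0.80
true_pos = ((y_true == 1) & (y_pred == 1)).sum()  # 0: it never predicts a flood
recall = true_pos / (y_true == 1).sum()           # 0/4 = 0.0: misses every flood
# Precision is undefined here: the constant predictor makes no positive predictions.
print(accuracy, recall)
```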
5. Criminal trials in the United States can be viewed as classifiers: juries attempt to make predictions based on evidence about the guilt or innocence of a defendant. Even well-intentioned juries make mistakes; the Innocence Project, for example, is a non-profit organization that “is committed to exonerating wrongly convicted people through the use of DNA testing”;[1] they have helped to overturn over 300 wrongful convictions. Blackstone’s Ratio,[2] a legal philosophy credited to the jurist William Blackstone, states:

   It is better that ten guilty persons escape than that one innocent suffer.

   Discuss how Blackstone’s Ratio relates to precision and recall. In your discussion, you should answer the question of whether Blackstone’s Ratio advocates for high-precision juries or high-recall juries.
6. Logistic regression estimates binary probabilities given a feature vector $x$. The probability that $y = 1$ is modeled by computing the sigmoid of the dot product of $w$ and $x$ plus a bias, i.e., $P(y = 1 \mid x) = \sigma(w^T x + b)$ and $P(y = 0 \mid x) = 1 - \sigma(w^T x + b)$. For a true datapoint $(x_i, y_i)$, we can compute the likelihood of the logistic regression parameters $w, b$ given $y_i$ (the true label) and $x_i$ (the vector of features) as
$$\text{Likelihood}(w, b \mid x_i, y_i) = P(y = y_i \mid w, b, x_i) = \sigma(w^T x_i + b)^{y_i} \cdot \big(1 - \sigma(w^T x_i + b)\big)^{1 - y_i}$$
Note that we have used $y_i$, the true label, as a “switch” variable. So, if $y_i$ is 1, the likelihood of the parameters is estimated by $\sigma(w^T x_i + b)$: the higher $w^T x_i + b$ is, the more likely $w$ and $b$ are. If $y_i$ is 0, the likelihood of the parameters is estimated by $1 - \sigma(w^T x_i + b)$, i.e., for datapoints with true label 0, the smaller $w^T x_i + b$ is, the more likely the parameters $w, b$ are.
Binary crossentropy is a loss function that attempts to maximize the log-likelihood of all of the datapoints. (Because the datapoints are assumed to be independent draws, we can compute their joint probability as a product, i.e., $P((x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots \mid w, b)$ is assumed to be equal to $P(x_1, y_1 \mid w, b) \cdot P(x_2, y_2 \mid w, b) \cdot P(x_3, y_3 \mid w, b) \cdots$.) Maximizing log likelihood is equivalent to minimizing negative log likelihood. Letting $z_i = w^T x_i + b$, the loss function (which we will minimize; lower loss is good!) for logistic regression with binary crossentropy loss is:
$$
\begin{aligned}
-\log\big(\text{Likelihood}(w, b \mid \{x_i, y_i\})\big) &= -\log\Big(\prod_i P(y = y_i \mid w, b, x_i)\Big) \\
&= -\sum_i \log\big(P(y = y_i \mid w, b, x_i)\big) \\
&= -\sum_i \log\big(\sigma(z_i)^{y_i} \cdot (1 - \sigma(z_i))^{1 - y_i}\big) \\
&= -\sum_i \big[\, y_i \cdot \log \sigma(z_i) + (1 - y_i) \log(\sigma(-z_i)) \,\big] \\
\mathcal{L}(\{x_i, y_i\}, w, b) &= \sum_i \big[\, {-y_i} \cdot \log \sigma(z_i) - (1 - y_i) \log(\sigma(-z_i)) \,\big]
\end{aligned}
$$
(The last two lines use the identity $1 - \sigma(z) = \sigma(-z)$.) Let the loss for datapoint $i$ be defined as the term in the sum corresponding to datapoint $i$, i.e., $L_i = -y_i \cdot \log \sigma(z_i) - (1 - y_i) \log(\sigma(-z_i))$, where $z_i = w^T x_i + b$.
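As a sanity check on the algebra, here is a minimal numpy sketch of the per-datapoint loss (not required by the assignment; the z values below are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(z, y):
    """Per-datapoint binary crossentropy, written exactly as derived above:
    L_i = -y * log(sigma(z)) - (1 - y) * log(sigma(-z)), with z = w^T x + b."""
    return -y * np.log(sigmoid(z)) - (1 - y) * np.log(sigmoid(-z))

# The loss is small when the sign of z agrees with the label...
print(bce_loss(2.0, 1))   # ~0.127
# ...and large when it does not.
print(bce_loss(-2.0, 1))  # ~2.127
```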
(a) When $y_i = 1$, what is the algebraic expression for the loss $L_i$ incurred by a given setting of $w, b$ (in terms of $w, b, x_i$)?
(b) Say $w, b$ were set such that $w^T x_i + b$ was 3 for this example. What is $P(y_i = 1 \mid x_i)$? What is the raw value of the loss? Is this a good setting of the parameters for this example, or a bad setting?
(c) Say instead that $w, b$ were set such that $w^T x_i + b$ was $-5$ for this example. What is $P(y_i = 1 \mid x_i)$? What is the raw value of the loss? Is this a good setting of the parameters for this example, or a bad setting?
(d) Say instead that $w^T x_i + b$ was still $-5$, but $y_i$ was actually 0. What is $P(y_i = 0 \mid x_i)$? What is the algebraic expression for the loss incurred on this example for a given setting of $w, b$? What is the raw value of the loss?
(e) To minimize the negative log likelihood with respect to the vector $w$ and the bias value $b$ for a datapoint $(x_i, y_i)$, we may use a gradient-based method like gradient descent. Thus, we need to compute the gradient of the loss function with respect to each parameter. We will drop the subscript $i$ in this question for notational ease...
    i. What is $\frac{dL}{dz} = \frac{d}{dz}\big[ -y \cdot \log \sigma(z) - (1 - y) \log(\sigma(-z)) \big]$? Potentially helpful facts: $\frac{d}{dz}\sigma(z) = \sigma(z) \cdot (1 - \sigma(z))$ and $\frac{d}{dz}\log(z) = \frac{1}{z}$.
    ii. What is $\frac{dz}{db} = \frac{d}{db}(w^T x + b)$?
    iii. What is $\frac{dz}{dw_i} = \frac{d}{dw_i}(w^T x + b) = \frac{d}{dw_i}\big(\sum_j w_j x_j + b\big)$?
    iv. Wrapping up: what is the gradient of the loss $L$ with respect to the parameter $b$? (Note that the chain rule states: $\frac{dL}{db} = \frac{dL}{dz}\frac{dz}{db}$.)
    v. Wrapping up: what is the gradient of the loss $L$ with respect to the parameter $w_i$? (Note that the chain rule states: $\frac{dL}{dw_i} = \frac{dL}{dz}\frac{dz}{dw_i}$.)
Now that you’ve derived the gradient of the loss with respect to each parameter, you could write your own gradient descent loop, i.e., with learning rate $\alpha$: $w_i := w_i - \alpha \cdot \frac{dL}{dw_i}$ for each $w_i$, and $b := b - \alpha \cdot \frac{dL}{db}$. You aren’t required to do so, but you did do the hardest parts! (A schematic of such a loop follows this question.)
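For reference, a minimal sketch of what such a loop could look like. The gradient expressions are left as the hypothetical placeholder functions dL_dw and dL_db so as not to give away part (e); everything here is illustrative, not required:

```python
import numpy as np

def dL_db(w, b, x, y):
    # Placeholder: your answer to (e)iv goes here.
    raise NotImplementedError

def dL_dw(w, b, x, y):
    # Placeholder: your answer to (e)v goes here, as a vector over all w_i.
    raise NotImplementedError

def gradient_descent_step(w, b, x, y, alpha=0.1):
    """One update: w_i := w_i - alpha * dL/dw_i for each w_i, b := b - alpha * dL/db."""
    # Evaluate both gradients at the current parameters before updating either.
    grad_w = dL_dw(w, b, x, y)
    grad_b = dL_db(w, b, x, y)
    return w - alpha * grad_w, b - alpha * grad_b
```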
7. Please complete intro.py, which contains some numpy introduction exercises. Insert your answers in the code (see the instructions at the top of the file).
Programming Assignment:
In this assignment, you will be implementing four document classifiers, and applying those classifiers to a
dataset of movie reviews. Your goal will be to predict whether a given movie review is positive or negative
based on the text of the review. You will then evaluate the performance of your classifiers using four metrics.
1. Implement classify_doc_hand_design. This function takes in an input document and checks whether any of the words in the argument valid_words appear in the input. If a word appears in the input, e.g., good, then the word’s corresponding score (e.g., +1) is added to a running total “score” for the document. Once all of the words have been checked, this function returns 1 if the score is greater than zero; otherwise, it returns 0. (A sketch of this logic follows this item.)
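The per-document scoring loop might look something like this minimal sketch; the starter code's actual signature may differ, and representing the scored words as a dict is an assumption:

```python
def classify_doc_hand_design(doc, word_scores):
    """Hand-designed classifier sketch.

    Assumes doc is an iterable of token strings and word_scores maps a word
    (e.g., "good") to its score (e.g., +1); the starter code's actual
    valid_words argument may be structured differently.
    """
    score = 0
    doc_words = set(doc)
    for word, word_score in word_scores.items():
        if word in doc_words:
            score += word_score
    return 1 if score > 0 else 0

# Example: "good" (+1) appears and nothing negative does, so predict 1.
print(classify_doc_hand_design(["a", "good", "movie"], {"good": 1, "bad": -1}))
```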
2. Implement a Naive Bayes classifier. My solution splits this process into two functions: one that precomputes probabilities for each word, i.e., it returns two dictionaries, one mapping a word to its smoothed probability in the positive class, and one mapping a word to its smoothed probability in the negative class. Additionally, it returns the number of positive and negative training documents. Next, I wrote classify_doc_naive_bayes, which takes in the output of the probability-computing function and computes the summed log probability of each class (according to the Naive Bayes assumption). (See the sketch after this item.)
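One way the two-function split could look, assuming add-one (Laplace) smoothing over word counts; the actual starter-code signatures and smoothing scheme may differ:

```python
import math
from collections import Counter

def precompute_probs(pos_docs, neg_docs, vocab):
    """Returns smoothed P(word | class) dicts plus the document counts.
    Add-one smoothing over raw word counts is an assumption."""
    pos_counts = Counter(w for doc in pos_docs for w in doc)
    neg_counts = Counter(w for doc in neg_docs for w in doc)
    pos_total, neg_total = sum(pos_counts.values()), sum(neg_counts.values())
    V = len(vocab)
    pos_probs = {w: (pos_counts[w] + 1) / (pos_total + V) for w in vocab}
    neg_probs = {w: (neg_counts[w] + 1) / (neg_total + V) for w in vocab}
    return pos_probs, neg_probs, len(pos_docs), len(neg_docs)

def classify_doc_naive_bayes(doc, pos_probs, neg_probs, n_pos, n_neg):
    """Predict 1 (positive) or 0 (negative) via summed log probabilities."""
    log_pos = math.log(n_pos / (n_pos + n_neg))  # class prior
    log_neg = math.log(n_neg / (n_pos + n_neg))
    for w in doc:
        if w in pos_probs:  # skip out-of-vocabulary words
            log_pos += math.log(pos_probs[w])
            log_neg += math.log(neg_probs[w])
    return 1 if log_pos > log_neg else 0
```

Working in log space avoids numeric underflow from multiplying many small probabilities together.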
3. Read over the function get_logistic_regression, which gets an sklearn model that performs logistic regression. Note that we could have trained our own logistic regression model using gradient descent, but this is much easier, and has some fancy features! Make sure you understand this code: check out the sklearn documentation for LogisticRegressionCV, StandardScaler, and Pipeline. There are some questions in the comments; please write python code that prints out the answers to them (or, in one case, write a short answer in the comments). (A sketch of such a pipeline follows this item.)
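A guess at what such a pipeline might roughly contain, using the three sklearn classes named above; the starter code's actual hyperparameters and structure may differ:

```python
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def make_logistic_regression_pipeline():
    # A Pipeline chains preprocessing and the model so that calling
    # .fit()/.predict() runs both stages in order.
    return Pipeline([
        # Standardize each feature to zero mean / unit variance...
        ("scaler", StandardScaler()),
        # ...then fit logistic regression, choosing the regularization
        # strength by internal cross-validation (the "CV" in the name).
        ("logreg", LogisticRegressionCV(cv=5)),
    ])
```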
4. Implement classify_doc_logistic_regression, which takes in a document, the vocabulary mapping words to indices, and the logistic regression model from get_logistic_regression_model; the output is the prediction of the model on the input document (sklearn’s model.predict(x) will be helpful here). (See the sketch after this item.)
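A minimal sketch, assuming a bag-of-words count featurization; the starter code's actual featurization may differ:

```python
import numpy as np

def classify_doc_logistic_regression(doc, vocab, model):
    """doc: iterable of token strings; vocab: dict mapping word -> column index;
    model: a fitted sklearn estimator. Bag-of-words counts are an assumption."""
    x = np.zeros((1, len(vocab)))  # one row: a single document
    for word in doc:
        if word in vocab:          # ignore out-of-vocabulary words
            x[0, vocab[word]] += 1
    return model.predict(x)[0]     # model.predict returns an array of predictions
```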
5. Implement the evaluation functions get_accuracy, get_precision, get_recall, and get_f1. These functions take in a numpy array of true labels and a numpy array of predicted labels, and compute the corresponding evaluation metric. While it’s not required, you should try to use vectorization as much as possible here. None of my implementations have an explicit python for loop, and none of them exceed 4 lines of code. You don’t need to be that efficient, but I thought I would share, for reference! (A vectorized accuracy example follows this item.)
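To illustrate the vectorized style, here is a sketch of accuracy only; the remaining three metrics are left for you:

```python
import numpy as np

def get_accuracy(y_true, y_pred):
    # Elementwise comparison yields a boolean array; its mean is the
    # fraction of predictions that match the true labels.
    return np.mean(y_true == y_pred)

print(get_accuracy(np.array([1, 0, 1, 1]), np.array([1, 0, 0, 1])))  # 0.75
```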
6. What accuracy, precision, recall, and f1 scores did your models achieve? Report your results in a results table like this (included are my numbers, for reference... but don’t just copy my results! I will be checking your code to make sure it runs!).

                                accuracy   precision   recall     f1
   Constant Prediction            74.90       N/A       N/A       N/A
   Hand-designed classifier       64.60      30.57     32.27     31.40
   Naive Bayes                    83.30      67.36     64.94     66.13
   Logistic Regression            85.70      78.42     59.36     67.57
[1] https://www.innocenceproject.org/
[2] https://en.wikipedia.org/wiki/Blackstone%27s_ratio