Homework 01: Text Classification
CS 6501-005 Natural Language Processing
1. Suppose you have a single feature x, with the following conditional distribution:
$$p(x \mid y) = \begin{cases} \alpha, & X = 0,\ Y = 0 \\ 1 - \alpha, & X = 1,\ Y = 0 \\ 1 - \beta, & X = 0,\ Y = 1 \\ \beta, & X = 1,\ Y = 1 \end{cases} \tag{1}$$
Further suppose that the prior distribution is uniform, P(Y = 0) = P(Y = 1) = 0.5, and that both $\alpha > \frac{1}{2}$ and $\beta > \frac{1}{2}$. Given a Naive Bayes classifier with accurate parameters, what is the probability of making an error?
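A derived closed-form answer can be checked numerically. The sketch below is only a sanity check, not a solution: it enumerates both values of x, picks the label with the larger joint probability (which is what a classifier with accurate parameters would do), and sums the probability mass of the other label. The function name `bayes_error` and the values α = 0.8, β = 0.7 are assumptions chosen for illustration.

```python
def bayes_error(alpha, beta, prior_y0=0.5):
    """Error probability of the classifier that predicts argmax_y p(x | y) p(y)."""
    prior = {0: prior_y0, 1: 1.0 - prior_y0}
    # Likelihood table from Equation (1); keys are (x, y).
    likelihood = {
        (0, 0): alpha,
        (1, 0): 1.0 - alpha,
        (0, 1): 1.0 - beta,
        (1, 1): beta,
    }
    error = 0.0
    for x in (0, 1):
        # Joint probability p(x, y) for each candidate label.
        joint = {y: likelihood[(x, y)] * prior[y] for y in (0, 1)}
        predicted = max(joint, key=joint.get)
        # Mass of the label that was NOT predicted is the error at this x.
        error += sum(p for y, p in joint.items() if y != predicted)
    return error


# Hypothetical parameter values satisfying alpha > 1/2 and beta > 1/2.
example_error = bayes_error(alpha=0.8, beta=0.7)
print(example_error)
```

Evaluating your own formula at the same α and β and comparing it with `example_error` is a quick way to catch an algebra slip.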
2. Suppose you have two labeled datasets D1 and D2, with the same feature set and labels.
• Let $\theta^{(1)}$ be the unregularized logistic regression (LR) coefficients from training on dataset D1,
• Let $\theta^{(2)}$ be the unregularized logistic regression (LR) coefficients from training on dataset D2,
• Let $\theta^{*}$ be the unregularized logistic regression (LR) coefficients from training on dataset D1 ∪ D2.
Under these conditions, prove that for any feature j,
$$\min\left(\theta^{(1)}_j, \theta^{(2)}_j\right) \le \theta^{*}_j \le \max\left(\theta^{(1)}_j, \theta^{(2)}_j\right)$$
3. Let $\hat{\theta}$ be the solution to an unregularized logistic regression problem, and let $\theta^{*}$ be the solution to the same problem, with L2 regularization. Prove that $\|\theta^{*}\|_2^2 \le \|\hat{\theta}\|_2^2$.
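Before writing the proof, it can help to see the inequality on a small example. The following is a minimal sketch, assuming plain gradient descent on the mean logistic loss plus a penalty of (l2 / 2) times the squared L2 norm. The toy data, the function name `fit_logistic_regression`, and the values of `l2`, `lr`, and `n_iters` are all assumptions made for illustration. The data is deliberately not linearly separable so that the unregularized solution is finite. A numerical check like this illustrates the claim but does not prove it.

```python
import numpy as np


def fit_logistic_regression(X, y, l2=0.0, lr=0.5, n_iters=5000):
    """Gradient descent on mean logistic loss + (l2 / 2) * ||theta||^2."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        probs = 1.0 / (1.0 + np.exp(-X @ theta))  # sigmoid(X theta)
        grad = X.T @ (probs - y) / len(y) + l2 * theta
        theta -= lr * grad
    return theta


# Hypothetical toy data; the first two rows share features but not labels,
# so no weight vector separates the classes perfectly.
X = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0],
              [2.0, 1.0], [1.0, 2.0], [0.0, 2.0]])
y = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0])

theta_hat = fit_logistic_regression(X, y, l2=0.0)   # unregularized
theta_star = fit_logistic_regression(X, y, l2=0.1)  # L2-regularized

sq_norm_hat = float(theta_hat @ theta_hat)
sq_norm_star = float(theta_star @ theta_star)
print(sq_norm_star, sq_norm_hat)
```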
4. Prove that F-measure is never greater than the arithmetic mean of precision and recall, $\frac{p+r}{2}$. Your solution should also show that F-measure is equal to $\frac{p+r}{2}$ if and only if $p = r$. [Hint: “if and only if” means that you need to prove the statement in both directions. In other words, your solution needs to show, with the definition of F-measure, that both (1) $p = r \Rightarrow F = \frac{p+r}{2}$ and (2) $F = \frac{p+r}{2} \Rightarrow p = r$ hold.]
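The claim can be explored numerically before proving it. The sketch below assumes the balanced F-measure, F = 2pr / (p + r), with F taken to be 0 when p + r = 0. The helper names and the grid of values are hypothetical. It computes the gap between the arithmetic mean and F over a grid of precision and recall values. This is only an illustration, and the proof is still required.

```python
def f_measure(p, r):
    """Balanced F-measure (harmonic mean of precision and recall)."""
    if p + r == 0:
        return 0.0  # convention assumed here for the degenerate case
    return 2 * p * r / (p + r)


def arithmetic_mean(p, r):
    return (p + r) / 2


# Precision/recall values 0.0, 0.1, ..., 1.0 (hypothetical grid).
grid = [i / 10 for i in range(11)]

# Gap between the arithmetic mean and the F-measure at every grid point.
gaps = {(p, r): arithmetic_mean(p, r) - f_measure(p, r)
        for p in grid for r in grid}

largest_gap_point = max(gaps, key=gaps.get)
print(largest_gap_point, gaps[largest_gap_point])
```

Inspecting `gaps` shows where the two quantities coincide and where they differ most, which suggests what the algebra in the proof should produce.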
5. In this assignment, you will build a logistic regression classifier for sentiment classification using the following files:
• trn-reviews.txt: the Yelp reviews in the training set
• trn-labels.txt: the corresponding labels of the Yelp reviews in the training set
• dev-reviews.txt: the Yelp reviews in the development set
• dev-labels.txt: the corresponding labels of the Yelp reviews in the development set
The starting point for building the classifier is the IPython notebook demo.ipynb. The first section of this notebook provides simple code to load the training and development sets. Your work starts from the second section.
• In the second section, you can configure CountVectorizer with different parameter settings, as shown in the two examples in this section.
• In the third section, try different values for the parameters of LogisticRegression.
The task is to find the parameter settings for both CountVectorizer and LogisticRegression that give the best accuracy on the development set. The baseline accuracy is 61.4% with unigram features, and your results should be better than this number.
Your homework submission should include the IPython notebook with the name [Your-ComputingID].ipynb. Please keep only the best parameter setting in the notebook, so we can easily reproduce the results.
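The parameter search described above can be sketched as a small loop. This is a minimal sketch, assuming scikit-learn is installed. It is not the code in demo.ipynb. The helper `dev_accuracy`, the toy reviews and labels, and the candidate values for `ngram_range` and `C` are hypothetical placeholders. In the actual notebook, the training and development data come from the loading code in the first section.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def dev_accuracy(trn_texts, trn_labels, dev_texts, dev_labels,
                 ngram_range=(1, 1), C=1.0):
    """Train on the training set and return accuracy on the development set."""
    vectorizer = CountVectorizer(ngram_range=ngram_range, lowercase=True)
    X_trn = vectorizer.fit_transform(trn_texts)  # learn the vocabulary on training data only
    X_dev = vectorizer.transform(dev_texts)      # reuse the same vocabulary
    clf = LogisticRegression(C=C, max_iter=1000)
    clf.fit(X_trn, trn_labels)
    return accuracy_score(dev_labels, clf.predict(X_dev))


# Hypothetical stand-in data; replace with the loaded Yelp reviews and labels.
trn_texts = ["great food and friendly staff", "terrible service and cold food",
             "loved it great place", "awful experience terrible"]
trn_labels = [1, 0, 1, 0]
dev_texts = ["great place", "terrible food"]
dev_labels = [1, 0]

# Try each combination of candidate settings and record development accuracy.
results = {}
for ngram_range in [(1, 1), (1, 2)]:
    for C in [0.1, 1.0, 10.0]:
        results[(ngram_range, C)] = dev_accuracy(
            trn_texts, trn_labels, dev_texts, dev_labels,
            ngram_range=ngram_range, C=C)

best_setting = max(results, key=results.get)
print(best_setting, results[best_setting])
```

In scikit-learn, `C` is the inverse of the regularization strength, so smaller values regularize more strongly. Other CountVectorizer options such as `min_df` or `max_features` can be added to the loop in the same way.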