
ECE368: Probabilistic Reasoning
Lab 1: Classification with Multinomial and Gaussian Models
1 Naïve Bayes Classifier for Spam Filtering
In the first part of the lab, we use a Naïve Bayes classifier to build a spam email filter based on whether
and how many times each word in a fixed vocabulary occurs in the email. Suppose that we need to classify
a set of N emails, and each email n is represented by {xn, yn}, n = 1, 2, . . . , N, where yn is the class label
which takes the value
yn = 1 if email n is spam; yn = 0 if email n is non-spam (also called ham), (1)
and xn is a feature vector of the email n. We use a multinomial model to construct the feature vector xn.
Let W = {w1, w2, . . . , wD} be the set of the words (called the vocabulary) that appear at least once in the
training set. The feature vector xn is defined as a D-dimensional vector xn = [xn1, xn2, . . . , xnD], where
each entry xnd, d = 1, 2, . . . , D is the number of occurrences of word wd in email n. Thus the total number
of words in email n can be expressed as ln = xn1 + xn2 + . . . + xnD.
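As a concrete illustration of this feature construction, here is a minimal sketch (the function and variable names are hypothetical, not from classifier.py) that maps a tokenized email to its count vector:

```python
from collections import Counter

def count_vector(tokens, vocabulary):
    """Map a tokenized email to its D-dimensional count vector xn.

    tokens: list of words in the email (illustrative input format);
    vocabulary: list [w1, ..., wD] built from the training set.
    """
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]

vocab = ["dollar", "winner", "meeting"]
x = count_vector(["winner", "dollar", "winner"], vocab)
# x = [1, 2, 0]; the email length ln is sum(x) = 3
```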
We assume that each email n of length ln is generated by a sequence of ln independent events that randomly
draw words from the vocabulary W. (This is known as the naïve Bayes assumption.) For each event,
let p(wd | yn = 1) be the probability that word wd is picked, given that the email belongs to spam; let
p(wd | yn = 0) be the probability that word wd is picked, given that the email belongs to ham. Note that
p(wd | yn = 1) and p(wd | yn = 0) are different, which gives us a way to classify spam vs. ham. For example,
words like “dollar”, “winner” would be more likely to occur in spam than in ham. Also, note that both
p(wd | yn = 1), d = 1, 2, . . . , D and p(wd | yn = 0), d = 1, 2, . . . , D should sum to one, i.e.,
∑_{d=1}^{D} p(wd | yn = 1) = 1, (2)
∑_{d=1}^{D} p(wd | yn = 0) = 1. (3)
The probabilities p(wd | yn = 1), p(wd | yn = 0), d = 1, . . . , D should be learned from the training data.
We make use of the word frequencies to model each email n probabilistically. Since each word in the email
is seen as independently drawn from the vocabulary W, the distribution of the feature vector xn given label
yn can be seen as a multinomial distribution as follows,
p(xn | yn) = [(xn1 + xn2 + . . . + xnD)! / ((xn1)! (xn2)! . . . (xnD)!)] ∏_{d=1}^{D} p(wd | yn)^{xnd}. (4)
We assume that the prior class distribution p(yn) is modeled as
p(yn = 1) = π, (5)
p(yn = 0) = 1 − π, (6)
where π is a fixed parameter (e.g., 0.5).
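Equation (4) can be evaluated directly from the counts; the following sketch (a hypothetical helper, not part of the lab code) computes the multinomial probability for a small example:

```python
import math

def multinomial_pmf(x, p):
    """Evaluate equation (4): the multinomial probability of count
    vector x under per-word probabilities p (both of length D)."""
    n = sum(x)                      # total number of words, ln
    coef = math.factorial(n)        # multinomial coefficient numerator
    for xd in x:
        coef //= math.factorial(xd)
    prob = coef
    for xd, pd in zip(x, p):
        prob *= pd ** xd            # p(wd | yn) ** xnd
    return prob

# Sanity check: with D = 2 this reduces to the binomial distribution.
print(multinomial_pmf([2, 1], [0.5, 0.5]))  # 3 * 0.5**3 = 0.375
```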
In the following, we first estimate the probabilities p(wd | yn = 1), p(wd | yn = 0), d = 1, . . . , D using the
training set; we then build a classifier based on Bayes’ rule and make predictions on the testing set.
Download classifier.zip under Modules/Lab1/ on Quercus and unzip the file. The spam emails for training are
in the subfolder /data/spam/. The ham emails for training are in the subfolder /data/ham/. The unlabeled
emails for testing are in the subfolder /data/testing/.
Please answer the questions below and complete the routines in file classifier.py. File util.py contains a few functions/classes that will be helpful in writing the code for the classifier.
Questions
1. Training. We estimate the conditional probability distribution of the D-ary random variable as specified
by p(wd | yn = 1) and p(wd | yn = 0), d = 1, . . . , D, from the training data using a bag-of-words model
as follows. For notational simplicity, we define pd = p(wd | yn = 1) and qd = p(wd | yn = 0).
(a) We put all the words from the spam emails in the training set in a bag and simply count the
number of occurrences of each word wd, d = 1, · · · , D. We do the same for ham emails. The
maximum likelihood estimates of pd and qd based on these counts are not the most appropriate to
use when the probabilities are very close to 0 or to 1. For example, some words that occur in one
class may not occur at all in the other class. In this problem, we use the technique of “Laplace
smoothing” to deal with this problem. Please write down such an estimator for pd and qd as
functions of the training data {xn, yn}, n = 1, 2, . . . , N using Laplace smoothing for the D-ary
random variable.
(b) Complete the function learn distributions in file classifier.py. In learn distributions, you first build
the vocabulary {w1, . . . , wD} by accounting for all the words that appear in the training set at
least once; you then estimate pd and qd, d = 1, 2, . . . , D using your expressions in part (a).
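For intuition, the generic add-one (Laplace) smoothing recipe for an arbitrary D-ary variable looks like the sketch below; your answer to part (a) should instead express the counts in terms of the training data {xn, yn}. The function name and the alpha parameter are illustrative only:

```python
def laplace_smooth(counts, alpha=1):
    """Add-alpha (Laplace) smoothed estimate of a D-ary distribution.

    counts[d] is the number of times outcome d was observed; alpha = 1
    gives classic add-one smoothing."""
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]

# An outcome with zero observed count still gets nonzero probability:
print(laplace_smooth([5, 0, 3]))  # ≈ [0.545, 0.091, 0.364]; middle entry is 1/11, not 0
```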
2. Testing. We classify the unlabeled emails in /data/testing/ using the trained classifier.
(a) Let {x, y} be a data point from the testing set whose class label y is unknown. Write down the
maximum a posteriori (MAP) rule to decide whether y = 1 or y = 0 based on the feature vector
x. The d-th entry of x is denoted by xd. Please incorporate pd and qd in your expression. Please
think carefully how to treat words that do not appear in either ham or spam training sets. Please
assume that π = 0.5.
(b) Complete the function classify new email in file classifier.py to implement the MAP rule, and run
it on the testing set. There are two types of errors in classifying unlabeled emails: Type 1 error
is defined as the event that a spam email is misclassified as ham; Type 2 error is defined as the
event that a ham email is misclassified as spam. Write down the numerical values of these two
numbers of errors made by your classifier on the testing data. To avoid numerical underflow in
your code, please work with the log probability log p(y|x) in your code.
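To make the log-domain computation concrete, here is a minimal sketch of the spam/ham log-posterior comparison (the argument names p, q, pi are hypothetical, not from classifier.py):

```python
import math

def log_posterior_ratio(x, p, q, pi=0.5):
    """Compute log p(y=1|x) - log p(y=0|x) for count vector x, with
    p[d] = pd and q[d] = qd.  The multinomial coefficient in (4) is
    identical for both classes, so it cancels in the ratio."""
    score = math.log(pi) - math.log(1 - pi)
    for xd, pd, qd in zip(x, p, q):
        score += xd * (math.log(pd) - math.log(qd))
    return score  # classify as spam (y = 1) if score > 0

print(log_posterior_ratio([2, 0], [0.7, 0.3], [0.2, 0.8]) > 0)  # True
```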
(c) In practice, Type 1 error and Type 2 error lead to different consequences (or costs). Therefore, we
may wish to trade off one type of error against the other in designing the classifier. For example,
we usually want to achieve a very low Type 2 error since the cost of missing a useful email can
be severe, while we can tolerate a relatively high Type 1 error as it merely causes inconvenience.
Please provide a way to modify the decision rule in the classifier such that these two types of
error can be traded off. In other words, change the decision rule in a way such that Type 2 error
would decrease at a cost of Type 1 error, and vice versa. Test your method on the testing set and
provide the following plot: Let the x-axis be the number of Type 1 errors and the y-axis be the
number of Type 2 errors in the testing data set. Plot at least 10 points corresponding to different
pairs of Type 1 and Type 2 errors, as a result of adjusting the classification rule. The two end
points of the plot should be: 1) the one with zero Type 1 error; and 2) the one with zero Type 2
error. The code should be included in file classifier.py.
(d) Why do we need Laplace smoothing? Briefly explain what would go wrong if we use maximum
likelihood estimation in the training process, by considering a scenario in which a testing email
contains both a word w1 that appears only in the ham training set (but not in the spam training
set), and a word w2 that appears only in the spam training set (but not in the ham training set).
How does Laplace smoothing resolve this issue?
The training and test data for this problem are taken from V. Metsis, I. Androutsopoulos and G. Paliouras,
“Spam Filtering with Naive Bayes – Which Naive Bayes?” Proceedings of the 3rd Conference on Email and
Anti-Spam (CEAS 2006), Mountain View, CA, USA, 2006.
2 Linear/Quadratic Discriminant Analysis for Height/Weight Data
When the feature vector is real-valued (instead of binary), a Gaussian vector model is appropriate. In this
part of the lab, we use linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) for
the height/weight data of people, and visualize the classification results of male and female persons based
on height and weight.
Suppose that the data set contains N samples. Let xn = [hn, wn] be the feature vector, where hn denotes
the height and wn denotes the weight of a person indexed by n. Let yn denote the class label. Here yn = 1
is male, and yn = 2 is female. We model the class prior as p(yn = 1) = π and p(yn = 2) = 1 − π. For this
problem, let π = 0.5.
For the class conditional distributions, let µm be the mean of xn if class label yn is male, and let µf be the
mean of xn if class label yn is female. For LDA, a common covariance matrix is shared by both classes, which
is denoted by Σ; for QDA, different covariance matrices are used for male and female, which are denoted by
Σm and Σf , respectively.
Download ldaqda.zip under Modules/Lab1/ on Quercus and unzip the file. The data set for training is in file
trainHeightWeight.txt, whereas the data set for testing is in file testHeightWeight.txt. Each file uses the same
format to represent the data: the first column corresponds to the class labels, the second column corresponds
to the heights, and the third column corresponds to the weights.
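A minimal way to parse this three-column format (the function name is illustrative; np.loadtxt is one of several options):

```python
import numpy as np

def load_height_weight(path):
    """Read the three-column format described above: column 0 is the
    class label (1 = male, 2 = female), column 1 the height, and
    column 2 the weight."""
    data = np.loadtxt(path)
    y = data[:, 0]       # class labels
    x = data[:, 1:3]     # feature vectors [height, weight]
    return x, y
```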
Please answer the questions below and complete the functions in file ldaqda.py. File util.py contains a few functions/classes that will be useful in writing the code.
Questions
1. Training and visualization. We estimate the parameters in LDA and QDA from the training data in
trainHeightWeight.txt and visualize the LDA/QDA model.
(a) Please write down the maximum likelihood estimates of the parameters µm, µf, Σ, Σm, and Σf
as functions of the training data {xn, yn}, n = 1, 2, . . . , N. The indicator function I(·) may be
useful in your expressions.
(b) Once the above parameters are obtained, you can design a classifier to make a decision on the
class label y of the new data x. The decision boundary can be written as a linear equation of x
in the case of LDA, and a quadratic equation of x in the case of QDA. Please write down the
expressions of these two boundaries.
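As a numeric illustration (all parameter values below are made up, not estimates from the lab data), the decision compares the two class log-densities; setting them equal gives the boundary:

```python
import numpy as np

def gaussian_discriminant(x, mu, sigma):
    """Log of the multivariate Gaussian density at x, dropping the
    shared -log(2*pi) constant; sufficient for comparing classes when
    the priors are equal (pi = 0.5)."""
    diff = x - mu
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (logdet + diff @ np.linalg.solve(sigma, diff))

mu_m = np.array([69.0, 175.0])              # illustrative numbers only
mu_f = np.array([64.0, 135.0])
sigma = np.array([[8.0, 10.0], [10.0, 150.0]])  # shared covariance (LDA)
x = np.array([68.0, 170.0])
label = 1 if gaussian_discriminant(x, mu_m, sigma) > \
             gaussian_discriminant(x, mu_f, sigma) else 2
```

With the shared covariance the two quadratic terms have equal second-order parts, so their difference is linear in x (LDA); with class-specific Σm and Σf the difference stays quadratic (QDA).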
(c) Complete function discrimAnalysis in file ldaqda.py to visualize LDA and QDA. Please plot one
figure for LDA and one figure for QDA. In both plots, the horizontal axis is the height with range
[50, 80] and the vertical axis is the weight with range [80, 280]. Each figure should contain: 1) N
colored data points {xn, n = 1, 2, . . . , N} with the color indicating the corresponding class labels
(e.g., blue represents male and red represents female); 2) the contours of the conditional Gaussian distribution for each class (To create a contour plot, you first need to build a two-dimensional grid for the range [50, 80] × [80, 280] using function np.meshgrid. You then compute the conditional Gaussian density at each point in the grid for each class. Finally, use function plt.contour,
which takes the two-dimensional grid and the conditional Gaussian density on the grid as inputs
to automatically produce the contours.); 3) the decision boundary, which can also be created by
using plt.contour with appropriate contour level.
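The meshgrid/contour steps above can be sketched as follows (the Gaussian parameters are illustrative; use your MLE estimates in ldaqda.py, and the plt.contour calls appear as comments since the density grid is the part that needs care):

```python
import numpy as np

# Build the evaluation grid for the plotting ranges in the handout.
h = np.linspace(50, 80, 100)        # height axis
w = np.linspace(80, 280, 100)       # weight axis
H, W = np.meshgrid(h, w)

# Evaluate a 2-D Gaussian density at every grid point.
mu = np.array([66.0, 160.0])
sigma = np.array([[9.0, 0.0], [0.0, 400.0]])
inv = np.linalg.inv(sigma)
dh, dw = H - mu[0], W - mu[1]
quad = inv[0, 0] * dh**2 + 2 * inv[0, 1] * dh * dw + inv[1, 1] * dw**2
density = np.exp(-0.5 * quad) / (2 * np.pi * np.sqrt(np.linalg.det(sigma)))

# plt.contour(H, W, density) then draws the contour lines, and
# plt.contour(H, W, disc_male - disc_female, levels=[0]) can draw
# the decision boundary as the zero level set.
```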
2. Testing. We test the obtained LDA/QDA model on the testing data in testHeightWeight.txt. Complete
function misRate in file ldaqda.py to compute the misclassification rates for LDA and QDA, defined as
the total percentage of the misclassified samples (both male and female) over all samples.
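The misclassification rate itself reduces to a one-line comparison (a sketch with hypothetical argument names):

```python
import numpy as np

def misclassification_rate(y_true, y_pred):
    """Fraction of samples (male and female pooled) whose predicted
    label differs from the true label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return np.mean(y_true != y_pred)

print(misclassification_rate([1, 1, 2, 2], [1, 2, 2, 2]))  # 0.25
```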
The data for this problem are taken from: K. Murphy, Machine Learning: A Probabilistic Perspective, MIT
Press, 2012.