Starting from:

$30

Assignment 7 Naive Bayes, Perceptrons

Artificial Intelligence Assignment 7
Download assignment7.zip from Canvas. There are four problems worth a total of 195 points plus 25 extra credit points for comp440
and 220 points for comp557. Problems 1 and 4 require written work only, Problems 2 and 3 require Python code in two folders and a writeup. All written work should be placed in a file called
writeup.pdf with problem numbers clearly identified. All code should be included at the labeled
points in submission2.py and submission3.py in the two code folders. For Problems 2 and 3,
please run the autograder using the command line python grader.py in each folder, and report the
results in your writeup. Please upload writeup.py, submission2.py and submission3.py onto
Canvas by the due date/time.
Problem 1: Naive Bayes, Perceptrons, Decision Trees, Neural Networks (45 points + 25 EC points for comp440/required points for
comp557)
The data set below is a subset of a census database (the full data is available at
ftp://ftp.ics.uci.edu/pub/machine-learning-databases/adult). Your goal is to predict whether
an individual earns more or less than 50K a year based on education, gender and citizenship. Education is a discrete-valued attribute which can take one of three values; BS, MS, PhD; gender takes
one of two values: male, female; and citizenship has two values: US and nonUS. You will use four
different algorithms to learn models over this training data set.
Education Gender Citizenship Income
BS male US ≤50K
MS male nonUS 50K
BS female US ≤50K
PhD male nonUS ≤50K
MS female US 50K
PhD female nonUS ≤50K
BS male US ≤50K
PhD male US 50K
BS female nonUS ≤50K
PhD female US 50K
Test your models on the following test set.
Education Gender Citizenship
PhD male US
PhD male nonUS
MS female nonUS
1
2 a. Naive Bayes (10 points)
• (7 points) Estimate the priors for each income class and the class conditional probabilities for
each of the attributes in the training data. That is, calculate P(≤50K), P(50K), and fill in
the following conditional probability tables.
Education P(Education|≤50K) P(Education| 50K)
BS
MS
PhD
Gender P(Gender|≤50K) P(Gender| 50K)
male
female
Citizenship P(Citizenship|≤50K) P(Citizenshipr| 50K)
US
nonUS
• (3 points) How would the Naive Bayes classifier you estimated in the previous step classify the
income of the individuals in the test set? That is, fill in the income column of the following
table with values ≤50K, and 50K.
Education Gender Citizenship Income
PhD male US
PhD male nonUS
MS female nonUS
b. The perceptron algorithm (10 points)
• (1 point) How would you encode the training data as inputs to a perceptron?
• (8 points) Initialize all weights to zero and perform one pass of perceptron training on the
training set, doing updates in the order in which the observations are presented. Present your
result in a table, showing the weight vector after processing each example.
• (1 point) Will the perceptron converge if you run it long enough? Justify your answer.
c. Decision Trees (10 points)
• (3 points) Calculate the information gain of each of the three features. Which feature should
be chosen as the root of the decision tree?
3 • (4 points) Now continue with the decision tree construction process with the root attribute
chosen above. Determine the attribute to split on for the next level of nodes in the decision
tree. Show your information gain calculations. Continue the construction process till all the
leaf nodes have instances belonging to the same Income class. Show the final (unpruned) tree
that you have constructed.
• (3 points) What does your decision tree predict on the three examples in the test set? That
is, fill out the following table.
Education Gender Citizenship Income
PhD male US
PhD male nonUS
MS female nonUS
d. Neural networks (15 points)
• (2 points) How would you encode the data as inputs to a feedforward neural network?
• (9 points) Code this example with sklearn’s MLP implementation, and train the network on
the given training set. Documentation is available at
http://scikit-learn.org/stable/modules/neural networks supervised.html You will
need to set up two numpy arrays for training trainX and trainy corresponding to the 10
training examples. You will also need to set up the test set as the array testX. You will
have to make a choice of the number of hidden units. Try the range 2, . . . , 5 and report the
cross-validated training error.
• (4 points) What are the predictions made on the test set? Compare the classifications of the
test cases made by the four classifiers and comment on their similarities and differences.
e. Using the full data set (25 points)(optional for comp440/required for comp557)
Download the full data from ftp://ftp.ics.uci.edu/pub/machine-learning-databases/adult.
Your goal is to predict whether an individual earns more or less than 50K a year based on education,
gender, citizenship as well as other features in the full data set. Use sklearn to build naive Bayes,
perceptrons, decision trees and multi layer neural networks on this data set. How predictable is
income from these features? Produce a table with columns learning-method, features used, training
error, test error and comment on your results.
Problem 2: Text classification (70 points)
According to Microsoft, circa 2007, 97% of all email is unwanted spam. Fortunately, most of us
avoid the majority of these emails because of well-engineered spam filters. In this assignment, we
will build a text classification system that, despite its simplicity, can identify spurious email with
impressive accuracy. You will then apply the same techniques to identify positive and negative
product reviews and to classify email posts into topical categories.
Problem 2.1: Spam classification (40 points) 4
Let us start by building a simple rule-based system to classify email as spam or ham. To test our
system, we will be using a corpus of email made publicly available after the legal proceedings of the
Enron collapse; within the data/spam-classification/train directory, you will find two folders
spam and ham. Each folder contains a number of full text files that contain the whole email without
headers and have been classified as spam and ham respectively.
In order to implement your own spam classifier, you will subclass Classifier and implement the
classify method appropriately in submission2.py. As usual, grader.py contains a number of
simple test cases for your code. To run, type python grader.py on the command line.
Additionally, we have provided a script main.py to help you interactively run your code with different parameters; be ready to change this script to use alternate features, print different statistics,
etc. It is really just meant to help you get started. To use this script, type python main.py
part<part, using the section number (2.1, 2.2, etc.). python main.py part<part -h lists additional parameters you might find useful. The script will output the classification error rate as
well as a confusion matrix for the training and development set. Each cell of the confusion matrix,
(row, column), of the matrix contains the number of elements with the true label row that have
been classified as column.
Problem 2.1.1: Rule-based system (10 points)
• (5 points) data/spam-classification/blacklist.txt contains a list of words sorted by
descending spam correlation. Implement RuleBasedClassifier by discarding any email
containing at least one word from this list. It is recommended you store the blacklist as a
set to reduce lookup time. For comparison, our system achieved a dev error rate of about
48% on the training set with this heuristic. You can test this by running python main.py
part3.1.1 at the command line. If you run python grader.py you should get something
like this (your times may vary).
========== START GRADING
----- START PART 2.1.1a-0
----- END PART 2.1.1a-0 [took 0:00:00.032256, 2/2 points]
----- START PART 2.1.1a-1
trainErr: 0.474339810663
devErr: 0.480730223124
----- END PART 2.1.1a-1 [took 0:00:08.852822, 3/3 points]
• (5 points) Relax this heuristic by only discarding email that contains at least n or more of
the first k listed words. You can test this by running python main.py part2.1.1 -n 1 -k
10000 at the command line for n = 1 and k = 10000. Report your dev error results on the
training set in a 3x3 table with n = 1, 2, 3 and k = 10000, 20000, 30000.
Problem 2.1.2: Linear classifiers (10 points) 5
As you have observed, this naive rule-based system is quite ineffective. A reasonable diagnosis is
that each word, whether it is viagra or sale, contributes equally to the ’spaminess’ of the email.
Let us relax this assumption by weighing each word separately.
Consider producing a spaminess score, fw(x) for each email document x which sums the ’spaminess’
scores for each word in that document;
fw(x) = X
L
i=1
w(wordi)
where L is the length of the document x, and w is the ’spaminess’ score for word wordi
. We can
use a linear classifier that will classify the email as spam if fw(x) ≥ 0, and as ham, otherwise.
Note that the order of words does not matter when computing the sum fw(x). Thus, it is convenient
to represent a document as a sparse collection of words. An ideal data structure for this is a hash
map or dictionary. For example, the document text The quick dog chased the lazy fox over
the brown fence, would be represented using the dictionary, { brown: 1, lazy: 1, fence: 1, fox:
1, over: 1, chased: 1, dog: 1, quick: 1, the: 2, The: 1}.
Let the above vector representation of a document be φ(x); in this context, the vector representation
is called a feature. The use of individual words or unigrams is a common choice in many language
modelling tasks. With this vector or feature representation, the spaminess score for a document is
simply the inner product of the weights and the document vector fw(x) = w.φ(x).
Let the positive label be represented as a 1. Then, we can write the predicted output ˆy of the
classifier as
yˆ =
(
+1 if w.φ(x) ≥ 0
−1 if w.φ(x) < 0
• (5 points) Implement a function extractUnigramFeatures that reads a document and returns
the sparse vector φ(x) of unigram features. If you run python grader.py, you should see
the following corresponding to a correct implementation of this part.
----- START PART 2.1.2a
----- END PART 2.1.2a [took 0:00:00.000072, 5/5 points]
• (5 points) Implement the classify function in WeightedClassifier. You can test whether
you have implemented this correctly by running python grader.py – you should see:
----- START PART 2.1.2b
----- END PART 2.1.2b [took 0:00:00.000055, 5/5 points]
Problem 2.1.3: Learning to distinguish spam (20 points)
The next question we need to address is where the vector of spaminess scores, w, comes from.
Our prediction function is a simple linear function, so we will use the perceptron algorithm to
6 learn weights. The perceptron algorithm visits each training example and incrementally adjusts
weights to improve the classification of the current labelled example (x, y). If ˆy is the prediction
the algorithm makes with the current set of weights w
(t)
, i.e., ˆy = I(w
(t)
.φ(x) ≥ 0), then if ˆy 6= y,
increment w
(t) by y × φ(x).
• (10 points) Implement learnWeightsFromPerceptron that takes as input a corpus of training
examples, and returns the w learned by the perceptron algorithm. Initialize your weights
uniformly with 0.0. If you run python grader.py you should see the following:
----- START PART 2.1.3a-0
----- END PART 2.1.3a-0 [took 0:00:00.047040, 0/0 points]
----- START PART 2.1.3a-1
trainErr: 0.00896860986547
devErr: 0.0486815415822
----- END PART 2.1.3a-1 [took 0:00:06.147842, 10/10 points]
Our reference implementation with unigram features had an error rate of about 0.048 on the
development set.
• (5 points) So far, we have looked at scores for single words or unigrams. We will now consider
using two adjacent words, or bigrams as features by implementing extractBigramFeatures.
To handle the edge case of a word at the beginning of a sentence (i.e. after a punctuation like
’.’, ’!’ or ’?’), use the token -BEGIN-. On the previous example, The quick dog chased the
lazy fox over the brown fence, extractBigramFeatures would return, {the brown: 1,
brown: 1, lazy: 1, fence: 1, brown fence: 1, fox: 1, over: 1, fox over: 1, chased: 1, dog:
1, lazy fox: 1, quick dog: 1, The quick: 1, the lazy: 1, chased the: 1, quick: 1, the:
2, over the: 1, -BEGIN- The: 1, dog chased: 1}.
Run python grader.py and you should see the following:
----- START PART 2.1.3b-0
----- END PART 2.1.3b-0 [took 0:00:00.000146, 5/5 points]
• (5 points) Vary the number of examples given to the training classifier in steps of 500 from
500 to 5,000. Provide a table of results showing the training and development set classification error when using bigram features. How did the additional features help the training
and development set error? You can run python main.py part2.1.3 to get the results for
producing the table.
Problem 2.2: Sentiment classification (10 points)
You have just constructed a spam classifier that can identify spam with a 96% accuracy. While this
in itself is great, what’s really impressive is that the same system can easily be used to learn how to
tackle other text classification tasks. Let us look at something that is completely different; identifying positive and negative movie reviews. We have provided a dataset in data/sentiment/train,
with labels pos and neg.
7 • (5 points) Use the perceptron learning algorithm you wrote in the previous section to classify
positive and negative reviews. Report the training and development set error rate for unigram
features and bigram features. You can run python main.py part2.2 to get these results.
You can modify the function part2 in main.py if you want to change parameters.
• (5 points) Make a table of the training and development set classification errors as you vary
the number of iterations of the perceptron algorithm from 1 to 20. Use bigram features for
this part. Use main.py for this part as before. Does development set error rate monotonically
decrease with iteration number? Why or why not? You might find it useful to write another
version of learnWeightsFromPerceptron that prints training and development error in each
iteration. Optionally, you might try plotting the errors to visually see how the two errors
behave. We recommend the matplotlib library. Include these plots in your writeup if you
make them.
Problem 2.3: Document categorization (20 points)
Finally, let’s see how this system can be generalized to tasks with multiple labels. We will apply our
knowledge to the task of categorizing emails based on topics. Our dataset in data/topics/train
contains a number of emails from 20 popular USENET groups that have been segregated into 5
broader categories.
• (15 points) A common approach to multi-class classification is to train one binary classifier
for each of the categories in a ”one-vs-all” scheme. Namely, for the binary classifier i, all
examples labelled i are positive examples, and all other examples are negative. When using
this classifier for prediction, one uses the class label with the largest score, fi(x). Implement
the OneVsAllClassifier classifier (and MultiClassClassifier). In order to train this
classifier, you will also need to implement learnOneVsAllClassifiers that will appropriately
train binary classifiers on the provided input data.
If you run python grader.py, you will see the following:
----- START PART 2.3a-0
----- END PART 2.3a-0 [took 0:00:00.000141, 15/15 points]
• (5 points) Report the train and development set error rate for unigram features and bigram
features in your writeup. You can generate these by modifying function part3 in main.py.
If you run python main.py part2.3 with our default part3, you should see a development
error rate of about 12%.
Problem 3: Image classification (60 points)
In this assignment, you will implement an image classifier that distinguishes birds and airplanes.
We will use the perceptron algorithm as a starting point but what should the features (arguably
the most important part of machine learning) be? We will see that rather than specifying them
by hand, as we did for spam filtering, we can actually learn the features automatically using the
8
Figure 1: The sixteen patches of size 8x8 pixels corresponding to an image of size 32x32 pixels.
K-means clustering algorithm. We will be working with the CIFAR-10 dataset, one of the standard
benchmarks for image classification. We assume that your version of Python has numpy and PIL
packages already installed. If you do not have these packages, you can install them fairly easily
from distributions downloadable from the Web. If you use Enthought Python, these packages come
pre-installed.
Warmup
Now we will get you familiar with the classification task. You will be classifying a 32 × 32 image
as either a bird ( y = 1) or a plane (y = 0). One way to do this is to train a classifier on the raw
pixels, that is, φ(x) ∈ <32×32 vector where each component of the vector represents the intensity
of a particular pixel. Try running this classifier with
python run.py --pixels
As you can see, these features do not work very well, the classifier drastically overfits the training
set and as a result barely generalizes better than chance.
The problem here is that pixel values themselves are not really meaningful. We need a higher-level
representation of images. So we resort to the following strategy: Each image is a 32 × 32 grid of
pixels. We will divide the image into sixteen 8 × 8 ”patches”. Now comes the key part: we will
use K-means to cluster all the patches into centroids. These centroids will then allow us to use a
better feature representation of the image which will make the classification task much easier.
9
Figure 2: 20 centroids learned from K-means on patches from the first 1000 training images..
Implementing the K-means algorithm (20 points)
In this part of the assignment you will implement K-means clustering to learn centroids for a given
training set. Specifically you should fill out the method runKMeans in the file submission3.py.
Let D = {x1, . . . , xn} be a set of data, where xi ∈ <d
. We will construct K clusters with means
µ1, . . . , µK where µj ∈ <d
.
• Initialize a set of means µ1, µ2, . . . , µK.
• for i in 1..maxIter
– Assignment step: assign each xi to the closest cluster mean zi ∈ {µ1, . . . , µK}.
zi = argmink||xi − µk||2
– Update step: given the cluster assignments zi
, recalculate the cluster means µ’s.
muk =
1
P
zi=k
1
X
zi=k
xi
We start you off by initializing K-means with random centroids where each floating point number
is chosen from a normal distribution with mean 0 and standard deviation 1. Test the K-means
code with the provided test utility in grader.py by running:
python grader.py
Optional: One way to determine if your K-means algorithm is learning sensible features is to view
the learned centroids using our provided utility function. To view the first 25 learned centroids,
run
10 python run.py --view
Your centroids should look similar to Figure 2. Notice how the centroids look like edges. Important
note on patches: Images are composed of patches which have been converted to gray-scale followed
by standard image preprocessing. This includes normalizing them for luminance and contrast as
well as further whitening.
Feature Extraction (20 points)
The next step in the pipeline is to use the centroids to generate features that will be more useful
for the classification task then the raw pixel intensities. One way to do this is to represent each
patch by the distance to the centroids that are closest to it, the intuition being that we can encode
a patch by the centroids to which it is most similar. We will map each image x to its new feature
vector φ(x) ∈ <16k
, where there is a real value for each patch, centroid combination.
Let pij ∈ <64 be the ij-th patch of x where i, j = 1, . . . , 4. The relative activation, aijk, of centroid
µk by patch pij is defined to be the average Euclidean distance from pij to all centroids minus the
Euclidean distance from pij to µk. The Euclidean distance is the L2 norm, ||v||2 =

v
T v.
aijk =
1
K


X
K
k
0=1
||pij − µk
0 ||2

 − ||pij − µk||2
The feature value for patch pij and centroid µk is the max of the relative activation and zero.
φijk(x) = max(aijk, 0)
Implement the function extractFeatures in submission3.py. We will use these features in the
linear classifier below.
Supervised Training (20 points)
The final task is to use our better feature representation to classify images as birds or planes. We
have provided you with a working Perceptron classifier which does fairly well. You will implement
the logistic and hinge loss updates to learn the parameters and see if either of these improves
classification accuracy and which does better. First you can run the Perceptron classifier to see
how it performs on the test set:
python run.py --gradient=perceptron
You should see a test accuracy result between 60%-65% - we can do better!
• (10 points) Implement the logisticGradient method in submission3.py. You can test this
method with python grader.py. Given example (φ(x), y) where φ(x) ∈ <d and y ∈ {0, 1},
and parameter vector w, define yy = 2 ∗ y − 1 to map {0, 1} to {−1, 1}. The logistic loss
function is
Losslogistic(w, φ(x), yy) = log(1 + e
−wT φ(x)∗yy)
Compute the gradient of this function with respect to 11 w and complete the logisticGradient
method.
• (10 points) Implement the hingeLossGradient method in submission3.py. The hinge loss
function is
Losshinge(w, φ(x), yy) = max(1 − w
T φ(x) ∗ yy, 0)
Compute the gradient of this function with respect to w and complete the hingeLossGradient
method. You can test this method with python grader.py.
Now you are ready to run the entire pipeline and evaluate performance. Run the full pipeline with:
python run.py --gradient=<loss function
The loss function can be any of {”hinge”,”logistic”,”perceptron”}. The output of this should be
the test accuracy on 1000 images. One benefit of using an unsupervised algorithm to learn features
is that we can train the algorithm with unlabeled data which is much more easily obtained than
labeled data. We have included an option to train the K-means algorithm using 500 extra unlabeled
images. You can run this with
python run.py --gradient=hinge -m
The performance of the supervised classifier goes up even though we are not using labeled data –
a major benefit to using an unsupervised algorithm to learn a feature representation!
Numpy cheat sheet
This is a quick overview of the functions you will need to complete this assignment. A really
thorough introduction is at http://wiki.scipy.org/Tentative NumPy Tutorial.
All standard operations on vectors, matrices and n-dimensional arrays are element-wise in numpy.
For example
A = numpy.random.randn(5,5)
B = numpy.random.randn(5,5)
A+B # element-wise addition
A*B # element-wise multiplication
You can also access individual elements, columns and rows of A using the following notation,
A[0,0] # first element of the first row of A
A[:,1] # second column of A
A[3:5,:] # fourth and fifth rows of A
In order to take the matrix product of A and B use the numpy.dot function.
numpy.dot(A,B) # matrix multiplication A*B
A.dot(B) # same as above
To find the minimum or maximum element in an array or along a specific axis of an array (rows or 12
columns for 2D), use numpy.minimum or numpy.maximum
numpy.minimum(A) # min of A
numpy.minimum(A, axis=0) # min of each column of A
numpy.maximum(A, axis=1) # max of each row of A
To take the indices of the minimum element in each row or column of a 2D array use the numpy.argmin
function and specify the axis
numpy.argmin(A,axis=0) # argmin of each column
Similarly you can take the mean along the columns or rows of a 2D array using numpy.mean and
the sum along a specific axis using numpy.sum
numpy.mean(A,axis=0) # mean of each column of A
numpy.sum(A,axis=1) # sum of each row of A
Problem 4: Relating Naive Bayes classifiers and perceptrons (20
points)
Consider a naive Bayes classifier with binary-valued features, i.e. fi ∈ {0, 1} for a two class problem
{+1, −1}. Prove that it can be represented by a perceptron with weights wi
, i = 0, 1, . . . , n such that
the decisions made by the naive Bayes classifier are the same as the ones made by the perceptron.
The weights of the perceptron should be expressed in terms of the naive Bayes probabilities: P(y),
P(fi
|y = 1), and P(fi
|y = −1). Assume that all these probabilities are non-zero.

More products