Assignment 4: Learning
In this assignment, we’ll consider the problem of classifying text documents using various machine learning approaches.
Part 1: Spam classification
Let’s start by considering a straightforward document classification problem: predicting whether or not an e-mail is spam. We’ll use a bag-of-words model, which means that we’ll represent a document in terms of just an unordered “bag” of words instead of modeling anything about the grammatical structure of the document. In other words, a document can be modeled as simply a histogram over the words of the English language. If, for example, there are 100,000 words in the English language, then a document can be represented as a 100,000-dimensional vector, where the entries in the vector may correspond to binary values (1 if the word appears in the document and 0 otherwise), an integer (e.g., number of times the word appears in the document), or a real number (e.g., ratio of number of times the word appears in the document over total number of words in the document). Of course, most vectors will be sparse (most entries are zero).
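For concreteness, here is a minimal sketch of extracting all three kinds of features from a document's text. It is written in Python (the assignment does not mandate a language), and the tokenizer and function names are our own assumptions, not a required design:

import re
from collections import Counter

def tokenize(text):
    # Deliberately simple: lowercase and split on runs of non-letter characters.
    return [w for w in re.split(r"[^a-z]+", text.lower()) if w]

def bag_of_words(text):
    words = tokenize(text)
    counts = Counter(words)                             # integer features: raw counts
    binary = {w: 1 for w in counts}                     # binary features: word present or not
    total = sum(counts.values()) or 1
    freqs = {w: c / total for w, c in counts.items()}   # real-valued features: relative frequency
    return binary, counts, freqs

Storing only the nonzero entries in dictionaries keeps the representation sparse.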
1. First, implement a Naive Bayes classifier for this problem. For a given document D, we’ll need to evaluate P(S = 1|w1,w2,...,wn), the posterior probability that a document is spam given the features (words) in that document. Make the Naive Bayes assumption, which says that for any i ≠ j, wi is independent of wj given S. Hint: It may be more convenient to evaluate the likelihood (or “odds”) ratio P(S = 1|w1,...,wn) / P(S = 0|w1,...,wn) and compare it to a threshold to decide whether a document is spam or non-spam (see the log-odds sketch after this list).
2. Implement a Decision Tree for this problem, building the tree using the entropy-based greedy algorithm we discussed in class (see the information-gain sketch after this list).
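For item 1, one workable shape of the decision rule (a sketch only; the counting scheme, the Laplace smoothing constant alpha, and the names word_given_spam / word_given_ham are our own assumptions) is to compute the log of the odds ratio, which avoids underflow from multiplying many small probabilities:

import math

def log_odds(doc_words, prior_spam, word_given_spam, word_given_ham, vocab_size, alpha=1.0):
    # Returns log P(S=1|w1..wn) - log P(S=0|w1..wn) under the Naive Bayes assumption.
    # word_given_spam[w] / word_given_ham[w] are training counts of word w in each class.
    score = math.log(prior_spam) - math.log(1.0 - prior_spam)
    n_spam = sum(word_given_spam.values())
    n_ham = sum(word_given_ham.values())
    for w in doc_words:
        p_w_spam = (word_given_spam.get(w, 0) + alpha) / (n_spam + alpha * vocab_size)
        p_w_ham = (word_given_ham.get(w, 0) + alpha) / (n_ham + alpha * vocab_size)
        score += math.log(p_w_spam) - math.log(p_w_ham)
    return score

A document is then classified as spam when the score exceeds the log of your chosen threshold (0 for a threshold of 1).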
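For item 2, the greedy step at each node picks the word whose presence/absence split maximizes information gain. A small helper along these lines (again only a sketch, assuming binary features and our own function names) may help:

import math

def entropy(labels):
    # Shannon entropy of a list of 0/1 class labels.
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def best_split(docs, labels, candidate_words):
    # docs is a list of word sets; return the word with the highest information gain.
    base = entropy(labels)
    best_word, best_gain = None, -1.0
    for w in candidate_words:
        with_w = [y for d, y in zip(docs, labels) if w in d]
        without_w = [y for d, y in zip(docs, labels) if w not in d]
        remainder = (len(with_w) * entropy(with_w)
                     + len(without_w) * entropy(without_w)) / len(labels)
        gain = base - remainder
        if gain > best_gain:
            best_word, best_gain = w, gain
    return best_word, best_gain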
To help you get started, we’ve provided a dataset in your repo of known spam and known non-spam emails, split into a training set and a testing set. For both the Bayesian and Decision Tree problems, train your model on the training data and measure performance on the testing data in terms of accuracy (percentage of documents correctly classified) and a confusion matrix. In general, a confusion matrix is an n×n table, where n is the number of classes (n = 2 in this case), and the entry in row i, column j shows the number of test exemplars whose correct label is i but that were classified as j.
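Both measurements are straightforward once you have predictions. For example (a sketch assuming labels are integer class indices):

def evaluate(true_labels, predicted_labels, n_classes=2):
    # Confusion matrix: row i, column j counts test documents with correct label i
    # that were classified as j; accuracy is the fraction on the diagonal.
    confusion = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        confusion[t][p] += 1
    accuracy = sum(confusion[i][i] for i in range(n_classes)) / len(true_labels)
    return accuracy, confusion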
For each of the two models, consider two types of bag-of-words features: binary features, which simply record whether or not a word occurs in a document, and continuous features, which record the frequency of the word in the document (as a raw count, a percentage, or some other function that you might invent). For the Bayesian classifier, your training program should output the 10 words most associated with spam (i.e. the words with highest P(S = 1|w)) and the 10 words least associated with spam (the words with lowest P(S = 1|w)). For the decision tree, your training program should display a representation (e.g. a simple text visualization) of the top 4 layers of the tree.
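One simple way to produce those word lists (a sketch; spam_doc_count and notspam_doc_count are hypothetical per-word document counts gathered during training) is to estimate P(S = 1|w) as the smoothed fraction of documents containing w that are spam, then sort:

def rank_words(spam_doc_count, notspam_doc_count, k=10):
    # Estimate P(S=1|w) for every word seen in training, with add-one smoothing.
    scores = {}
    for w in set(spam_doc_count) | set(notspam_doc_count):
        s = spam_doc_count.get(w, 0) + 1
        h = notspam_doc_count.get(w, 0) + 1
        scores[w] = s / (s + h)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], ranked[-k:]   # most and least spam-associated words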
Your program should accept command line arguments like this:
./spam mode technique dataset-directory model-file
where mode is either test or train, technique is either bayes or dt, dataset-directory is a directory containing two subdirectories named spam and notspam, each filled with documents of the corresponding type, and model-file is the filename of your trained model. In training mode, dataset-directory will be the training dataset and the program should write the trained model to model-file, and in testing mode dataset-directory will be the testing dataset and your program should read its trained model from model-file. You can structure your model-file format however you’d like. We have not prepared skeleton code this time, so you may also prepare your source code however you’d like.
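A minimal way to read those arguments in Python (argparse and the variable names here are our own choices; any equivalent approach is fine):

import argparse

parser = argparse.ArgumentParser(description="Spam classifier")
parser.add_argument("mode", choices=["train", "test"])
parser.add_argument("technique", choices=["bayes", "dt"])
parser.add_argument("dataset_directory",
                    help="directory containing spam/ and notspam/ subdirectories")
parser.add_argument("model_file",
                    help="file to write (train mode) or read (test mode) the model")
args = parser.parse_args()

if args.mode == "train":
    pass  # fit the chosen model on dataset_directory, then save it to model_file
else:
    pass  # load model_file, classify dataset_directory, print accuracy and a confusion matrix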
In your report, show results for both types of features with each of the two classifiers. Which techniques work best?
Part 2: Topic classification
Now let’s consider a multi-class task: estimating the genre or topic of a document. To get you started, we’ve prepared a dataset of documents organized into twenty broad topics (e.g. religion, finance, etc.). But there’s a twist. It’s often the case that you do not have as much labeled training data as you would like for a particular classification task, but you might have a larger set of unlabeled data. In the extreme case, you may not have any labeled data at all – an unsupervised learning problem. Let’s say we have a dataset of documents 𝒟 = {d1, d2, ..., dn}, a set of topics 𝒯 = {t1, t2, ..., tm}, and a set of words in the English language 𝒲 = {w1, w2, ..., wo}. A given document D ∈ 𝒟 in our corpus has a topic T ∈ 𝒯 and a set of words W ⊆ 𝒲. In our training set, we’ll always be able to observe D (a document ID number) and the set of words W in the document, but may not have T. In the test set, we’ll always be able to observe D and W but never T. Suppose that our corpus can be modeled as a Bayes Net with random variables D, T, and {W1, ..., Wo}, where Wi ∈ {0, 1} indicates whether word wi ∈ 𝒲 appears in the document or not. Variable D is connected to T via a directed edge, and T is in turn connected to each Wi via a directed edge.
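Under that network, classifying a test document amounts to scoring each topic by its prior times the product of the word likelihoods (one reasonable reading is that P(T|D) for an unseen test document ID falls back to the overall topic prior). In log space, a sketch (assuming the conditional probability tables are already estimated and smoothed; the names are our own):

import math

def topic_scores(doc_words, topic_prior, p_word_given_topic):
    # topic_prior[t] = P(T=t); p_word_given_topic[t][w] = P(Wi=1 | T=t).
    # Returns an unnormalized log posterior over topics; the argmax is the prediction.
    scores = {}
    for t, prior in topic_prior.items():
        s = math.log(prior)
        for w, p in p_word_given_topic[t].items():
            s += math.log(p) if w in doc_words else math.log(1.0 - p)
        scores[t] = s
    return scores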
Write a program that accepts command line arguments like this:
./topics mode dataset-directory model-file [fraction]
where mode is either test or train, dataset-directory is a directory containing directories for each of the topics (you can assume there are exactly 20), and model-file is the filename of your trained model. In training mode, an additional parameter fraction should be a number between 0.0 and 1.0 indicating the fraction of labeled training examples that your training algorithm is allowed to see.
In training mode, dataset-directory will be the training dataset and the program should write the trained model to model-file. When your training program is loading in the training data, it should “flip a coin” to decide whether it is allowed to see the class label associated with a given document, where the probability of “yes” is the fraction indicated on the command line. When fraction is 1.0, this is a fully-supervised learning problem that is almost identical to problem 1 of Part 1. When fraction is 0.0, the program sees no labels, which is a fully unsupervised problem. In addition to the trained model, your program should output the 10 words with highest P(T = ti|Wj) for each of the 20 topics to a file called distinctive_words.txt. You can structure your model-file format however you’d like. We have not prepared skeleton code this time, so you may also prepare your source code however you’d like.
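The “flip a coin” step can literally be one call to a random number generator per training document; a sketch (the function name and the convention of returning None for a hidden label are our own):

import random

def visible_label(true_topic, fraction):
    # With probability `fraction` the trainer may see the label;
    # otherwise the document must be treated as unlabeled.
    return true_topic if random.random() < fraction else None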
Hint: To deal with missing data, adopt an iterative approach. Estimate values in the conditional probabilities tables using the portion of the data that is labeled (or randomly, if no data is labeled). Then, estimate the distribution over T for each document D using variable elimination. Now you have “hallucinated” labels for the training set and can re-estimate the conditional probability tables, which can then be used to re-estimate the topic distributions, and so on. Continue iterating until convergence or until you run out of time. :)
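Schematically, the iteration the hint describes might look like the sketch below. It reuses the hypothetical topic_scores helper sketched in the Part 2 introduction above, and estimate_tables, the smoothing constant alpha, and the fixed iteration count are our own assumptions rather than required choices:

import random
from collections import defaultdict

def estimate_tables(docs, guesses, topics, vocab, alpha=1.0):
    # Re-estimate P(T) and P(Wi=1|T) from the current (partly hallucinated) labels.
    topic_docs = {t: 0 for t in topics}
    word_count = {t: defaultdict(float) for t in topics}
    for d, t in zip(docs, guesses):
        topic_docs[t] += 1
        for w in d:
            word_count[t][w] += 1
    n = len(docs)
    topic_prior = {t: (topic_docs[t] + alpha) / (n + alpha * len(topics)) for t in topics}
    p_word_given_topic = {
        t: {w: (word_count[t][w] + alpha) / (topic_docs[t] + 2 * alpha) for w in vocab}
        for t in topics}
    return topic_prior, p_word_given_topic

def train_semi_supervised(docs, labels, topics, vocab, n_iterations=20):
    # docs: list of word sets; labels: topic names, or None where the label is hidden.
    guesses = [y if y is not None else random.choice(topics) for y in labels]
    for _ in range(n_iterations):
        topic_prior, p_word_given_topic = estimate_tables(docs, guesses, topics, vocab)
        for i, (d, y) in enumerate(zip(docs, labels)):
            if y is None:   # only the hidden labels get re-estimated
                scores = topic_scores(d, topic_prior, p_word_given_topic)
                guesses[i] = max(scores, key=scores.get)
    return topic_prior, p_word_given_topic

(With only one hidden topic variable per document, variable elimination here reduces to the direct posterior computation in topic_scores.)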
In testing mode, report performance of the algorithm in terms of accuracy and a confusion matrix. Note that this testing program is nearly functionally equivalent to question 1 of Part 1, so if you make that implementation general enough, you should not have much work to do here. (Similarly, much of the learning code will be very similar to that problem also.) In your report, show the accuracies as a function of fraction; particularly interesting cases are when fraction is 1 (fully supervised), 0 (unsupervised), or low (e.g. 0.1).