Machine Learning
Assignment 4 - Classification
Introduction
In this assignment you will implement a Naive Bayes classifier for binary classification.
You may not use any functions from an ML library in your code, and, as always, your code should
work on any dataset that has the same general form as the provided one.
Grading
Although all assignments will be weighted equally in computing your homework grade, below is the
grading rubric we will use for this assignment:
Part 1 (Theory) 55pts
Part 2 (Naive Bayes) 35pts
Report 10pts
Extra Credit 10pts
TOTAL 110 (of 100) pts
Datasets
Spambase Dataset (spambase.data) This dataset consists of 4601 instances, each with
57 features and a class label designating whether the sample is spam or not. The features are real-valued
and are described in more detail here:
https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.names
Data obtained from: https://archive.ics.uci.edu/ml/datasets/Spambase
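As a quick illustration of the file format (no header, comma-separated, class label in the last column), here is a minimal loading sketch, assuming Python with NumPy; the assignment does not prescribe a language, and the variable names are only illustrative:

import numpy as np

# spambase.data: 57 comma-separated real-valued features per row,
# followed by a 0/1 class label; there is no header row.
data = np.loadtxt("spambase.data", delimiter=",")
X, y = data[:, :-1], data[:, -1].astype(int)
print(X.shape, y.shape)  # expected: (4601, 57) (4601,)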
1 Theory
1. Consider the following set of training examples for an unknown target function: (x1, x2) → y:
Y x1 x2 Count
+ T T 3
+ T F 4
+ F T 4
+ F F 1
- T T 0
- T F 1
- F T 3
- F F 5
(a) What is the sample entropy H(Y) of this training data (using log base 2)? (5pts)
(b) What are the information gains for branching on variables x1 and x2? (5pts)
(c) Draw the decision tree that would be learned by the ID3 algorithm without pruning from
this training data. All leaf nodes should have a single class choice at them. If necessary
use the majority class or, in the case of a tie, choose one at random. (10pts)
2. We decided that maybe we can use the number of characters and the average word length of an
essay to determine whether the student should get an A in a class or not. Below are five samples of
this data:
# of Chars Average Word Length Give an A
216 5.68 Yes
69 4.78 Yes
302 2.31 No
60 3.16 Yes
393 4.2 No
(a) What are the class priors, P(A = Yes), P(A = No)? (5pts)
(b) Find the parameters of the Gaussians necessary to do Gaussian Naive Bayes classification
on this decision to give an A or not. Standardize the features first over all of the data together
so that there is no unfair bias towards features of different scales. (5pts)
(c) Using your response from the prior question, determine if an essay with 242 characters
and an average word length of 4.56 should get an A or not. Show the math to support
your decision (10pts).
3. Another common activation function for use in logistic regression or artificial neural networks
is the hyperbolic tangent function, tanh, which is defined as:
tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z))    (1)
(a) Since the hyperbolic tangent function outputs values in the range −1 ≤ tanh(z) ≤ 1,
we will have to augment our log likelihood objective function to deal with this range. If
we opt to use this function for logistic regression (as opposed to 1/(1 + e^(−z))), what will this
objective function be? Show your work. (5pts)
(b) In order to compute the gradient of your previous answer with respect to θj, we'll need
to compute the gradient of the hyperbolic tangent function itself. Use the exponential
definition of the hyperbolic tangent function provided at the top of this problem to show
that

∂/∂θj (tanh(xθ)) = xj (1 − tanh^2(xθ)). (5pts)
(c) Using the fact that ∂/∂θj (tanh(xθ)) = xj (1 − tanh^2(xθ)), what is the gradient of your log
likelihood function in part (a) with respect to θj? Show your work. (5pts)
2 Naive Bayes Classifier
For your first programming task, you’ll train and test a Naive Bayes Classifier.
Download the dataset spambase.data from Blackboard. As mentioned in the Datasets area, this
dataset contains 4601 rows of data, each with 57 continuous-valued features followed by a binary
class label (0 = not-spam, 1 = spam). There is no header information in this file and the data is comma-separated.
As always, your code should work on any dataset that lacks header information and has
several comma-separated continuous-valued features followed by a class id ∈ {0, 1}.
Write a script that:
1. Reads in the data.
2. Randomizes the data.
3. Selects the first 2/3 (round up) of the data for training and the remaining for testing.
4. Standardizes the data (except for the last column, of course) using the training data.
5. Divides the training data into two groups: Spam samples, Non-Spam samples.
6. Creates Normal models for each feature for each class.
7. Classifies each testing sample using these models, choosing the class label based on which
class probability is higher.
8. Computes the following statistics using the testing data results:
(a) Precision
(b) Recall
(c) F-measure
(d) Accuracy
Implementation Details
1. Seed the random number generator with zero prior to randomizing the data.
2. Matlab interprets 0 log 0 as NaN (not a number). You should identify this situation and consider
it to be a value of zero (see the sketch below).
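Below is a minimal sketch of steps 1-8, assuming Python with NumPy; the assignment does not prescribe a language, and all function and variable names are illustrative, not required. Working in log space and flooring zero variances sidesteps the 0 log 0 / NaN issue noted above.

import numpy as np

def gaussian_log_pdf(x, mean, var):
    # Log of the Normal density; summing logs avoids underflow (and the
    # NaN issue) when combining 57 per-feature likelihoods.
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

# 1. Read the data (no header, comma separated, last column is the label).
data = np.loadtxt("spambase.data", delimiter=",")

# 2. Randomize the data, seeding the generator with zero as required.
np.random.seed(0)
np.random.shuffle(data)

# 3. First 2/3 (rounded up) for training, the rest for testing.
n_train = int(np.ceil(2 * len(data) / 3))
train, test = data[:n_train], data[n_train:]
X_train, y_train = train[:, :-1], train[:, -1]
X_test, y_test = test[:, :-1], test[:, -1]

# 4. Standardize using statistics of the training data only.
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
sigma[sigma == 0] = 1.0                      # guard against constant features
X_train = (X_train - mu) / sigma
X_test = (X_test - mu) / sigma

# 5-6. Split by class and fit a Normal model per feature per class.
models, priors = {}, {}
for c in (0, 1):
    Xc = X_train[y_train == c]
    priors[c] = len(Xc) / len(X_train)
    var = Xc.var(axis=0, ddof=1)
    var[var == 0] = 1e-9                     # avoid division by zero
    models[c] = (Xc.mean(axis=0), var)

# 7. Classify each test sample by the larger (log) class probability.
log_post = np.column_stack([
    np.log(priors[c]) + gaussian_log_pdf(X_test, *models[c]).sum(axis=1)
    for c in (0, 1)
])
y_pred = log_post.argmax(axis=1)

# 8. Statistics on the test set (spam = positive class).
tp = np.sum((y_pred == 1) & (y_test == 1))
fp = np.sum((y_pred == 1) & (y_test == 0))
fn = np.sum((y_pred == 0) & (y_test == 1))
tn = np.sum((y_pred == 0) & (y_test == 0))
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_measure = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / len(y_test)
print(precision, recall, f_measure, accuracy)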
In your report you will need:
1. The statistics requested for your Naive Bayes classifier run.
Precision: Around 68%
Recall: Around 95%
F-Measure: Around 79%
Accuracy: Around 81%
Table 1: Evaluation for Naive Bayes classifier
3 Extra Credit: The Precision-Recall Tradeoff
For 10 extra credit points, find a dataset of your choosing on which you can perform binary classification. Now apply your Naive Bayes code to this dataset and vary the threshold required for
an observation to be considered your positive class.
Write a script that:
1. Reads in the data.
2. Randomizes the data.
3. Selects the first 2/3 (round up) of the data for training and the remaining for testing
4. Standardizes the data using the training data
5. Divides the training data into two groups: Positive samples, Negative samples.
6. Creates Normal models for each feature and each class.
7. Computes P(Positive|data) and P(Negative|data) for each testing sample using Naive
Bayes, normalizing them such that P(Positive|data) + P(Negative|data) = 1.
8. Varies the threshold from 0.0 to 1.0 in increments of 0.05, each time:
(a) Using the current threshold, label each testing sample as a True Positive, True Negative,
False Positive, or False Negative.
(b) Compute the Precision and Recall for this threshold level.
9. Plots Precision vs. Recall.
Implementation Details
1. In computing Precision, Recall, F-Measure, and Accuracy, make sure the denominators do not
become zero. If a denominator is zero, check the numerator; if that is also zero, then set the value to one (this guard appears in the sketch below).
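A minimal sketch of the threshold sweep (steps 8-9) and the zero-denominator guard above, assuming Python with NumPy and Matplotlib; p_positive, y_test, and the placeholder data at the bottom are illustrative stand-ins for the normalized posteriors and labels produced by steps 1-7 on your chosen dataset.

import numpy as np
import matplotlib.pyplot as plt

def precision_recall_sweep(p_positive, y_true):
    # Sweep thresholds 0.00, 0.05, ..., 1.00 and return (precisions, recalls).
    precisions, recalls = [], []
    for threshold in np.linspace(0.0, 1.0, 21):      # increments of 0.05
        y_pred = (p_positive >= threshold).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        # Zero-denominator guard from the implementation details: when a
        # denominator is zero its numerator (tp) is zero too, so use 1.0.
        precisions.append(tp / (tp + fp) if (tp + fp) > 0 else 1.0)
        recalls.append(tp / (tp + fn) if (tp + fn) > 0 else 1.0)
    return precisions, recalls

# Placeholder posteriors and labels; replace with P(Positive|data) and the
# true labels produced by steps 1-7 on your chosen dataset.
rng = np.random.default_rng(0)
y_test = rng.integers(0, 2, size=200)
p_positive = np.clip(0.6 * y_test + rng.normal(0.2, 0.25, size=200), 0.0, 1.0)

precisions, recalls = precision_recall_sweep(p_positive, y_test)
plt.plot(recalls, precisions, marker="o")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision vs. Recall")
plt.show()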
Submission
For your submission, upload to Blackboard a single zip file (again, no spaces or special characters
other than underscores in file or directory names) containing:
1. PDF Writeup
2. Source Code
3. If you did the extra credit, the dataset used.
4. readme.txt file
The readme.txt file should contain information on how to run your code to reproduce results for each
part of the assignment.
The PDF document should contain the following:
1. Part 1:
(a) Answers to theory questions
2. Part 2:
(a) Requested Classification Statistics
3. Extra Credit:
(a) Citation and link to dataset.
(b) Plot of Precision vs. Recall