Machine Learning Homework 1
Homework must be submitted electronically on Canvas. Make sure to explain your reasoning or show your
derivations. Except for answers that are especially straightforward, you will lose points for unjustified
answers, even if they are correct.
General Instructions
You are allowed to work with at most one other student on the homework. With your partner, you will
submit only one copy, and you will share the grade that your submission receives. You should set up your
partnership on Canvas as a two-person group.
Submit your homework electronically on Canvas. We recommend using LaTeX, especially for the written problems. But you are welcome to use anything as long as it is neat and readable.
Include a README file that describes all other included files. Indicate which files you modified. You are
welcome to create additional functions, files, or scripts, and you are also welcome to modify the included
interfaces for existing functions if you prefer a different organization.
Since we may work on some of the homework in class, clearly indicate which parts of your submitted
homework are work done in class and which are your own work.
Relatedly, cite all outside sources of information and ideas.
Written Problems
1. (5 points) Show what the recursive decision tree learning algorithm would choose for the first split of
the following dataset:
ID  X1  X2  X3  X4  Y
 1   0   0   0   0  0
 2   0   0   0   1  0
 3   0   0   1   0  0
 4   0   0   1   1  0
 5   0   1   0   0  0
 6   0   1   0   1  1
 7   0   1   1   0  1
 8   0   1   1   1  1
 9   1   0   0   0  1
10   1   0   0   1  1
Assume that the criterion for deciding the best split is entropy reduction (i.e., information gain). If
there are any ties, split on the first of the features tied for the best score. Show your calculations in
your response.
(Hint: this dataset is one of the test cases in the programming assignment, so you should be able to
use your answer here to debug your code.)
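For reference, the entropy quantities in this problem can also be checked numerically. Below is a minimal numpy sketch of a generic entropy helper (it is not part of the provided starter code, and the function name is just an illustration); the example call computes only the label entropy H(Y) for the table above, not the full answer.

import numpy as np

def entropy(labels):
    # Entropy (in bits) of a 1-D array of discrete labels.
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs))

# Label column Y from the table above: five 0s and five 1s
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
print(entropy(y))  # 1.0 bit, since the two classes are balanced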
2. A Bernoulli distribution has the following likelihood function for a data set D:
p(D | θ) = θ^{N_1} (1 − θ)^{N_0},    (1)

where N_1 is the number of instances in data set D that have value 1 and N_0 is the number in D that
have value 0. The maximum likelihood estimate is

θ̂ = N_1 / (N_1 + N_0).    (2)
(a) (5 points) Derive the maximum likelihood estimate above by solving for the maximum of the
likelihood. I.e., show the mathematics that get from Equation (1) to Equation (2).
(b) (5 points) Suppose we now want to maximize the posterior

p(θ | D) = p(D | θ) p(θ) / p(D),    (3)

where we use the Bernoulli likelihood and a slight variant of a symmetric Beta prior over the
Bernoulli parameter (for convenience, we use the exponent α rather than the standard α − 1):

p(θ) ∝ θ^{α} (1 − θ)^{α}.    (4)

Derive the maximum posterior estimate.
Programming Assignment
For this homework, you will build two text categorization classifiers: one using naive Bayes and the other
using decision trees. You will write general code for cross-validation that will apply to either of your
classifiers.
Data and starter code: In the HW1 archive, you should find the 20newsgroups data set (also available
from the original source http://qwone.com/~jason/20Newsgroups/). This data set, whose origin is somewhat fuzzy, consists of newsgroup posts from an earlier era of the Internet. The posts are in different
categories, and this data set has become a standard benchmark for text classification methods.
The data is represented in a bag-of-words format, where each post is represented by what words are
present in it, without any consideration of the order of the words.
We have also provided a unit test class in tests.py, which contains unit tests for each type of learning
model. These unit tests may be easier to use for debugging in an IDE like PyCharm than the iPython notebook. A successful implementation should pass all unit tests and run through the entire iPython notebook
without issues. You can run the unit tests from a *nix command line with the command
python -m unittest -v tests
or you can use an IDE’s unit test interface. These tests are not foolproof, so it’s possible for code that does
not meet the requirements for full credit to pass the tests (and, though it would be surprising, it may be
possible for full credit code to fail the tests).
Before starting all the tasks, examine the entire codebase. Follow the code from the iPython notebook to
see which methods it calls. Make sure you understand what all of the code does.
Your required tasks follow.
1. (0 points) Examine the iPython notebook test_predictors.ipynb. This notebook uses the learning
algorithms and predictors you will implement in the first part of the assignment. Read through the
data-loading code and the experiment code to make sure you understand how each piece works.
2. (0 points) Examine the function calculate_information_gain in decision_tree.py. The function
takes in training data and training labels and computes the information gain for each feature. That is,
for each feature dimension, compute
G(Y, X_j) = H(Y) − H(Y | X_j)
          = − Σ_y Pr(Y = y) log Pr(Y = y)
            + Σ_{x_j} Pr(X_j = x_j) Σ_y Pr(Y = y | X_j = x_j) log Pr(Y = y | X_j = x_j).    (5)

Your function should return the vector

[G(Y, X_1), …, G(Y, X_d)]^T.    (6)
Algorithm 1 Recursive procedure to grow a classification tree
1: function FitTree(D, depth)
2:   if not worth splitting (because D is all one class or the maximum depth is reached) then
3:     node.prediction ← argmax_c Σ_{(x,y)∈D} I(y = c)
4:     return node
5:   w ← argmax_{w′} G(Y, X_{w′})              ▷ see Equation (5)
6:   node.test ← w
7:   node.left ← FitTree(D_L, depth + 1)       ▷ where D_L := {(x, y) ∈ D | x_w = 0}
8:   node.right ← FitTree(D_R, depth + 1)      ▷ where D_R := {(x, y) ∈ D | x_w = 1}
9:   return node
You will use this function to do feature selection and as a subroutine for decision tree learning. Note
how the function avoids loops over the dataset and only loops over the number of classes. Follow this
style to avoid slow Python loops; use numpy array operations whenever possible.
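For illustration, here is a minimal sketch of what such a loop-free computation can look like for a dense n-by-d binary feature matrix. This is not the provided calculate_information_gain implementation; the data layout, variable names, and the small eps guard against log(0) are all assumptions made for the sketch.

import numpy as np

def information_gain_sketch(data, labels):
    # data: (n, d) binary feature matrix; labels: length-n integer class labels.
    n, d = data.shape
    eps = 1e-12  # guard against log(0)

    p_x1 = data.mean(axis=0)   # Pr(X_j = 1) for every feature, shape (d,)
    p_x0 = 1.0 - p_x1          # Pr(X_j = 0)

    h_y = 0.0                  # H(Y)
    h_y_given_x = np.zeros(d)  # H(Y | X_j) for every feature

    for c in np.unique(labels):     # loop only over classes, never over examples
        mask = (labels == c)
        p_c = mask.mean()
        h_y -= p_c * np.log(p_c + eps)

        # Joint probabilities Pr(Y = c, X_j = 1) and Pr(Y = c, X_j = 0), all features at once
        p_c_x1 = data[mask].sum(axis=0) / n
        p_c_x0 = p_c - p_c_x1

        # Accumulate -Pr(Y = c, X_j = x) log Pr(Y = c | X_j = x) feature-wise
        h_y_given_x -= p_c_x1 * np.log(p_c_x1 / (p_x1 + eps) + eps)
        h_y_given_x -= p_c_x0 * np.log(p_c_x0 / (p_x0 + eps) + eps)

    return h_y - h_y_given_x   # G(Y, X_j) for j = 1, ..., d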
3. (5 points) Finish the functions naive_bayes_train and naive_bayes_predict in naive_bayes.py. The
training algorithm should find the maximum likelihood parameters for the probability distribution
Pr(y_i = c | x_i) = Pr(y_i = c) ∏_{w∈W} Pr(x_{iw} | y_i = c) / Pr(x_i).
Make sure to use a log-space representation for these probabilities, since they will become very small,
and notice that you can accomplish the goal of naive Bayes learning without explicitly computing the
denominator Pr(x_i). In other words, you can predict the most likely class label without explicitly
computing that quantity.
Implement additive smoothing (https://en.wikipedia.org/wiki/Additive_smoothing) for your
naive Bayes learner. One natural way to do this is to let the input parameter params simply be the
prior count for each word. For a parameter α, this would mean your smoothed estimate for any
Bernoulli variable X would be

Pr(X) = ((# of examples where X) + α) / ((total # of examples) + 2α).

Notice that if α = 0, you recover the standard maximum likelihood estimate. Also, make sure the
denominator adds α multiplied by the total number of possible outcomes of the distribution: for the label
variable in the 20newsgroups data there are 20 possible outcomes, and for the word-presence features there are two.
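As a toy illustration of both the smoothing formula and the log-space advice above (made-up numbers, not starter code): with 3 positive examples out of 10 and α = 1, the smoothed estimate is (3 + 1)/(10 + 2 · 1) = 1/3, and summing log-probabilities stays finite where a direct product of many small factors underflows to zero.

import numpy as np

# Additive smoothing for a toy Bernoulli variable: 3 positive examples out of 10, alpha = 1
count_x, total, alpha = 3, 10, 1.0
p_x = (count_x + alpha) / (total + 2 * alpha)   # = 4/12 = 1/3

# Log-space representation: log Pr(x_1, ..., x_k | y) = sum_i log Pr(x_i | y)
word_probs = np.full(5000, 1e-4)        # toy per-word probabilities
log_joint = np.sum(np.log(word_probs))  # about -46052, perfectly representable
direct = np.prod(word_probs)            # underflows to 0.0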
4. (5 points) Finish the functions recursive_tree_train and decision_tree_predict in decision_tree.py.
Note that recursive_tree_train is a helper function used by decision_tree_train, which is already
completed for you. You’ll have to design a way to represent the decision tree in the model object. Your
training algorithm should take a parameter that is the maximum depth of the decision tree, and the
learning algorithm should then greedily grow a tree of that depth. Use the information-gain measure
to determine the branches (hint: you’re welcome to use the calculate_information_gain function).
Algorithm 1 is abstract pseudocode describing one way to implement decision tree training. You are
welcome to deviate from this somewhat; there are many ways to correctly implement such procedures.
The pseudocode suggests building a tree data structure that stores in each node either (1) a prediction
or (2) a word to split on and child nodes. The pseudocode also includes the formula for the entropy
criterion for selecting which word to split on.
The prediction function should have an analogous recursion, where it receives a data example and a
node. If the node has children, the function should determine which child to recursively predict with.
If it has no children, it should return the prediction stored at the node.
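Purely as an illustration, one possible way to represent nodes and implement this prediction recursion uses plain dictionaries; this is an assumption made for the sketch, and you are free to organize your model object completely differently.

# Hypothetical node representation:
#   leaf:     {"prediction": c}
#   internal: {"test": w, "left": <node>, "right": <node>}

def predict_one(node, x):
    # Recursively walk the tree for a single example x (a binary feature vector).
    if "prediction" in node:                  # leaf: return the stored prediction
        return node["prediction"]
    child = node["left"] if x[node["test"]] == 0 else node["right"]
    return predict_one(child, x)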
5. (5 points) Finish the function cross_validate in crossval.py, which takes a training algorithm, a
prediction algorithm, a data set, labels, parameters, and the number of folds as input and performs
cross-fold validation using that many folds. For example, calling
params['alpha'] = 1.0
score = cross_validate(naive_bayes_train, naive_bayes_predict, train_data,
                       train_labels, 10, params)
will compute the 10-fold cross-validation accuracy of naive Bayes using regularization parameter
α = 1.0.
The cross-validation should split the input data set into folds subsets, one per fold. Then iteratively hold
out each subset: train a model using all data except that subset and evaluate the accuracy on the held-out
subset. The function should return the average accuracy over all folds splits.
Some code to manage the indexing of the splits is included. You are welcome to change it if you prefer
a different way of organizing the indexing; one common approach is sketched at the end of this item.
Once you complete this last step, you should be able to run the notebook cv_predictors.ipynb,
which should use cross-validation to compare decision trees to naive Bayes on the 20-newsgroups task.
Naive Bayes should be much more accurate than decision trees, but the cross-validation should
find a decision tree depth that performs a bit better than the depth hard-coded into test_predictors.ipynb.
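For reference, one common way to generate the held-out index sets (the sketch mentioned above) uses a shuffled permutation split into roughly equal parts. This is only an illustration of the idea, not the indexing code included in the starter files.

import numpy as np

def fold_indices(num_examples, folds, seed=0):
    # Shuffle the example indices and split them into `folds` roughly equal parts.
    rng = np.random.default_rng(seed)
    order = rng.permutation(num_examples)
    return np.array_split(order, folds)

# Each returned array is one held-out set; train on the concatenation of the others.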