HOMEWORK 2
Instructions: Although this is a programming homework, you only need to hand in a pdf answer file. There
is no need to submit the latex source or any code. You can choose any programming language, as long as you
implement the algorithm from scratch (e.g. do not use sklearn or any other library’s decision tree implementation
on questions 1 to 7).
Use this latex file as a template to develop your homework. Submit your homework on time as a single pdf file to
Canvas. Please check Piazza for updates about the homework.
1 A Simplified Decision Tree
You are to implement a decision-tree learner for classification. To simplify your work, this will not be a general
purpose decision tree. Instead, your program can assume that
• each item has two continuous features x ∈ R²
• the class label is binary and encoded as y ∈ {0, 1}
• data files are in plaintext with one labeled item per line, separated by whitespace:
x11 x12 y1
...
xn1 xn2 yn
Your program should implement a decision tree learner according to the following guidelines:
• Candidate splits (j, c) for numeric features should use a threshold c in feature dimension j in the form of
x·j ≥ c.
• c should be on values of that dimension present in the training data; i.e. the threshold is on training points,
not in between training points. You may enumerate all features, and for each feature, use all possible values
for that dimension.
• You may skip those candidate splits with zero split information (i.e. the entropy of the split), and continue
the enumeration.
• The left branch of such a split is the “then” branch, and the right branch is “else”.
• Splits should be chosen using information gain ratio. If there is a tie you may break it arbitrarily.
• The stopping criteria (for making a node into a leaf) are that
– the node is empty, or
– all splits have zero gain ratio (if the entropy of the split is non-zero), or
– the entropy of any candidate split is zero
• To simplify, whenever there is no majority class in a leaf, let it predict y = 1.
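As a point of reference, here is a minimal Python sketch (any language is fine) of the two core computations described above: loading a data file and scoring a candidate split x_j ≥ c by information gain ratio. The function names and the tuple-based data representation are our own choices, not part of the assignment.

import math

def load_data(path):
    # Each line holds one labeled item: x1 x2 y, separated by whitespace.
    data = []
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            a, b, y = line.split()
            data.append((float(a), float(b), int(float(y))))
    return data

def entropy(items):
    # Empirical entropy (base 2) of the binary labels in items.
    if not items:
        return 0.0
    p = sum(y for _, _, y in items) / len(items)
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def gain_ratio(items, j, c):
    # Candidate split x_j >= c; the "then" (left) branch takes items meeting the test.
    left = [it for it in items if it[j] >= c]
    right = [it for it in items if it[j] < c]
    n = len(items)
    split_info = -sum(len(p) / n * math.log2(len(p) / n) for p in (left, right) if p)
    if split_info == 0.0:
        return None  # zero split information: skip this candidate during learning
    gain = entropy(items) - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
    return gain / split_info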
2 Questions
1. (Our algorithm stops at pure labels) [5 pts] If a node is not empty but contains training items with the same
label, why is it guaranteed to become a leaf? Explain. You may assume that the feature values of these
items are not all the same.
2. (Our algorithm is greedy) [5 pts] Handcraft a small training set where both classes are present but the
algorithm refuses to split; instead it makes the root a leaf and stops. Importantly, if we were to manually
force a split, the algorithm would happily continue splitting the data set further and produce a deeper tree with
zero training error. You should (1) plot your training set, (2) explain why. Hint: you don’t need more than a
handful of items.
3. (Gain ratio exercise) [10 pts] Use the training set Druns.txt. For the root node, list all candidate cuts and
their information gain ratio. If the entropy of the candidate split is zero, please list its mutual information
(i.e. information gain). Hint: to get log₂(x) when your programming language may be using a different base,
use log(x)/log(2). Also, please follow the split rule in the first section.
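For this question, one possible way to enumerate the root’s candidate cuts is sketched below, reusing the load_data, entropy, and gain_ratio helpers sketched at the end of Section 1 (again, those names are our own):

data = load_data("Druns.txt")
for j in (0, 1):                                   # feature dimension
    for c in sorted({it[j] for it in data}):       # thresholds taken from training values
        gr = gain_ratio(data, j, c)
        if gr is None:
            # zero split information: report the mutual information (info gain) instead
            left = [it for it in data if it[j] >= c]
            right = [it for it in data if it[j] < c]
            gain = entropy(data) - sum(len(p) / len(data) * entropy(p) for p in (left, right))
            print(f"x{j+1} >= {c}: split entropy 0, info gain {gain:.4f}")
        else:
            print(f"x{j+1} >= {c}: gain ratio {gr:.4f}")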
4. (The king of interpretability) [10 pts] The decision tree is not the most accurate classifier in general. However,
it persists. This is largely due to its rumored interpretability: a data scientist can easily explain a tree to a
non-data scientist. Build a tree from D3leaves.txt. Then manually convert your tree to a set of logic
rules. Show the tree¹ and the rules.
5. (Or is it?) [20 pts] For this question only, make sure you DO NOT VISUALIZE the data sets or plot your
tree’s decision boundary in the 2D x space. If your code does that, turn it off before proceeding. This is
because you want to see your own reaction when trying to interpret a tree. You will get points no matter
what your interpretation is. And we will ask you to visualize them in the next question anyway.
• Build a decision tree on D1.txt. Show it to us in any format (e.g. could be a standard binary tree with
nodes and arrows, and denote the rule at each leaf node; or as simple as plaintext output where each
line represents a node with appropriate line number pointers to child nodes; whatever is convenient
for you). Again, do not visualize the data set or the tree in the x input space. In real tasks you will not
be able to visualize the whole high dimensional input space anyway, so we don’t want you to “cheat”
here.
• Look at your tree in the above format (remember, you should not visualize the 2D dataset or your
tree’s decision boundary) and try to interpret the decision boundary in human understandable English.
• Build a decision tree on D2.txt. Show it to us.
• Try to interpret your D2 decision tree. Is it easy or possible to do so without visualization?
6. (Hypothesis space) [10 pts] For D1.txt and D2.txt, do the following separately:
• Produce a scatter plot of the data set.
• Visualize your decision tree’s decision boundary (or decision region, or some other ways to clearly
visualize how your decision tree will make decisions in the feature space).
Then discuss why the size of your decision trees on D1 and D2 differ. Relate this to the hypothesis space of
our decision tree algorithm.
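For the boundary plots in this question, a common approach is to evaluate the learned tree on a dense grid of points and shade the predicted regions. The sketch below assumes a predict(tree, x1, x2) function exposed by your own implementation; that name and the plotting choices are ours, not requirements.

import numpy as np
import matplotlib.pyplot as plt

def plot_regions(tree, data, predict, steps=300):
    # data: list of (x1, x2, y) tuples; predict(tree, x1, x2) -> 0 or 1.
    xs = np.array([[x1, x2] for x1, x2, _ in data])
    ys = np.array([y for _, _, y in data])
    g1, g2 = np.meshgrid(np.linspace(xs[:, 0].min(), xs[:, 0].max(), steps),
                         np.linspace(xs[:, 1].min(), xs[:, 1].max(), steps))
    z = np.array([predict(tree, a, b) for a, b in zip(g1.ravel(), g2.ravel())])
    plt.contourf(g1, g2, z.reshape(g1.shape), levels=[-0.5, 0.5, 1.5], alpha=0.3)
    plt.scatter(xs[:, 0], xs[:, 1], c=ys, s=8)
    plt.xlabel("x1")
    plt.ylabel("x2")
    plt.show()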
7. (Learning curve) [20 pts] We provide a data set Dbig.txt with 10000 labeled items. Caution: Dbig.txt
is sorted.
• You will randomly split Dbig.txt into a candidate training set of 8192 items and a test set (the rest).
Do this by generating a random permutation, and split at 8192.
• Generate a sequence of five nested training sets D32 ⊂ D128 ⊂ D512 ⊂ D2048 ⊂ D8192 from the
candidate training set. The subscript n in Dn denotes training set size. The easiest way is to take
the first n items from the (same) permutation above. This sequence simulates the real world situation
where you obtain more and more training data.
• For each Dn above, train a decision tree. Measure its test set error err_n. Show three things in your
answer: (1) List n, number of nodes in that tree, err_n. (2) Plot n vs. err_n. This is known as a learning
curve (a single plot). (3) Visualize your decision trees’ decision boundary (five plots).
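One possible way to produce the random split and the nested training sets (the fixed seed and the variable names are our own choices):

import numpy as np

rng = np.random.default_rng(0)            # fix a seed so the split is reproducible
data = np.loadtxt("Dbig.txt")             # 10000 rows of x1 x2 y; note the file is sorted
perm = rng.permutation(len(data))
candidate, test = data[perm[:8192]], data[perm[8192:]]

# Nested training sets: the first n rows of the (same) permuted candidate set.
train_sets = {n: candidate[:n] for n in (32, 128, 512, 2048, 8192)}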
¹When we say show the tree, we mean either the standard computer science tree view, or some crude plaintext representation of the tree –
as long as you explain the format. When we say visualize the tree, we mean a plot in the 2D x space that shows how the tree will classify any
points.
3 sklearn [10 pts]
Learn to use sklearn https://scikit-learn.org/stable/. Use sklearn.tree.DecisionTreeClassifier to
produce trees for datasets D32, D128, ..., D8192. Show two things in your answer: (1) List n, number of nodes in
that tree, err_n. (2) Plot n vs. err_n.
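A possible sketch, assuming the train_sets dictionary and the test array from the split shown in Question 7; sklearn’s tree_.node_count attribute gives the node count:

from sklearn.tree import DecisionTreeClassifier

results = []
for n, D in sorted(train_sets.items()):
    clf = DecisionTreeClassifier()                     # default settings; tune if you wish
    clf.fit(D[:, :2], D[:, 2])
    err = 1.0 - clf.score(test[:, :2], test[:, 2])     # test-set error err_n
    results.append((n, clf.tree_.node_count, err))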
4 Lagrange Interpolation [10 pts]
Fix some interval [a, b] and sample n = 100 points x from this interval uniformly. Use these to build a training
set consisting of n pairs (x, y) by setting y = sin(x).
Build a model f by using Lagrange interpolation, as discussed in lecture. Generate a test set using the same
distribution as your training set. Compute and report the resulting model’s train and test error. What do you observe?
Repeat the experiment with zero-mean Gaussian noise ε added to x. Vary the standard deviation for ε and report
your findings.
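A minimal sketch using scipy.interpolate.lagrange; the interval endpoints, random seed, noise levels, and the use of mean squared error are our own choices. Keep in mind that exact polynomial interpolation through 100 points is numerically delicate.

import numpy as np
from scipy.interpolate import lagrange

a, b, n = 0.0, 10.0, 100                  # pick any interval [a, b]
rng = np.random.default_rng(0)
x_train = rng.uniform(a, b, n)
y_train = np.sin(x_train)

f = lagrange(x_train, y_train)            # degree-(n-1) interpolating polynomial

x_test = rng.uniform(a, b, n)             # test set from the same distribution
y_test = np.sin(x_test)
train_err = np.mean((f(x_train) - y_train) ** 2)
test_err = np.mean((f(x_test) - y_test) ** 2)
print(train_err, test_err)

# Noisy variant: zero-mean Gaussian noise added to the training inputs x.
for sigma in (0.01, 0.1, 1.0):
    f_noisy = lagrange(x_train + rng.normal(0.0, sigma, n), y_train)
    print(sigma, np.mean((f_noisy(x_test) - y_test) ** 2))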