Computer Science Department - Rutgers University Spring 2019
Pruning Decision Trees
The purpose of this problem set is to look at the effect of pruning on decision trees. As before, we need a generative model for data so that we can run repeatable experiments. Let {(X1, Y1), (X2, Y2), . . . , (Xm, Ym)} denote a data set, where Xi represents a vector of k (binary) feature values, and Yi is a corresponding binary class or label that we will need to learn to predict from the X-values.
We generate data via the following scheme, defining a distribution for our data set: let X = (X0, X1, X2, X3, . . . , X20) be a vector of binary values satisfying the following:
• X0 = 1 with probability 1/2, X0 = 0 with probability 1/2
• For i = 1, . . . , 14, Xi = Xi−1 with probability 3/4, and Xi = 1 − Xi−1 with probability 1/4
• For i = 15, . . . , 20, Xi = 1 with probability 1/2, Xi = 0 with probability 1/2.
The first feature is uniformly random, and the next 14 features are strongly correlated, but the last 6 features are independent of everything else. There are 21 X-variables, so there are 2^21 ≈ 2 million possible inputs X. Some of these are more likely than others. In general, we expect the training data to cover only a fraction of the total possible inputs, so consider data sets of size m where m ranges from 10 to 10,000. We then define Y to be

    Y = majority(X1, . . . , X7)    if X0 = 0,
        majority(X8, . . . , X14)   if X0 = 1.        (1)
That is, if X0 = 0, we take the majority value of X1 through X7; otherwise we take the majority value of X8 through X14. The values X15 through X20 are nothing but noise.
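To make the scheme concrete, here is a minimal sketch of a sampler for this distribution in Python (assuming numpy; the function names are our own, not part of the problem):

import numpy as np

def generate_sample(rng):
    x = np.zeros(21, dtype=int)
    x[0] = rng.integers(2)                      # X0: fair coin
    for i in range(1, 15):                      # X1..X14: copy X_{i-1} w.p. 3/4
        x[i] = x[i - 1] if rng.random() < 0.75 else 1 - x[i - 1]
    x[15:21] = rng.integers(2, size=6)          # X15..X20: independent noise
    block = x[1:8] if x[0] == 0 else x[8:15]    # pick the block per equation (1)
    y = int(block.sum() >= 4)                   # majority of 7 binary values
    return x, y

def generate_data(m, seed=None):
    rng = np.random.default_rng(seed)
    pairs = [generate_sample(rng) for _ in range(m)]
    X = np.array([p[0] for p in pairs])
    Y = np.array([p[1] for p in pairs])
    return X, Y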
1) Write a function to generate m samples of (X, Y), and another to fit a tree to that data using ID3. Write a third function that, given a decision tree f, estimates the error rate of that tree on the underlying distribution, err(f) (one Monte Carlo approach is sketched below). Do this repeatedly for a range of m values, and plot the ‘typical’ error of a tree trained on m data points as a function of m. Does this agree with your intuition?
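Because the distribution is known, err(f) can be estimated by Monte Carlo: draw a large fresh sample and measure the tree's error on it. This sketch assumes the generate_data function above and a hypothetical tree representation in which an internal node is a dict {'feature': i, 0: left_subtree, 1: right_subtree} and a leaf is a plain 0/1 label:

def predict(tree, x):
    # walk from the root until a 0/1 leaf is reached
    while isinstance(tree, dict):
        tree = tree[x[tree['feature']]]
    return tree

def estimate_error(tree, n=100_000, seed=None):
    X, Y = generate_data(n, seed)
    wrong = sum(predict(tree, x) != y for x, y in zip(X, Y))
    return wrong / n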
2) Note that X15 through X20 are completely irrelevant to predicting the value of Y. For a range of m values, repeatedly generate data sets of that size, fit trees to that data, and estimate the average number of irrelevant variables included in the fitted tree (one way to count them is sketched below). How much data would you need, typically, to avoid fitting on this noise?
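Under the same hypothetical dict/leaf representation as above, counting spurious variables is a simple tree traversal:

def split_features(node, found=None):
    found = set() if found is None else found
    if isinstance(node, dict):                  # internal node: record and recurse
        found.add(node['feature'])
        split_features(node[0], found)
        split_features(node[1], found)
    return found

def count_irrelevant(tree):
    noise = set(range(15, 21))                  # X15..X20 carry no signal
    return len(split_features(tree) & noise)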
3) Generate a data set of size m = 10,000, and set aside 8,000 points for training and 2,000 points for testing. The remaining questions should all be applied to this data set.
a) Pruning by Depth: Consider growing a tree as a process: running ID3, for instance, until all splits up to depth d have been performed. Depth d = 0 should correspond to no decisions; a prediction for Y is made just from the raw frequencies of Y in the data. Plot, as a function of d, the error on the training set and the error on the test set for a tree grown to depth d (a sketch of the depth-limited recursion appears below). What does your data suggest as a good threshold depth?
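A sketch of depth-limited ID3 under the same hypothetical dict/leaf representation, splitting on information gain; with d = 0 it returns a bare majority-vote leaf, as the problem requires:

import numpy as np

def entropy(y):
    p = y.mean() if len(y) else 0.0
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def majority(y, rng):
    if len(y) == 0 or y.mean() == 0.5:          # tie: break by coin flip
        return int(rng.integers(2))
    return int(y.mean() > 0.5)

def id3_depth(X, Y, d, rng, used=frozenset()):
    if d == 0 or len(set(Y)) <= 1 or len(used) == X.shape[1]:
        return majority(Y, rng)
    def gain(i):                                # information gain of splitting on X_i
        m0, m1 = X[:, i] == 0, X[:, i] == 1
        return entropy(Y) - m0.mean() * entropy(Y[m0]) - m1.mean() * entropy(Y[m1])
    best = max((i for i in range(X.shape[1]) if i not in used), key=gain)
    m0, m1 = X[:, best] == 0, X[:, best] == 1
    return {'feature': best,
            0: id3_depth(X[m0], Y[m0], d - 1, rng, used | {best}),
            1: id3_depth(X[m1], Y[m1], d - 1, rng, used | {best})}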
b) Pruning by Sample Size: The less data a split is performed on, the less ‘accurate’ we expect the result of that split to be. Let s be a threshold such that if the number of data points available at a node in your decision tree is less than or equal to s, you do not split, and instead decide Y by simple majority vote (ties broken by coin flip). Plot, as a function of s, the error on the training set and the error on the test set for a tree split down to sample size s (a sketch of this variant appears below). What does your data suggest as a good sample size threshold?
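The same recursion handles the sample-size cutoff, reusing the entropy and majority helpers from the sketch in 3.a; only the stopping rule changes:

def id3_min_samples(X, Y, s, rng, used=frozenset()):
    if len(Y) <= s or len(set(Y)) <= 1 or len(used) == X.shape[1]:
        return majority(Y, rng)                 # too few points: majority vote
    def gain(i):
        m0, m1 = X[:, i] == 0, X[:, i] == 1
        return entropy(Y) - m0.mean() * entropy(Y[m0]) - m1.mean() * entropy(Y[m1])
    best = max((i for i in range(X.shape[1]) if i not in used), key=gain)
    m0, m1 = X[:, best] == 0, X[:, best] == 1
    return {'feature': best,
            0: id3_min_samples(X[m0], Y[m0], s, rng, used | {best}),
            1: id3_min_samples(X[m1], Y[m1], s, rng, used | {best})}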
c) Pruning by Significance: If a variable X is independent of Y, then X has no value as a splitting variable. We can use something like the χ²-test to estimate how likely a potential splitting variable is to be independent, based on the test statistic T compared to some threshold T0 (in the usual 2-outcome case, T0 = 3.841 is used to test at a significance level of p = 5%; see notes for more explanation). Given T0, if the value of T computed from the data for X is less than T0, X is deemed not significant and is not used for splitting; if T is greater than T0, X is deemed significant and may be used for splitting (a sketch of computing T appears below). Plot, as a function of T0, the error on the training set and the error on the test set for a tree split at significance threshold T0. What does your data suggest as a good threshold for significance?
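A sketch of the test statistic for one candidate binary variable: T here is the usual Pearson χ² statistic computed from the 2×2 contingency table of (X, Y) counts, compared against T0 before splitting (3.841 corresponds to one degree of freedom at p = 5%):

import numpy as np

def chi_squared(x, y):
    # Pearson chi-squared statistic for independence of two binary vectors
    n = len(y)
    T = 0.0
    for xv in (0, 1):
        for yv in (0, 1):
            observed = np.sum((x == xv) & (y == yv))
            expected = np.sum(x == xv) * np.sum(y == yv) / n
            if expected > 0:
                T += (observed - expected) ** 2 / expected
    return T

# e.g., consider X_i as a splitting variable only when chi_squared(X[:, i], Y) > T0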
4) Repeat the computation of Problem 2, growing your trees only to depth d as chosen in 3.a. How does this change the likelihood or frequency of including spurious variables in your trees?
5) Repeat the computation of Problem 2, splitting your trees only down to sample size s as chosen in 3.b. How does this change the likelihood or frequency of including spurious variables in your trees?
6) Repeat the computation of Problem 2, splitting your trees only at or above significance threshold T0 as chosen in 3.c. How does this change the likelihood or frequency of including spurious variables in your trees?