Decision Trees
In this assignment, you will explore decision trees. A binary decision tree is a full binary tree used to identify the class to which an observed data point belongs. The following example illustrates this concept.
This example shows a decision tree that predicts whether a given person wears glasses or not, using only their age and gender. Obviously, using these two pieces of information is very limiting and certainly does not allow us to make a reliable prediction. The percentages shown are an estimate of the accuracy of these predictions. They were obtained from a large number of samples (for which age, gender and presence of glasses were known). By applying the rules of the decision tree to a large number of observations, it is possible to obtain a good estimate of the accuracy of each of the classifications obtained in the leaves of the tree. Indeed, with these pre-classified samples, it is possible to count how many samples of each class reach a given leaf. For example, for the leaf corresponding to the decision [Age < 45, Female, Age < 24], it was observed that in 62% of cases the sample reaching this leaf was a person without glasses, and in 38% of cases it was a person with glasses. So if an observation to be classified enters at the root and ends on this leaf, the assigned class will be 'without glasses', since a higher percentage of the samples (62%) that reached this leaf belonged to that class. Obviously, a classification associated with a high percentage (say 97%) is far more reliable than one associated with a lower percentage, such as 62%.
To perform the classification, the decision tree uses input vectors of a certain size (2 in our example: age and gender). The decision consists in determining the membership class; in our example there are 2 classes (with glasses or without glasses), but in general there may be more.
The decision tree consists of nodes (stumps) that each use one of the entries of the feature vector and apply a threshold to it. Depending on the result of the comparison, we branch left or right. A vector enters the tree at the root and follows a series of comparisons leading to a leaf of the tree. The file DecisionStump.java provides an implementation for a decision node. Each node specifies an index (featureIndex) in the vector and a threshold (threshold) to be used in the comparison. The leaves of the tree are dummy nodes without comparison, which return the class with the highest probability for that node. These probabilities are obtained either by specifying them directly or by accumulating the results obtained from pre-classified samples (classFreq and total). The main method of this class builds the decision tree shown on the previous page and tests its results with different inputs.
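For illustration, the traversal described above could look roughly like the following minimal sketch. The accessors isLeaf(), getLeft(), getRight(), getFeatureIndex() and getThreshold() are assumed names and may differ from those in the provided DecisionStump.java; getMaxClass() is the method mentioned below.

    // Minimal sketch: a feature vector descends from the root to a leaf,
    // branching on one feature per stump (assumed accessor names).
    static int classify(DecisionStump node, double[] vector) {
        while (!node.isLeaf()) {
            if (vector[node.getFeatureIndex()] < node.getThreshold()) {
                node = node.getLeft();   // feature below threshold: branch left
            } else {
                node = node.getRight();  // otherwise branch right
            }
        }
        return node.getMaxClass();       // a leaf reports its most probable class
    }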
Part A (marks 50%)
Using the DecisionStump class, complete the DecisionTree class by adding the following operations (a sketch of one possible approach is given after the list):
public void replace(DecisionStump leaf, int featureIndex, double threshold): that replaces a given leaf of the decision tree by a new stump that applies a decision threshold on one element of a feature vector. This operation also creates the dummy leaves. This operation will be used to grow the decision tree.
public DecisionStump getSmallestMaxProb(): that returns the leaf having the smallest maximum probability (i.e. the least reliable node). Look at the getMaxClass and getMaxProb methods of DecisionStump. This operation will be used to replace an unreliable leaf by a new decision stump.
public void print(): that prints all the nodes of a decision tree by calling the toString method of the DecisionStump instances in a pre-order traversal.
public int getDecision(double[] vector): that returns the class associated with a sample of unknown class. This operation will return the class number having the highest probability.
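As one possible approach for getSmallestMaxProb, the sketch below recursively visits the tree and keeps the leaf whose getMaxProb() is smallest. The accessors isLeaf(), getLeft() and getRight() are assumed names; getMaxProb() is the DecisionStump method mentioned above, and getRoot() is the accessor used later in Part B.

    // Possible recursive scan for the least reliable leaf (assumed accessor names).
    private DecisionStump smallestMaxProb(DecisionStump node) {
        if (node.isLeaf()) {
            return node;                             // leaves are the candidates
        }
        DecisionStump left  = smallestMaxProb(node.getLeft());
        DecisionStump right = smallestMaxProb(node.getRight());
        // keep the least reliable of the two subtrees' candidates
        return (left.getMaxProb() <= right.getMaxProb()) ? left : right;
    }

    public DecisionStump getSmallestMaxProb() {
        return smallestMaxProb(getRoot());           // start the search at the root
    }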
Part B (marks 25%)
File iris.data.txt contains a list of samples specifying the length and width of the petals and sepals of 3 different species of iris. The file thus contains feature vectors of dimension 4 with 3 possible classes.
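A hypothetical loader for this file is sketched below. It assumes comma-separated lines such as "5.1,3.5,1.4,0.2,Iris-setosa"; adapt the parsing if the provided iris.data.txt encodes the class differently (e.g. as an integer).

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical loader for iris.data.txt (assumed comma-separated format).
    class IrisData {
        final List<double[]> vectors = new ArrayList<>();
        final List<Integer> classes = new ArrayList<>();

        void load(String fileName) throws IOException {
            try (BufferedReader in = new BufferedReader(new FileReader(fileName))) {
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.trim().isEmpty()) {
                        continue;                            // skip blank lines
                    }
                    String[] parts = line.split(",");
                    double[] v = new double[4];
                    for (int i = 0; i < 4; i++) {
                        v[i] = Double.parseDouble(parts[i]); // the four measurements
                    }
                    vectors.add(v);
                    classes.add(toClassIndex(parts[4]));     // species name -> 0, 1 or 2
                }
            }
        }

        private int toClassIndex(String label) {
            switch (label.trim()) {
                case "Iris-setosa":     return 0;
                case "Iris-versicolor": return 1;
                default:                return 2;            // assumed: Iris-virginica
            }
        }
    }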
1. Generate a decision tree made of one stump that uses the element at index 0 with a threshold of 5.0:

   // Create a 1-node tree
   DecisionTree dt = new DecisionTree(6, 3);
   dt.replace(dt.getRoot(), 0, 5.0);

2. Call the train method of DecisionTree with each pre-classified sample in the given file (one possible training loop is given in the sketch after this list).
3. Print all the nodes and their associated probabilities.
4. Find the leaf having the smallest maximal probability. Replace it with a stump on index 2 with threshold 2.5. Train this new decision tree with the same dataset:

   // train a 2-node decision tree
   dt.replace(dt.getSmallestMaxProb(), 2, 2.5);
   dt.resetAll();
   // add training code…

5. Print all the nodes and their associated probabilities.
6. Add another node and re-train the tree as follows:

   // train a 3-node decision tree
   dt.replace(dt.getSmallestMaxProb(), 1, 3.0);
   dt.resetAll();
   // add training code…
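Putting the six steps together, a possible driver is sketched below. It assumes the train method takes a feature vector and its class index, and reuses the hypothetical IrisData loader from the earlier sketch; adjust both to the actual signatures in your code.

    // Possible Part B driver (assumed train(double[], int) signature,
    // hypothetical IrisData loader).
    public static void main(String[] args) throws java.io.IOException {
        IrisData data = new IrisData();
        data.load("iris.data.txt");

        DecisionTree dt = new DecisionTree(6, 3);
        dt.replace(dt.getRoot(), 0, 5.0);                 // step 1: index 0, threshold 5.0
        for (int i = 0; i < data.vectors.size(); i++) {
            dt.train(data.vectors.get(i), data.classes.get(i));   // step 2: train on every sample
        }
        dt.print();                                       // step 3: nodes and probabilities

        dt.replace(dt.getSmallestMaxProb(), 2, 2.5);      // step 4: grow on index 2, threshold 2.5
        dt.resetAll();                                    // clear old counts before retraining
        for (int i = 0; i < data.vectors.size(); i++) {
            dt.train(data.vectors.get(i), data.classes.get(i));
        }
        dt.print();                                       // step 5

        dt.replace(dt.getSmallestMaxProb(), 1, 3.0);      // step 6: third node, index 1, threshold 3.0
        dt.resetAll();
        for (int i = 0; i < data.vectors.size(); i++) {
            dt.train(data.vectors.get(i), data.classes.get(i));
        }
        dt.print();
    }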
Part C (marks 25%)
We now ask you to generalize the process described in Part B.
1. Create a 1-node decision tree by adding a stump with a randomly chosen index and threshold.
2. Train this tree with the given dataset of pre-classified iris samples.
3. Find the leaf having the smallest maximal probability and replace it with a new decision stump with a randomly chosen index and threshold.
4. Repeat steps 2 and 3 up to 50 times (a possible loop is sketched after this list).
5. Do you see an improvement in the performance of the classifier? Justify. Print the answer to this question from your code; you can print relevant information about the decision trees to justify your answer.
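One possible shape for this loop is sketched below. The random index range (0 to 3), the threshold range (roughly covering the iris measurements), the meaning of the first constructor argument (assumed to bound the tree size, so a larger value is used for 50 growth steps) and the train signature are all assumptions; adapt them to your implementation.

    // Possible Part C loop (assumed constructor meaning, train(double[], int)
    // signature, hypothetical IrisData loader and assumed threshold range [1, 8]).
    public static void main(String[] args) throws java.io.IOException {
        IrisData data = new IrisData();
        data.load("iris.data.txt");

        java.util.Random rng = new java.util.Random();
        DecisionTree dt = new DecisionTree(128, 3);
        dt.replace(dt.getRoot(), rng.nextInt(4), 1.0 + 7.0 * rng.nextDouble()); // step 1

        for (int step = 0; step < 50; step++) {
            dt.resetAll();                             // step 2: retrain from scratch
            for (int i = 0; i < data.vectors.size(); i++) {
                dt.train(data.vectors.get(i), data.classes.get(i));
            }
            // step 3: replace the least reliable leaf with another random stump
            dt.replace(dt.getSmallestMaxProb(), rng.nextInt(4), 1.0 + 7.0 * rng.nextDouble());
        }

        dt.print();   // print information that helps justify whether the classifier improved
    }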
WHAT TO HAND IN:
• Your program: DecisionTree.java (with the required modifications) and DecisionStump.java (no need to modify it).
• README.txt file (optional): explain any deviations (e.g. extra classes used) and any known bugs.
• Printout.txt: the output of your program showing the results for Parts B and C; this is expected to be identical to what we would obtain by running the program you submit.
• Put the files specified above in a folder called u9999999 (where 9999999 is your student number); zip the folder and submit the zipped folder.
DEVIATIONS FROM THESE STANDARDS WILL GET 5-10% MARKS OFF.