
Assignment 1
1 Criteria for Choosing a Feature to Split [20 points]

When choosing a single feature to use to classify (for example, when greedily choosing the next feature to use for a subset of the data inside a decision tree), we proposed in lecture to choose the feature that will give the greatest reduction in error. Let D be the dataset under consideration, n be the number of examples in D with label 0 (negative examples), and p be the number of examples in D with label 1 (positive examples).
1.1 Not Splitting [5 points]

Suppose we have reached the maximum depth allowed by our learning algorithm; we may not use any more features. At this point, we have a partition of the data, D′ (a subset of D), which contains n′ negative examples and p′ positive examples. Assuming that our algorithm is deterministic, what is the smallest number of mistakes (in D′) we can make, and how do we obtain it? (Give a formula in terms of the symbols we have introduced above.)
1.2 Splitting [5 points]

Consider a binary feature φ with the following contingency table for D′:

          φ = 0    φ = 1
  y = 0    n_0      n_1
  y = 1    p_0      p_1
Note that p′ = p_0 + p_1 and n′ = n_0 + n_1. If we split D′ based on feature φ, how many mistakes will we make (in D′)? (Give a formula in terms of the symbols we have introduced.)

Another way to think about the which-feature-if-any-to-split-on decision is to maximize the number of mistakes that we would correct by splitting (as opposed to not splitting). This is given by subtracting the answer to 1.2 from the answer to 1.1. If we divide this value by |D′| (the number of examples in D′), then we have the absolute reduction in error rate achieved by making the split (as opposed to not splitting). From here on, call this value "error rate reduction." Before proceeding, think carefully about this question: is it possible for a split to increase the error rate on the training set? (No points for this, but you should have a confident answer in your mind.)
1.3 Mutual Information [10 points]

A rather different way to think about the decision is to use information theory. Simply put, we want to choose a feature φ that gives us as much information as possible about the label, within D. Information theory gives us a way to think about measuring information.

Mutual information¹ is an association score between two random variables A and B. For simplicity's sake, we assume both of them are binary and take values in {0, 1}. The mutual information between A and B is given by:

I(A; B) = \sum_{a \in \{0,1\}} \sum_{b \in \{0,1\}} P(A = a, B = b) \log \frac{P(A = a, B = b)}{P(A = a) \cdot P(B = b)}    (1)

If we use base-2 logarithms, then this quantity is measured in bits. It is the average number of bits that observing random variable A tells you about random variable B, and vice versa, across many trials. If A and B are perfectly predictive of each other (either always the same or always different), then I(A; B) = 1 bit. If they carry no information about each other, then I(A; B) approaches 0.

The formula for the mutual information between the binary value of feature φ and the binary label y, under the distribution observed in D, is as follows; we use |D| to denote the size of D; note that |D| = p + n = p_0 + n_0 + p_1 + n_1. This is obtained by thinking about the feature and the label as two random variables and plugging in relative frequency estimates of the probabilities.
I(φ; Y) = \frac{n_0}{|D|} \log \frac{|D|\, n_0}{(n_0 + p_0)\, n} + \frac{p_0}{|D|} \log \frac{|D|\, p_0}{(n_0 + p_0)\, p} + \frac{n_1}{|D|} \log \frac{|D|\, n_1}{(n_1 + p_1)\, n} + \frac{p_1}{|D|} \log \frac{|D|\, p_1}{(n_1 + p_1)\, p}    (2)

We've done a bit of simplification on the formula above, but it's still pretty daunting. There are two ways to go about gaining some intuition here. The first is to learn about information theory; we hope you'll consider doing that.²

¹ Strictly speaking, this quantity is known as average mutual information. When used for building decision trees, it is often called "information gain."
² Cover and Thomas [1] is an excellent introductory textbook, an older edition of which your instructor owns a copy that is perpetually borrowed by a Ph.D. student (a good sign that it's useful). If you find an amazing tutorial online, please recommend it to the class!
The second way is to explore how these functions (error rate reduction and mutual information) change for features with different relationships to the labels. Suppose that |D| = 1000. Let's start by fixing n = p = 500; our dataset is perfectly balanced. Imagine that the dataset creators have been quite generous to us and provided us with 499 features. For i from 1 to 499, let the ith feature φ_i have values n_0 = p_1 = i and n_1 = p_0 = 500 − i. (Remember: this is a contrived example. Real-world data will never be so perfectly constructed!)

Generate a plot where the x-axis is i (the index of one of the 499 features) and the y-axis is error rate reduction for that feature. On the same plot, in a different color, show the mutual information between the feature and the label. What do you notice, and what conclusions can you draw?
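Here is a minimal Python sketch of how the comparison plot could be generated, assuming numpy and matplotlib are available. It implements only the mutual-information curve from Equation (2); the error-rate-reduction curve is deliberately left as a stub, since it depends on your answers to 1.1 and 1.2. All function names here are illustrative and are not part of any provided template.

```python
import numpy as np
import matplotlib.pyplot as plt

def mutual_information(n0, p0, n1, p1):
    """Equation (2) with base-2 logarithms; 0 * log(0) terms are treated as 0."""
    total = n0 + p0 + n1 + p1
    n_row, p_row = n0 + n1, p0 + p1          # label marginals (n and p)
    mi = 0.0
    for count, y_marginal, phi_marginal in [
        (n0, n_row, n0 + p0), (p0, p_row, n0 + p0),
        (n1, n_row, n1 + p1), (p1, p_row, n1 + p1),
    ]:
        if count > 0:
            mi += (count / total) * np.log2(total * count / (phi_marginal * y_marginal))
    return mi

def error_rate_reduction(n0, p0, n1, p1):
    """Fill in using your answers to 1.1 and 1.2:
    (mistakes without splitting - mistakes after splitting) / |D'|."""
    raise NotImplementedError

i_values = np.arange(1, 500)
mi_values = [mutual_information(i, 500 - i, 500 - i, i) for i in i_values]
# err_values = [error_rate_reduction(i, 500 - i, 500 - i, i) for i in i_values]

plt.plot(i_values, mi_values, color="blue", label="mutual information (bits)")
# plt.plot(i_values, err_values, color="red", label="error rate reduction")
plt.xlabel("feature index i")
plt.legend()
plt.savefig("feature_comparison.png")
```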
2 Decision Stumps and Trees [40 points]

It might be helpful for you to draw the decision boundary, in the figure, of each classifier in each part.

Consider the problem of predicting whether a person has a college degree based on age and salary. The table and graph below contain training data for 10 individuals.

  Age   Salary ($)   College Degree
  24    40,000       Yes
  53    52,000       No
  23    25,000       No
  25    77,000       Yes
  32    48,000       Yes
  52    110,000      Yes
  22    38,000       Yes
  43    44,000       No
  52    27,000       No
  48    65,000       Yes
[Figure: scatter plot of the training examples, Age (x-axis, roughly 20 to 55) versus Salary (y-axis, up to about $100,000).]
2.1 Build Two Decision Stumps [10 points]

In class, we focused on discrete features. One way to deal with continuous (or ordinal) data is to define binary features based on thresholding of continuous features like Age and Salary. For example, you might let:

φ(x) = \begin{cases} 0 & \text{if Age} \le 50 \\ 1 & \text{otherwise} \end{cases}    (3)

The choice of the threshold 50 is, on the surface, arbitrary. In practice, though, for a given feature, you should only consider the best threshold for that feature. While the search for a threshold might seem daunting, the only thresholds worth considering will be midpoints between consecutive feature values, after sorting.³

For each of the two features in this dataset (Age and Salary), generate a plot showing the training set accuracy (on the y-axis) as a function of the threshold value (on the x-axis). You will probably want to automate the calculations. Do the same for mutual information as a criterion. Report the training-set error rates of all four decision stumps. Did the criterion for selecting a threshold make a difference?
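As a rough illustration of the midpoint threshold search described above (not part of any provided materials, and using made-up names), here is a small Python sketch that scores each candidate threshold for one feature by training-set accuracy, using the data from the table:

```python
import numpy as np

# Training data from the table above: Age, Salary, College Degree (Yes = 1, No = 0).
ages     = np.array([24, 53, 23, 25, 32, 52, 22, 43, 52, 48])
salaries = np.array([40000, 52000, 25000, 77000, 48000, 110000, 38000, 44000, 27000, 65000])
labels   = np.array([1, 0, 0, 1, 1, 1, 1, 0, 0, 1])

def best_threshold_by_accuracy(values, labels):
    """Try every midpoint between consecutive sorted feature values."""
    sorted_vals = np.unique(values)                          # unique feature values, ascending
    candidates = (sorted_vals[:-1] + sorted_vals[1:]) / 2.0  # midpoints between consecutive values
    best_t, best_acc = None, -1.0
    for t in candidates:
        left, right = labels[values <= t], labels[values > t]
        # Each side of the stump predicts its own majority label.
        correct = max(left.sum(), len(left) - left.sum()) + max(right.sum(), len(right) - right.sum())
        acc = correct / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

print(best_threshold_by_accuracy(ages, labels))
print(best_threshold_by_accuracy(salaries, labels))
```

The same loop structure works for the mutual-information criterion: replace the accuracy score with the mutual information between the thresholded feature and the label.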
2.2 Build Decision Trees [14 points]

Build a decision tree for classifying whether a person has a college degree by greedily choosing threshold splits. Do this two ways: by counting errors (as shown in lecture) and using mutual information (as derived above). Show both decision trees and report their training set error rates.
2.3 Multivariate Decision Tree [12 points]

Multivariate decision trees are a generalization of the "univariate" decision trees we've discussed so far. In a multivariate decision tree, more than one attribute can be used in the decision rule for each split. That is, from a geometric point of view, splits need not be orthogonal to a feature's axis. For the same data, learn a multivariate decision tree where each split makes decisions based on the sign of the function α x_Age + β x_Income − 1. Draw your tree, including the α, β, and the information gain for each split.
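Purely as an illustrative aid (the α and β you report are up to you), here is a hypothetical Python sketch of how one candidate multivariate split could be scored by information gain, reusing a mutual-information helper like the one sketched in Section 1.3; nothing here is prescribed by the assignment.

```python
import numpy as np

def split_information_gain(alpha, beta, ages, salaries, labels, mutual_information):
    """Information gain (mutual information between the split indicator and the label)
    for the rule: go right if alpha * Age + beta * Salary - 1 >= 0.
    `salaries` holds the Salary/Income feature values."""
    goes_right = alpha * ages + beta * salaries - 1 >= 0
    n0 = int(np.sum((labels == 0) & ~goes_right))  # y = 0, left branch
    n1 = int(np.sum((labels == 0) & goes_right))   # y = 0, right branch
    p0 = int(np.sum((labels == 1) & ~goes_right))  # y = 1, left branch
    p1 = int(np.sum((labels == 1) & goes_right))   # y = 1, right branch
    return mutual_information(n0, p0, n1, p1)
```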
2.4 Discussion [4 points]

Multivariate decision trees have practical advantages and disadvantages. List an advantage and a disadvantage that multivariate decision trees have compared to univariate decision trees.

³ Imagine moving a threshold just a little bit, but not enough to change the decision for any data point. For example, the threshold in Equation 3 could be reduced to 49 or increased to 51, and there would be no effect on the classification decisions in our dataset.
3 A "Baby" No Free Lunch Theorem [15 points]

Suppose that Alice has a function f : {0,1}^m → {0,1} mapping binary sequences of length m to a label. Suppose Bob wants to learn this function. Alice will only provide labelled random examples. Specifically, Alice will provide Bob with a training set under the following data-generating distribution: the input x is a binary string of length m drawn uniformly at random, and the corresponding label is f(x) (the label is deterministic, based on Alice's function). Bob seeks to learn a function f̂ which has low expected misclassification loss. Remember to provide short explanations.
3.1 The # of inputs [3 points]

How many possible inputs are there for binary inputs of length m? And, specifically for m = 5, how many are there?
3.2 The # of functions [5 points]

The function f is one function out of how many possible functions acting on binary sequences of length m? And, specifically for m = 5, how many possible functions are there?
3.3 Obtaining low expected loss [5 points]

Assume that Bob has no knowledge about how Alice chose her function f. Suppose Bob wants to guarantee that he can learn with an expected loss that is non-trivially better than chance; e.g., suppose Bob wants an expected misclassification loss of less than 25% (or any loss that is a constant less than 50%). Roughly (in "Big-Oh" notation), how many distinct inputs x does Bob need to observe in the training set in order to guarantee such an expected loss? Again, as a special case, how many distinct inputs do you need to see for m = 5? (You do not need a precise answer for the latter, just something in the right ballpark.)

Some comments: Note that the training set size that Bob needs to observe is slightly larger than this number of distinct inputs, due to the "coupon collector's" problem, where repeats occur. You do not need to write an answer to the following question, though it is worth thinking about: is this number of distinct training examples notably different from the number of possible functions? Why or why not?
3.4 Discussion [2 points]

Now suppose Bob obtains a training set whose size is much smaller than the correct answer to part 3.3 and that, using this training set, Bob believes he has learned a function f̂ whose error rate is non-trivially better than guessing. How can Bob convince another person, say Janet, that he has learned something non-trivially better than chance? Further, suppose that Bob did convince Janet of this (assume Janet is not easily fooled): what does this say about Bob's knowledge with regard to how Alice chose her function?
4 Implementation: K-Means [40 points]

In class we discussed the K-means clustering algorithm. You will implement the K-means algorithm on digit data. A template has been prepared for you: it is named kmeans_template.py and is available on Canvas in the A1.tgz download. Your job is to fill in the functions as described below.
4.1 Data

The data are held in two files: one for the inputs and one for the labels. The file digit.txt contains the inputs: namely, 1000 observations of 157 pixels (a randomly chosen subset of the original 784) from images containing handwritten digits. The file labels.txt contains the true digit label (one, three, five, or seven, each coded as a digit). The Python template handles reading in the data; just make sure that all the files live in the same directory as your code.

The labels correspond to the digit file, so the first line of labels.txt is the label for the digit in the first line of digit.txt.
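The template already reads these files for you, but for orientation, loading them could look something like the sketch below. The whitespace delimiter is an assumption about the file format; if the files are comma-separated, pass delimiter="," instead.

```python
import numpy as np

# A minimal loading sketch; the template's own reader may differ.
X = np.loadtxt("digit.txt")    # expected shape (1000, 157): pixel features
y = np.loadtxt("labels.txt")   # expected shape (1000,): digit labels (1, 3, 5, 7)
print(X.shape, y.shape)
```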
4.2 Algorithm [15 points]

Your algorithm will be implemented as follows:

1. Initialize k starting centers by randomly choosing k points from your data set. You should implement this in the initialize function. [5 points]
2. Assign each data point to the cluster associated with the nearest of the k center points. Implement this by filling in the assign function. [5 points]
3. Re-calculate the centers as the mean vector of each cluster from (2). Implement this by filling in the update function. [5 points]
4. Repeat steps (2) and (3) until convergence. This wrapping is already done for you in the loop function, which lives inside the cluster function.
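For orientation, here is a minimal NumPy sketch of what the three functions might look like. The exact signatures in kmeans_template.py may differ, so treat this as an assumption-laden illustration rather than the required implementation.

```python
import numpy as np

def initialize(data, k, rng=np.random.default_rng()):
    """Step 1: pick k distinct data points at random as the starting centers."""
    indices = rng.choice(len(data), size=k, replace=False)
    return data[indices].copy()

def assign(data, centers):
    """Step 2: assign each point to the index of its nearest center (Euclidean distance)."""
    # distances[i, j] = distance from point i to center j
    distances = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    return np.argmin(distances, axis=1)

def update(data, assignments, k):
    """Step 3: recompute each center as the mean of the points assigned to it.
    (Assumes every cluster has at least one assigned point.)"""
    return np.array([data[assignments == j].mean(axis=0) for j in range(k)])
```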
4.3 Within-Group Sum of Squares [5 points]

One way to formalize the goal of clustering is to minimize the variation within groups, while maximizing the variation between groups. Under this view, a good model has a low "sum of squares" within each group. We define sum of squares as follows. Let µ_k (a vector) be the mean of the observations (each one a vector) in the kth cluster. Then the within-group sum of squares for the kth cluster is defined as:

SS(k) = \sum_{i \in k} \|x_i - \mu_k\|_2^2

where "i ∈ k" ranges over the set of indices i of observations assigned to the kth cluster. Note that the term \|x_i - \mu_k\|_2 is the Euclidean distance between x_i and µ_k, and therefore should be calculated as \|x_i - \mu_k\|_2 = \sqrt{\sum_{j=1}^{d} (x_i[j] - \mu_k[j])^2}, where d is the number of dimensions. Note that this term is squared in SS(k), cancelling the square root in the formula for Euclidean distance.

If there are K clusters total, then the "sum of within-group sum of squares" is the sum of all K of these individual SS(k) terms. In the code, the within-group sum of squares is referred to (for brevity) as the distortion. Your job is to fill in the function compute_distortion.
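A sketch of what compute_distortion could look like, again assuming the argument shapes used in the sketch above; adapt it to the template's actual signature.

```python
import numpy as np

def compute_distortion(data, centers, assignments):
    """Sum, over all points, of the squared Euclidean distance from each point
    to the center of the cluster it is assigned to (the total within-group sum of squares)."""
    differences = data - centers[assignments]   # vector from each point to its own center
    return float(np.sum(differences ** 2))
```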
4.4 Mistake Rate [5 points]

Typically K-means clustering is used as an exploration tool on data where true labels are not available, and might not even make sense. (For example, imagine clustering images of all of the foods you have eaten this year.) Given that, here, we know the actual label for each data point, we can analyze how well the clustering recovered the true labels.

For the kth cluster, let the observations assigned to that cluster vote on a label (using their true labels as their votes). Then let its assignment be the label with the most votes. If there is a tie, choose the digit that is smaller numerically. This is equivalent to deciding that each cluster's members will all take the same label, and choosing the label for that cluster that leads to the fewest mistakes.

For example, if for cluster two we had 270 observations labeled one, 50 labeled three, 9 labeled five, and 0 labeled seven, then that cluster would be assigned the value one and would have 50 + 9 + 0 = 59 mistakes. If we add up the total number of mistakes for each cluster and divide by the total number of observations (1000), we will get our total mistake rate, between 0 and 1. Lower is better.

Your job is to fill in the function label_clusters, which implements the label assignment. Once you have implemented this, the function compute_mistake_rate (already written) will compute the mistake rate for you.
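One possible shape for label_clusters, under the same caveat that the template's actual signature may differ; the tie-break favors the numerically smaller digit, as specified above.

```python
import numpy as np

def label_clusters(assignments, true_labels, k):
    """For each cluster, return the majority vote of its members' true labels,
    breaking ties in favor of the numerically smaller digit.
    (Assumes every cluster has at least one member.)"""
    cluster_labels = np.zeros(k)
    for j in range(k):
        members = true_labels[assignments == j]
        digits, counts = np.unique(members, return_counts=True)  # digits come back in ascending order
        # argmax returns the first maximum, i.e. the smallest digit among tied counts
        cluster_labels[j] = digits[np.argmax(counts)]
    return cluster_labels
```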
4.5 Putting it All Together [15 points]

Once you have filled in all the functions, you can run the Python script from the command line with python kmeans_template.py. The script will loop over the values K ∈ {1, 2, ..., 10}. For each value of K, it will randomly initialize 5 times using your initialization scheme, and keep the results from the run that gave the lowest within-group sum of squares. For the best run for each K, it will compute the mistake rate using your cluster labeling function.

The code will output a figure named kmeans.png. The figure will have two plots. The top one shows within-group sum of squares as a function of K. The bottom one shows the mistake rate, also as a function of K. Write down your thoughts on these results. Are they what you expected? Did you think that within-group sum of squares and mistake rate would increase or decrease as K increased? Did the plots confirm your expectations?
References

[1] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.