1 Programming (55 points)
In this assignment you will implement a Naive Bayes classifier with several variations. For this assignment, we have provided special datasets, semisup and multitask.
1.1 Naive Bayes
You will begin by implementing a Naive Bayes classifier. We’ll be using the codebase from HW1 and HW2 as a starting point. The command line algorithm argument will be nb.
1.1.1 Naive Bayes Generative Story
In this section we will describe the case where we have a single label yi per example. Naive Bayes maximizes the joint likelihood of the data and labels:
$$p(X, Y) = \prod_{i=1}^{N} p(y_i)\, p(x_i \mid y_i) = \prod_{i=1}^{N} p(y_i) \prod_{j=1}^{M} p(x_{ij} \mid y_i) \quad (1)$$
Naive Bayes makes an assumption of conditional independence of each feature given the label, which means we can rewrite the conditional likelihood of the example xi given its label yi by factoring it into the distribution over its components. The decision rule for Naive Bayes is:
$$\operatorname*{argmax}_{y_i}\ p(y_i \mid x_i) = \operatorname*{argmax}_{y_i}\ p(y_i) \prod_{j=1}^{M} p(x_{ij} \mid y_i) \quad (2)$$
In this assignment we will assume a Naive Bayes model with K = 2 classes (binary classification). Additionally, each feature will be binary, xij ∈ {0, 1}. The maximum likelihood solution for Naive Bayes gives a closed-form solution for the two distributions:
$$p(y_i = k) = \frac{|\{y_i : y_i = k\}| + 1}{N + K} \quad (3)$$
where |{yi : yi = k}| is the number of times we observe y with state k in the training data, and N is the number of training examples. Notice that we have smoothed this distribution by adding 1 to our counts, which will prevent (some) over-fitting. The conditional distribution is given by:
$$p(x_{ij} = 1 \mid y_i = k) = \frac{|\{y_i, x_{ij} : y_i = k, x_{ij} = 1\}| + 1}{|\{y_i : y_i = k\}| + 2}, \quad (4)$$
where |{yi, xij : yi = k, xij = 1}| is the number of examples with binary feature xij = 1 and yi = k. We have added smoothing here as well.
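For concreteness, here is a minimal numpy sketch of Eqs. 3 and 4, assuming X is an N×M binary array and y is a length-N array of labels in {0, 1}; the function and variable names are illustrative, not part of the provided codebase:

```python
import numpy as np

def estimate_nb_parameters(X, y, K=2):
    """Closed-form smoothed estimates of p(y = k) (Eq. 3) and p(x_j = 1 | y = k) (Eq. 4)."""
    N, M = X.shape
    prior = np.zeros(K)
    cond = np.zeros((K, M))
    for k in range(K):
        n_k = np.sum(y == k)
        prior[k] = (n_k + 1.0) / (N + K)                        # add-one smoothing over classes
        cond[k] = (X[y == k].sum(axis=0) + 1.0) / (n_k + 2.0)   # add-one smoothing over {0, 1}
    return prior, cond
```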
1.2 Semi-Supervised Naive Bayes
Your first task will be to implement a version of Naive Bayes that can be trained with both labeled and unlabeled examples. The data files we provide will contain the two labels 0 and 1, as well as a value of −1 for an unlabeled example. You will use the data files with prefix semisup. Since some variables are unobserved, you will implement an EM algorithm to train the model. Our goal is to maximize the likelihood of the observed data:
$$p(X_u, X_l, Y_l) = \sum_{Y_u} p(X_u, X_l, Y_l, Y_u) = \left( \prod_{i=1}^{N_l} p(y_i) \prod_{j=1}^{M} p(x_{ij} \mid y_i) \right) \left( \sum_{Y_u} \prod_{i=1}^{N_u} p(y_i) \prod_{j=1}^{M} p(x_{ij} \mid y_i) \right) \quad (5)$$
Xu are the unlabeled examples, Yu are the unobserved labels for those examples, and Nu is the number of unlabeled examples. The same definitions hold for Xl, Yl, Nl. Since we are interested in maximizing the log likelihood, we will have difficulty with the summation over Yu. Therefore, we will use EM to maximize the expected complete-data log likelihood, where θ are the model parameters. Here, we simplify our notation by merging the unlabeled and labeled data.
$$\begin{aligned}
Q(\theta, \theta^{(t-1)}) &= E\left[ \sum_i \log p(x_i, y_i \mid \theta) \right] \quad (6) \\
&= \sum_i E\left[ \log \prod_{k=1}^{K} \big( p(y_i = k)\, p(x_i \mid \theta_k) \big)^{I(y_i = k)} \right] \\
&= \sum_i \sum_k E\big[ I(y_i = k) \log\big( p(y_i = k)\, p(x_i \mid \theta_k) \big) \big] \\
&= \sum_i \sum_k p(y_i = k \mid x_i, \theta^{(t-1)}) \log p(y_i = k)\, p(x_i \mid \theta_k)
\end{aligned}$$
1.2.1 Expectation Maximization

Your task will be to learn the model parameters p(xij | y) and p(y) from both the labeled and unlabeled data to produce a trained Naive Bayes classifier. In our EM algorithm, we will use a hard assignment for each xi during training: each data point xi will be assigned to a single label, in contrast to a soft assignment, where examples are assigned a distribution over labels. In your E-step (Expectation), you calculate the posterior of yi given xi and the current model parameters for all xi ∈ Xu:
$$p(y_i = k \mid x_i) = \frac{p(x_i \mid y_i = k)\, p(y_i = k)}{p(x_i)}, \quad (7)$$
where $p(x_i) = \sum_y p(x_i \mid y)\, p(y)$. Remember to substitute in the factorization of p(xi | yi = k) based on the Naive Bayes conditional independence assumption. If xi ∈ Xl then yi is observed. In your M-step (Maximization), you will optimize Q, updating p(y) and p(xij | y) for each class k by simple frequency estimation as shown in Eqs. 3 and 4, except you will include the hard assignments from the E-step for the unobserved labels.
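A possible shape for the hard-EM training loop, assuming unlabeled examples are marked with a label of -1 and that helpers like estimate_nb_parameters (sketched above), initialize_labels, and log_joint (sketched in the next two subsections) are available; all of these names are our own, not part of the provided code:

```python
import numpy as np

def hard_em_train(X, y, iterations=10, K=2):
    """Hard-assignment EM for semi-supervised Naive Bayes; y contains -1 for unlabeled rows."""
    # Iteration 1: the initialization E-step (Section 1.2.2) followed by an M-step.
    y_hat = initialize_labels(y)
    prior, cond = estimate_nb_parameters(X, y_hat, K)
    for _ in range(iterations - 1):
        # E-step: posterior over classes (Eq. 7) for each unlabeled example, then a hard argmax.
        for i in np.where(y == -1)[0]:
            scores = [log_joint(X[i], k, prior, cond) for k in range(K)]
            y_hat[i] = int(np.argmax(scores))  # np.argmax returns the lowest index on ties
        # M-step: re-estimate parameters from observed labels plus hard assignments (Eqs. 3 and 4).
        prior, cond = estimate_nb_parameters(X, y_hat, K)
    return prior, cond
```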
1.2.2 Initialization
To ensure consistent results across implementations, everyone must use the same initialization method. We will initialize the learning algorithm by creating labels for every example (an E-step), which means we will need to generate a label for each unlabeled example. Assign the first unlabeled example in the data file the first label (0), the second unlabeled example the second label (1), the third label 0 again, and so on; that is, the label is i % 2, where i is the index of the unlabeled example in the data file (i.e. i counts only unlabeled examples, skipping the labeled ones). You can then proceed with running the algorithm by computing an M-step based on these initial assignments. Keep the data in the order it was loaded to avoid any inconsistencies in running the algorithm.
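A minimal sketch of this initialization (the helper name and the use of -1 for unlabeled rows are our own conventions from above):

```python
import numpy as np

def initialize_labels(y):
    """Assign 0, 1, 0, 1, ... to the unlabeled examples, in the order they were loaded."""
    y_hat = np.array(y, copy=True)
    unlabeled = np.where(y_hat == -1)[0]   # positions of the unlabeled examples, in load order
    for i, idx in enumerate(unlabeled):    # i counts only the unlabeled examples
        y_hat[idx] = i % 2
    return y_hat
```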
1.2.3 Numerical Accuracy
In Naive Bayes, and many other probabilistic models, you end up multiplying many small probabilities together. In theory this is not a problem, but in practice (i.e. using 64-bit floating point numbers), it can be. This is due to the fact that you can only represent a number so small using 64 bits at a given level of precision.¹ Because probabilities in our model can get very small, we need to use a small trick to ensure that they are not rounded to 0 in 64-bit floating point arithmetic. The trick is simply to take the log of our probabilities. Remember that products in probability space become summations in log-probability space. Given that we are taking the log of all probabilities, you will have to adapt some formulas:
$$p(x_i \mid y_i) = \prod_{j=1}^{M} p(x_{ij} \mid y_i) \quad \text{becomes} \quad \log p(x_i \mid y_i) = \sum_{j=1}^{M} \log p(x_{ij} \mid y_i) \quad (8)$$
In general, we will not give you the log-probability form. You, as statistically-enlightened computer scientists, will take our notation and translate it into the appropriate code in log space.

¹ Floating point numbers in most major languages are, even though they represent a continuous (real) number, discrete. Almost all computer scientists feel that floating point numbers are evil, but most accept that they are a necessary evil. For details on how floats work, see the Wikipedia article on doubles.
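As an illustration, the log-space score for a single class might look like the sketch below, reusing the prior and cond arrays from the earlier sketch (again, our own names):

```python
import numpy as np

def log_joint(x, k, prior, cond):
    """log p(y = k) + sum_j log p(x_j | y = k) for one binary example x (Eq. 8)."""
    # p(x_j = 1 | y = k) is cond[k, j]; p(x_j = 0 | y = k) is 1 - cond[k, j].
    log_lik = np.where(x == 1, np.log(cond[k]), np.log(1.0 - cond[k]))
    return np.log(prior[k]) + log_lik.sum()
```

If you need the normalized posterior of Eq. 7, you can subtract the log-normalizer of the per-class scores (e.g. with scipy.special.logsumexp) before exponentiating.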
1.3 Multi-Task Naive Bayes
We consider an extension of Naive Bayes where, instead of predicting a single label y for each example, you will be asked to predict T distinct binary labels for each example. This is a multi-task setting where you are jointly modeling T different tasks instead of a single task. We will assume that in this case you observe every label y (no semi-supervised training). We introduce the notation y^t_i to represent the t-th label for the i-th example. You will use the data files with prefix multitask. These data files have a slightly different format. The first column will contain the label, but instead of a single integer it will now contain a sequence of four digits, each of which is either 0 or 1. The first digit is the first label, the second digit the second label, and so on. For example, the label 0110 should be interpreted as y^0_i = 0, y^1_i = 1, y^2_i = 1, y^3_i = 0. You will determine T from the length of these labels. When you make predictions, you will encode your predictions in the same format. Please modify your code to support this new format. We will provide you with a new version of compute_accuracy.py that supports this label type.
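A small sketch of reading and writing this label format, assuming the label column arrives as a string such as "0110" (the function names are ours):

```python
def parse_multitask_label(label_str):
    """'0110' -> [0, 1, 1, 0]; T is simply the length of the label string."""
    return [int(c) for c in label_str]

def encode_multitask_label(labels):
    """[0, 1, 1, 0] -> '0110', for writing predictions in the same format."""
    return "".join(str(int(l)) for l in labels)
```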
1.3.1 Independent
A simple approach to predicting binary labels for T distinct tasks is to create T distinct Naive Bayes classifiers. This will be our first approach, which we call “independent”. You can re-use your code from the first part of the assignment and modify it as necessary. For this part, you will run with the --independent-mode independent command line argument described in section 1.4.
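Conceptually, the independent mode reduces to a loop like the following sketch, where NaiveBayes stands in for whatever single-task classifier you built in Section 1.1; the class, its fit/predict methods, and the assumption that Y is an N×T numpy array are ours, not provided code:

```python
def train_independent(X, Y, T):
    """Y is an N x T array of binary labels; train one Naive Bayes model per task."""
    models = []
    for t in range(T):
        model = NaiveBayes()
        model.fit(X, Y[:, t])   # fit the t-th task on its own label column
        models.append(model)
    return models

def predict_independent(models, X):
    """For each example, collect the T per-task predictions in order."""
    return [[model.predict(x) for model in models] for x in X]
```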
1.3.2 Joint
The disadvantage to the independent approach is that we fail to capture any correlations between the T distinct labels. Instead, we will consider a joint model that learns these correlations.
$$\operatorname*{argmax}_{y_i}\ p(y_i \mid x_i) = \operatorname*{argmax}_{y_i}\ p(y_i) \prod_{j=1}^{M} p(x_{ij} \mid y_i) \quad (9)$$
where yi is a vector containing the T labels for the i-th data point. In this case, we learn separate distributions for each combination of labels (i.e. on the order of 2^T parameters). Implement this with --independent-mode joint as described in section 1.4.
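One simple way to realize the joint model is to map each label combination to a single class index and reuse the binary machinery with 2^T classes; a sketch of such an encoding (our choice, not mandated by the assignment):

```python
def combo_to_index(labels):
    """[0, 1, 1, 0] -> 6: read the T binary labels as the bits of an integer."""
    index = 0
    for l in labels:
        index = index * 2 + int(l)
    return index

def index_to_combo(index, T):
    """Inverse mapping, so joint predictions can be written back out as T digits."""
    return [(index >> (T - 1 - t)) & 1 for t in range(T)]
```

With such an encoding, the counting estimators of Eqs. 3 and 4 can be reused with K = 2^T classes.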
1.3.3 Latent Factorization
While the joint case allows us to model complex interactions between the labels, there are now a lot of parameters, which means we are more likely to overfit. We will consider another solution: introduce a latent variable that captures correlations between the labels through dimensionality reduction. We introduce a categorical latent variable z and modify the model as follows:
$$p(y_i, x_i) = \sum_{z_i} \prod_{l} \prod_{j} p(x_{ij} \mid y_i^l)\, p(y_i^l \mid z_i)\, p(z_i) \quad (10)$$
Since zi is an unobserved variable we must use EM; in this case, we will never observe zi. The number of values zi can take on reflects how richly we want to model the interactions between the labels. We will add this as a command line argument latent-states <int> as described in section 1.4. To ensure consistent implementations, you will initialize your model by first setting the values for zi (E-step). If there are latent-states possible states for z, then set zi = i % latent-states. Implement this with --independent-mode latent as described in section 1.4.
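A minimal sketch of the required z initialization (the function name is illustrative):

```python
import numpy as np

def initialize_latent_states(num_examples, latent_states):
    """Set z_i = i % latent_states, with i the example's position in the loaded data."""
    return np.arange(num_examples) % latent_states
```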
1.4 Implementation Details
We recommend working with numpy structures like arrays as much as possible since they provide good support for working with probability distributions.
1.4.1 Selecting Model Assumptions

You must define a new command line option to indicate the modeling assumptions you are making: independent-mode, with choices independent, joint, and latent.
parser.add_argument("--independent-mode", type=str, help="Which modeling assumptions we are making.", default=’independent’)
1.4.2 Tie-Breaking
When assigning an instance to a class, you may encounter ties, where two classes receive the same probability (or log-probability). In case of a tie, assign the instance to the class with the lower index.
1.4.3 Log of Zero
It is possible that a probability will be 0; this may happen when the value is close enough to 0 that floating point rounding makes it exactly 0 due to machine precision. In this case, you can treat the value of log(0) as float('-inf').
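A one-line guard for this case (a sketch; any equivalent handling is fine):

```python
import math

def safe_log(p):
    """Return log(p), treating log(0) as -inf rather than raising a domain error."""
    return math.log(p) if p > 0.0 else float('-inf')
```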
1.4.4 Number of Iterations
The default number of iterations is 10. An iteration is defined as computing 1 E-step and 1 M-step of the algorithm (in that order). This includes the initialization E-step. Add a command line option for this parameter: training-iterations
parser.add_argument("--training-iterations", type=int, help="The number of iterations.", default=10)
1.4.5 Latent States
You will pass in a parameter for the number of values your latent variable z can take on. The default value will be 3. Add a command line option for this parameter: latent-states
parser.add_argument("--latent-states", type=int, help="The number of latent states.", default=3)
1.4.6 Evaluation
We will use a hard accuracy in compute_accuracy.py, meaning that a prediction gets credit only when all of its labels are correct. If you’d like to compare results with each other at a finer-grained level, we recommend implementing a relaxed version of this that computes the accuracy of individual labels.
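For intuition, the two metrics might look like the sketch below, assuming gold and pred are lists of length-T label lists (the graded metric is the hard one computed by the provided compute_accuracy.py; these helper names are ours):

```python
def hard_accuracy(gold, pred):
    """Credit an example only when every one of its T labels is correct."""
    correct = sum(1 for g, p in zip(gold, pred) if list(g) == list(p))
    return correct / float(len(gold))

def relaxed_accuracy(gold, pred):
    """Average accuracy over individual labels, useful for finer-grained comparison."""
    total = sum(len(g) for g in gold)
    correct = sum(1 for g, p in zip(gold, pred)
                  for gt, pt in zip(g, p) if gt == pt)
    return correct / float(total)
```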
1.4.7 How Your Code Will Be Called

You may use the following commands to test your algorithms.
python classify.py --mode train --algorithm nb --model-file semisup.nb \
    --data semisup.train --training-iterations 10

python classify.py --mode train --algorithm nb --model-file multitask.nb \
    --data multitask.train --independent-mode independent

python classify.py --mode train --algorithm nb --model-file multitask.nb \
    --data multitask.train --independent-mode joint

python classify.py --mode train --algorithm nb --model-file multitask.nb \
    --data multitask.train --training-iterations 10 --independent-mode latent --latent-states 3

To run the trained model on development data:
python classify.py --mode test --algorithm nb --model-file semisup.nb.model \
    --data semisup.dev --predictions-file semisup.dev.predictions

python classify.py --mode test --algorithm nb --model-file multitask.nb.model \
    --data multitask.dev --predictions-file multitask.dev.predictions --independent-mode independent

python classify.py --mode test --algorithm nb --model-file multitask.nb.model \
    --data multitask.dev --predictions-file multitask.dev.predictions --independent-mode joint

python classify.py --mode test --algorithm nb --model-file multitask.nb.model \
    --data multitask.dev --predictions-file multitask.dev.predictions --independent-mode latent
2 Analytical (45 points)
1) Clustering (9 points)

K-medoids (https://en.wikipedia.org/wiki/K-medoids) is an algorithm similar to K-means. Both K-means and K-medoids attempt to minimize the squared error, but unlike K-means, K-medoids chooses a provided example as a cluster center (a medoid) rather than the mean of a subset of the examples. Therefore, instead of selecting a new cluster center µ to be the mean of the datapoints assigned to the cluster, you select the datapoint that has the smallest average Euclidean distance to the other points in the cluster. In case of a tie, select the datapoint with the lowest index (i.e. if x0 and x2 are tied, select x0).
x0: (3, 1)
x1: (2, 1)
x2: (2, 4)
x3: (3, 3)
x4: (2, 2)
x5: (1, 5)

Table 1: Datapoints for Question 1
1. For the dataset, run the K-medoids algorithm for two iterations, with k = 2 clusters. Select your initial cluster medoids to be µ0 = x0 and µ1 = x1. What are the final centers (i.e. µ0 and µ1)? Which data points are assigned to each cluster (cluster 0 and cluster 1)?
2. For the dataset, run the K-means algorithm for two iterations, with k = 2 clusters. Select your initial cluster means to be µ0 = x0 and µ1 = x1. What are the final centers (i.e. µ0 and µ1)? Which data points are assigned to each cluster (cluster 0 and cluster 1)?
3. What are the benefits of the K-medoids algorithm, compared to K-means (briefly, in no more than three sentences)?
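If it helps to check your hand computations for the K-medoids question, the medoid update rule described above can be written as a short sketch like the following (points as 2-D tuples; the function name is illustrative):

```python
import math

def choose_medoid(cluster):
    """Pick the point with the smallest average Euclidean distance to the others in the cluster.
    min() keeps the earliest point on ties, matching the lowest-index tie-break rule
    as long as the cluster list preserves the original datapoint order."""
    if len(cluster) == 1:
        return cluster[0]
    def avg_dist(p):
        return sum(math.dist(p, q) for q in cluster) / (len(cluster) - 1)
    return min(cluster, key=avg_dist)
```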
[Figure 1 (two panels over nodes x0 through x15): (a) Directed Graphical Model, (b) Undirected Graphical Model.]

Figure 1: These two graphs are the same. However, as (a) is directed and (b) is undirected, the two graphs have different conditional independence interpretations.
2) D-separation in Graphical Models (16 points)

Consider the Bayesian Network in Figure 1a and the Markov Random Field in Figure 1b. For each of the following definitions of A, B, and C, decide whether the sets A and B are d-separated given the set C. Justify each answer.

1. A = {x4}, B = {x2}, C = {x5, x6}
2. A = {x9}, B = {x11}, C = {x14}
3. A = {x12}, B = {x10}, C = {x15, x13}
4. A = {x1, x2}, B = {x12, x11}, C = {x14, x13}
3) Factorization of MRFs (15 points)

The probability density function of most Markov Random Fields cannot be factorized as the product of a few conditional probabilities. This question explores some MRFs which can be factorized in this way. Consider the graph structure in Figure 2.

[Figure 2: The Original Undirected Graph, over nodes X1, X2, X3.]

From this graph, we know that X2 and X3 are conditionally independent given X1. We can draw the corresponding directed graph as Figure 3. This suggests the following factorization of the joint probability:
$$P(X_1, X_2, X_3) = P(X_3 \mid X_1)\, P(X_2 \mid X_1)\, P(X_1)$$
Now consider the graphical models in Figures 4, 5, and 6.
[Figure 3: The Converted Directed Graph, over nodes X1, X2, X3.]

[Figure 4: An Undirected Graph over nodes X1 through X7.]

[Figure 5: An Undirected Graph over nodes X1 through X7.]

[Figure 6: An Undirected Graph over nodes X1 through X7.]
1. Following the graph in Figure 4, decide whether you can write a factorization of the joint distribution P(X1,X2,X3,X4,X5,X6,X7). If so, write the factorization. If there is no factorization, justify why there isn’t a valid factorization briefly (no more than three sentences).
2. Following the graph in Figure 5, decide whether you can write a factorization of the joint distribution P(X1,X2,X3,X4,X5,X6,X7). If so, write the factorization. If there is no factorization, justify why there isn’t a valid factorization briefly (no more than three sentences).
3. Following the graph in Figure 6, decide whether you can write a factorization of the joint distribution P(X1,X2,X3,X4,X5,X6,X7). If so, write the factorization. If there is no factorization, justify why there isn’t a valid factorization briefly (no more than three sentences).
4) True/False (10 points)

For each of the following statements, decide whether they are true or false. If false, provide a justification for your answer.
1. The expectation maximization algorithm will eventually converge to the global maximum of the likelihood function.
2. Deep neural networks and graphical models can both represent relationships between dependent features.
3. The non-convexity of the K-Means likelihood function means that it is very sensitive to initialization.
4. A valid Markov blanket of a node cannot consist of all other nodes in the graph.
5. Valid directed graphical models can have cycles.
3 What to Submit
In this assignment you will submit two things.
1. Submit your code (.py files) to cs475.org as a zip file. Your code must be uploaded as code.zip with your code in the root directory. By ‘in the root directory,’ we mean that the zip should contain *.py at the root (./*.py) and not in any sort of substructure (for example hw1/*.py). One simple way to achieve this is to zip using the command line, where you include files directly (e.g., *.py) rather than specifying a folder (e.g., hw1):
zip code.zip *.py
A common mistake is to use a program that automatically places your code in a subfolder. It is your job to make sure you have zipped your code correctly. We will run your code using the exact command lines described earlier, so make sure it works ahead of time, and make sure that it doesn’t crash when you run it on the test data. A common mistake is to change the command line flags. If you do this, your code will not run. Remember to submit all of the source code, including what we have provided to you. We will include requirements.txt and provide the data, but nothing else.
2. Submit your writeup to gradescope.com. Your writeup must be compiled from latex and uploaded as a PDF. The writeup should contain all of the answers to the analytical questions asked in the assignment. Make sure to include your name in the writeup PDF and to use the provided latex template for your answers. You will submit this to the assignment called “Homework 4: EM and Graphical Models: Written”.
3. There is no Jupyter notebook for this assignment.