CSE4334/5334 Data Mining Assignment 2
What to turn in:
1. Your submission should include your complete code base in an archive file (zip, tar.gz), organized into
directories (q1/, q2/, and so on), and a clear README describing how to run it.
2. A brief report (typed up and submitted as a PDF file; NO handwritten scanned copies) describing what you
solved and implemented, along with any known failure cases. The report is important since your grade will
be based mostly on it.
3. Submit your entire code and report to Blackboard.
Notes from instructor:
• Start early!
• You may ask the TA or instructor for suggestions, and discuss the problem with others (minimally).
But all parts of the submitted code must be your own.
• Use MATLAB or Python for your implementation.
• Make sure that the TA can easily run the code by plugging in our test data.
Problem 1
(Naive Bayes, 100pts) Generate 1000 training instances in two different classes (500 in each) from a multivariate normal distribution using the following parameters for each class
\mu_1 = [1, 0], \quad \mu_2 = [0, 1], \quad
\Sigma_1 = \begin{bmatrix} 1 & 0.75 \\ 0.75 & 1 \end{bmatrix}, \quad
\Sigma_2 = \begin{bmatrix} 1 & 0.75 \\ 0.75 & 1 \end{bmatrix}
(1)
and label them 0 and 1. Then, generate testing data in the same manner with 500 instances for each class,
i.e., 1000 in total.
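A minimal sketch of this data generation, assuming Python with NumPy (the fixed seed and the helper name make_set are illustrative choices, not part of the assignment):

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so runs are reproducible

mu1, mu2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.75],
                  [0.75, 1.0]])  # same covariance for both classes, per Eq. (1)

def make_set(n_per_class):
    """Draw n_per_class points from each class and stack them with labels 0/1."""
    X0 = rng.multivariate_normal(mu1, Sigma, n_per_class)   # class 0
    X1 = rng.multivariate_normal(mu2, Sigma, n_per_class)   # class 1
    X = np.vstack([X0, X1])
    Y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])
    return X, Y

X_train, Y_train = make_set(500)   # 1000 training instances
X_test,  Y_test  = make_set(500)   # 1000 testing instances
```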
1. (30pt) Implement your Naive Bayes classifier [pred, posterior, err] = myNB(X, Y, X_test, Y_test),
whose inputs are the training data X, its labels Y, the testing data X_test, and its labels Y_test, and
which returns the predicted labels pred, the posterior probability posterior with which each prediction
was made, and the error rate err. Assume a Gaussian (normal) distribution on the data: two parameters,
µ and σ, determine the probability density function (pdf). You can use functions such as normpdf or
pdf in MATLAB (or equivalent functions in Python) to obtain the likelihood from the Gaussian pdf. The
derivation of Naive Bayes looks complicated, but its actual implementation is simple if you understand
the concept of the Naive Bayes classifier (you only need the last few slides of our lecture slides for this
topic). A sketch of one possible implementation is given after this list.
2. (10pt) Perform prediction on the testing data with your code. In your report, report the accuracy,
precision, and recall, as well as a confusion matrix. Also include in the report a scatter plot of the data
points with color-coded labels (i.e., samples in the same class should have the same color). A sketch of
the metric computations is given after this list.
3. (20pt) In your training data, change the number of examples in each class to {10, 20, 50, 100, 300, 500}
and perform prediction on the testing data with your code. In your report, show a plot of how the accuracy
changes with the number of training examples and write a brief observation.
4. (10pt) Now, in your training data, change the number of examples in class 0 to 700 and in class 1 to
300. Perform prediction on the testing dataset. How does the accuracy change? Why does it change?
Write your own observation.
5. (30pt) Write code to plot an ROC curve and calculate the Area Under the Curve (AUC) based on the
posterior for class 1 (i.e., the confidence measure for class 1 is its posterior). The implementation must
be your own, without using an existing library that draws the curve for you. Report the ROC curves for
the two cases discussed in P1-2 and P1-4 above (i.e., one with an equal class distribution and one with
an unequal class distribution in the training data). A sketch of one way to compute the ROC points and
AUC is given after this list.
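For item 1, the following is a minimal sketch of the requested myNB signature, assuming Python with NumPy and SciPy (scipy.stats.norm.pdf plays the role of MATLAB's normpdf) and per-class, per-feature Gaussian parameters estimated from the training data. It is one possible implementation under those assumptions, not the required one.

```python
import numpy as np
from scipy.stats import norm  # Gaussian pdf, analogous to MATLAB's normpdf

def myNB(X, Y, X_test, Y_test):
    """Gaussian naive Bayes: returns predicted labels, class posteriors, and error rate."""
    classes = np.unique(Y)
    priors, means, stds = [], [], []
    for c in classes:
        Xc = X[Y == c]
        priors.append(len(Xc) / len(X))        # prior P(c) from class frequency
        means.append(Xc.mean(axis=0))          # per-feature mean for class c
        stds.append(Xc.std(axis=0, ddof=1))    # per-feature std for class c

    # Likelihood under the naive (feature-independence) assumption, times the prior
    scores = np.zeros((len(X_test), len(classes)))
    for k in range(len(classes)):
        likelihood = norm.pdf(X_test, loc=means[k], scale=stds[k]).prod(axis=1)
        scores[:, k] = priors[k] * likelihood

    posterior = scores / scores.sum(axis=1, keepdims=True)  # normalize to posteriors
    pred = classes[np.argmax(posterior, axis=1)]
    err = np.mean(pred != Y_test)
    return pred, posterior, err
```

A call would then look like pred, posterior, err = myNB(X_train, Y_train, X_test, Y_test), with posterior[:, 1] holding the class-1 posterior used later for the ROC curve.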
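For item 2, a minimal sketch of the requested metrics, again assuming Python with NumPy and treating class 1 as the positive class (that choice, and the helper name report_metrics, are assumptions for illustration):

```python
import numpy as np

def report_metrics(Y_test, pred, positive=1):
    """Confusion matrix, accuracy, precision, and recall with `positive` as the positive class."""
    tp = np.sum((pred == positive) & (Y_test == positive))
    tn = np.sum((pred != positive) & (Y_test != positive))
    fp = np.sum((pred == positive) & (Y_test != positive))
    fn = np.sum((pred != positive) & (Y_test == positive))
    confusion = np.array([[tn, fp],
                          [fn, tp]])            # rows: true class, columns: predicted class
    accuracy  = (tp + tn) / len(Y_test)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall    = tp / (tp + fn) if (tp + fn) else 0.0
    return confusion, accuracy, precision, recall
```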
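For item 5, one way to build the ROC curve and AUC from scratch is to sweep a threshold over the class-1 posterior and accumulate true- and false-positive rates. This is a sketch assuming Python with NumPy, the class-1 posterior as the score, and trapezoidal integration for the AUC; the function name roc_auc is illustrative.

```python
import numpy as np

def roc_auc(Y_test, score_class1):
    """ROC points (FPR, TPR) from thresholding the class-1 score, plus AUC by the trapezoid rule."""
    order = np.argsort(-score_class1)          # sort samples from most to least confident
    y = (Y_test[order] == 1)                   # True where the sorted sample is actually class 1
    P, N = y.sum(), (~y).sum()

    tpr = np.concatenate([[0.0], np.cumsum(y) / P])    # true-positive rate at each threshold
    fpr = np.concatenate([[0.0], np.cumsum(~y) / N])   # false-positive rate at each threshold
    auc = np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0)  # trapezoidal area
    return fpr, tpr, auc

# Example use, assuming posterior[:, 1] is the class-1 posterior from myNB:
# import matplotlib.pyplot as plt
# fpr, tpr, auc = roc_auc(Y_test, posterior[:, 1])
# plt.plot(fpr, tpr); plt.xlabel("FPR"); plt.ylabel("TPR"); plt.title(f"ROC (AUC = {auc:.3f})")
```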
Instructor: W. H. Kim (won.kim@uta.edu), TA: Xin Ma (xin.ma@mavs.uta.edu)