$35
Page 1 of 2
COSC 4570/5010 Data Mining
Homework #4
Submission guideline You need to submit only one .zip file. Please name the file as “Your Net
id_Homework4.zip”.
1. Problems from the book (Introduction to Data Mining 2nd Edition by Tan,
Steinbach et al.)
Solve the following:
Chapter 7: Problems 18 and 29.
OR
Problems from the book (Introduction to Data Mining 1st Edition by Tan,
Steinbach et al.)
Solve the following:
Chapter 8: Problems 18 and 29.
2. Clustering
Normalized Mutual Information (NMI) is used to evaluate clustering results when the actual
clustering of the data (the number of clusters and the clustering assignments) is known beforehand. NMI can be calculated using Equation 1, where � and ℎ are clusters from two different
clusterings, �$ and �% are the number of data points in the clusters ℎ and � respectively, �$,% is
the number of data points in cluster ℎ as well as cluster �, and � is the size of the dataset.
��� = ∑ -.,/ log3 3.,/
3.3/ .,/
45∑ -. log3.
. 3 6 45∑ -/ log3/
/ 3 6
(1)
what are the maximum and minimum values for the NMI? Provide details.
Page 2 of 2
3. Community Detection
Download and read the Karate Club network (you can get from the Github repository of
[DSCN]1 or from here http://www-personal.umich.edu/~mejn/netdata/karate.zip). The story
behind the data set is quite simple: There was a Karate Club that had an administrator “John A”
and an instructor “Mr. Hi” (both pseudonyms). Then a conflict arose between them, causing the
students (Nodes) to split into two groups. One that followed John and one that followed Mr. Hi.
a) Compute the following statistics describing the datasets:
• number of nodes �
• number of edges �
Present your results.
b) Preprocess the data.
• store the ground truth i.e. the club each student joined.
club ‘John A’ members were the students with following ids. 1, 2, 3, 4, 5, 6, 7,
8, 11, 12, 13, 14, 17, 18, 20, 22 whereas the rest of the students joined the club
‘Mr. Hi’.
• transform the data graph into adjacency matrix.
c) Write a program that computes spectral clustering (try different types of affinity).
Comment your code. Submit your code. Report the following.
• compare the clustering result with the stored ground truth.
• runtime of the algorithm (HINT: to get this number re-execute your
computation 5-10 times and take the mode runtime).
1 https://github.com/datascienceandcomplexnetworks/book_code/tree/master/Notebook_Chapter_IV_WWW/data