COSI 126A: Homework 1
Section I: Clustering Problems
Problem 1 (5 points)
Many partitional clustering algorithms automatically determine the number of clusters, and this is often claimed as an advantage. List two situations in which this is not the case.
Problem 2 (5 points)
Identify the clusters below using the center-, contiguity-, and density-based definitions. Also indicate the number of clusters for each case and give a brief indication of your reasoning. Note that darkness or the number of dots indicates density. If it helps, assume center-based means K-means, contiguity-based means single link, and density-based means DBSCAN.
Problem 3 (10 points)
Hierarchical clustering algorithms require O(m² log(m)) time, and consequently, are impractical to use directly on larger data sets. One possible technique for reducing the time required is to sample the data set. For example, if K clusters are desired and √m points are sampled from the m points, then a hierarchical clustering algorithm will produce a hierarchical clustering in roughly O(m) time. K clusters can be extracted from this hierarchical clustering by taking the clusters on the Kth level of the dendrogram. The remaining points can then be assigned to a cluster in linear time, by using various strategies. To give a specific example, the centroids of the K clusters can be computed, and then each of the m − √m remaining points can be assigned to the cluster associated with the closest centroid.
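For concreteness, here is a minimal sketch of this sample-then-assign strategy, assuming Euclidean data in a numpy array and using scipy's hierarchical clustering; the function name and the maxclust dendrogram cut are illustrative choices, not part of the problem statement.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    def sample_then_assign(X, k, seed=0):
        # Sample sqrt(m) points, hierarchically cluster them, and assign
        # every point to the nearest resulting centroid.
        rng = np.random.default_rng(seed)
        m = len(X)
        idx = rng.choice(m, size=int(np.sqrt(m)), replace=False)
        sample = X[idx]
        # Single-link clustering of the sample, cut into k clusters.
        Z = linkage(sample, method="single")
        sample_labels = fcluster(Z, t=k, criterion="maxclust")
        # Centroid of each sample cluster (fcluster labels run 1..k).
        centroids = np.array([sample[sample_labels == c].mean(axis=0)
                              for c in range(1, k + 1)])
        # Nearest-centroid assignment for all m points: linear in m.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        return dists.argmin(axis=1)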
For each of the following types of data or clusters, discuss briefly if (1) sampling will cause
problems for this approach and (2) what those problems are. Assume that the sampling
technique randomly chooses points from the total set of m points and that any unmentioned
characteristics of the data or clusters are as optimal as possible. In other words, focus only
on problems caused by the particular characteristic mentioned. Finally, assume that K is
very much less than m.
(a) Data with very different sized clusters.
(b) High-dimensional data.
(c) Data with outliers, i.e., atypical points.
(d) Data with highly irregular regions.
(e) Data with globular clusters.
(f) Data with widely different densities.
(g) Data with a small percentage of noise points.
(h) Non-Euclidean data.
(i) Euclidean data.
(j) Data with many and mixed attribute types.
Problem 4 (8 points)
(a) Compute the entropy and purity for each cluster in the confusion matrix below.
(b) Compute the total entropy and total purity.
(c) Compute the following F-measure: F(#3, Water)
Cluster   Normal   Water   Grass   Fire   Electric   Ground   Flying   Ghost
#1             8      22       0      0        767        4       45      22
#2           654      34      89    123         12       76       13       2
#3             6     301       2      3         98       23       31    1001
#4             4      21      34      2          3      543      112       0
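For reference, the standard definitions (the usual textbook conventions; defer to your course notes if they differ): writing m_ij for the entry in row i and column j above, m_i for the size of cluster i, m_j for the size of class j, and p_ij = m_ij / m_i,

    e_i = -\sum_j p_{ij} \log_2 p_{ij}, \qquad \mathrm{purity}_i = \max_j p_{ij}

The totals in (b) are the cluster-size-weighted averages, e.g. \sum_i (m_i / m)\, e_i. For (c), with \mathrm{prec}(i,j) = m_{ij} / m_i and \mathrm{rec}(i,j) = m_{ij} / m_j,

    F(i,j) = \frac{2 \cdot \mathrm{prec}(i,j) \cdot \mathrm{rec}(i,j)}{\mathrm{prec}(i,j) + \mathrm{rec}(i,j)}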
Problem 5 (7 points)
(a) Given the set of cluster labels and similarity matrix shown below, compute the correlation between the similarity matrix and the ideal similarity matrix, i.e., the matrix whose ij-th entry is 1 if the two objects belong to the same cluster and 0 otherwise.
(b) Compute the silhouette coefficient for each point, each of the three clusters, and the
overall clustering.
Point Cluster Label
P1 1
P2 1
P3 2
P4 2
P5 3
       P1     P2     P3     P4     P5
P1   1.00   0.92   0.33   0.61   0.82
P2   0.92   1.00   0.43   0.01   0.22
P3   0.33   0.43   1.00   0.75   0.11
P4   0.61   0.01   0.75   1.00   0.17
P5   0.82   0.22   0.11   0.17   1.00
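As a recap of the quantities involved (one assumption flagged here: the matrix above holds similarities, so a conversion such as d = 1 − s is needed before computing silhouettes; use whichever convention your course materials specify): with a(i) the average distance from point i to the other points of its cluster, and b(i) the smallest average distance from i to the points of any other cluster,

    s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}

The per-cluster and overall coefficients are the averages of s(i) over that cluster and over all five points, respectively. For part (a), take the Pearson correlation between corresponding entries of the two matrices over the pairs of distinct points.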
Section II: Programming
Please submit a Python package implementing the features specified below. A skeleton file
will be provided.
Part I: Cluster Validity (15 points)
1. Implement the following external measurements:
(a) Accuracy
(b) Normalized Mutual Information (a minimal sketch follows this list)
(c) Normalized Rand Index
2. Implement the following internal measurements:
(a) Silhouette Index
(b) Clustering Validation Index based on Nearest Neighbors (CVNN)
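To illustrate the flavor of measure item 1(b) asks for, here is a minimal NMI sketch computed from the contingency table. The function name and the square-root-of-entropies normalization are assumptions of this sketch; see the Wu et al. (2009) reference below for the normalization variants.

    import numpy as np

    def nmi(labels_true, labels_pred):
        # Normalized mutual information: I(U, V) / sqrt(H(U) * H(V)).
        y, p = np.asarray(labels_true), np.asarray(labels_pred)
        n = len(y)
        _, ci = np.unique(y, return_inverse=True)
        _, cj = np.unique(p, return_inverse=True)
        cont = np.zeros((ci.max() + 1, cj.max() + 1))
        np.add.at(cont, (ci, cj), 1)          # contingency counts n_ij
        pij = cont / n
        pi = pij.sum(axis=1, keepdims=True)   # class marginals
        pj = pij.sum(axis=0, keepdims=True)   # cluster marginals
        nz = pij > 0
        mi = (pij[nz] * np.log(pij[nz] / (pi @ pj)[nz])).sum()
        h_u = -(pi[pi > 0] * np.log(pi[pi > 0])).sum()
        h_v = -(pj[pj > 0] * np.log(pj[pj > 0])).sum()
        return mi / np.sqrt(h_u * h_v)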
References:
• Liu, Y., Li, Z., Xiong, H., Gao, X., Wu, J., & Wu, S. (2013). “Understanding and enhancement of internal clustering validation measures.” IEEE Transactions on Cybernetics, 43(3), 982-994.
• Wu, J., Xiong, H., & Chen, J. (2009, June). “Adapting the right measures for k-means clustering.” Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 877-886). ACM.
Part II: K-Means Clustering (20 points)
1. Implement Lloyd’s K-means algorithm; a minimal sketch follows this part’s task list. Ensure that it works on the basic dataset three_globs.csv. Provide the following three initialization methods:
(a) Random
(b) K-means++
(c) Global K-means
2. Implement Hartigan’s K-means algorithm.
3. Use the internal measurements from Part I to find the proper cluster number for the image_segmentation.csv dataset.
4. Given the true cluster number, run your Lloyd’s K-means algorithm on the image_segmentation.csv dataset, and evaluate the results in terms of the external measurements completed in Part I.
5. Run the algorithm ten times, and record the average and standard deviation of the evaluation scores, along with the execution time.
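As referenced in item 1, here is a minimal sketch of Lloyd’s iteration with k-means++ seeding, assuming Euclidean data as a 2-D numpy array; names are illustrative and should be adapted to the provided skeleton.

    import numpy as np

    def kmeanspp_init(X, k, rng):
        # k-means++: pick each new center with probability proportional to
        # squared distance from the nearest center chosen so far.
        centers = [X[rng.integers(len(X))]]
        for _ in range(k - 1):
            d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
            centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
        return np.array(centers)

    def lloyd(X, k, n_iter=100, seed=0):
        # Lloyd's iteration: assign to nearest center, then recompute centroids.
        rng = np.random.default_rng(seed)
        centers = kmeanspp_init(X, k, rng)
        for _ in range(n_iter):
            labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
            new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                            else centers[j] for j in range(k)])
            if np.allclose(new, centers):
                break
            centers = new
        return labels, centers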
References:
• Arthur, D., & Vassilvitskii, S. (2007, January). “k-means++: The advantages of careful seeding.” Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (pp. 1027-1035). Society for Industrial and Applied Mathematics.
• Likas, A., Vlassis, N., & Verbeek, J. J. (2003). “The global k-means clustering algorithm.” Pattern Recognition, 36(2), 451-461.
Part III: DBSCAN (15 points)
1. Implement your own DBSCAN. A 5-point bonus will be awarded for an implementation within 40 lines (a structural sketch follows this part’s tasks).
2. Run your algorithm on the anthill.csv dataset. You may use the parameters recommended in the textbook or choose your own. Evaluate the performance in terms of the external measurements in Part I.
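As referenced above, a sketch of the classic expand-cluster structure, assuming Euclidean distances and a dense pairwise-distance matrix; it is written for clarity, not for the 40-line bonus, and is not a reference solution.

    import numpy as np

    def dbscan(X, eps, min_pts):
        # Label -1 marks noise/unassigned; clusters grow from core points
        # (points with at least min_pts neighbors within eps).
        m = len(X)
        dist = np.linalg.norm(X[:, None] - X[None], axis=2)
        neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(m)]
        labels = np.full(m, -1)
        cluster = 0
        for i in range(m):
            if labels[i] != -1 or len(neighbors[i]) < min_pts:
                continue                      # already claimed, or not core
            labels[i] = cluster
            frontier = list(neighbors[i])
            while frontier:
                j = frontier.pop()
                if labels[j] == -1:
                    labels[j] = cluster
                    if len(neighbors[j]) >= min_pts:
                        frontier.extend(neighbors[j])   # j is core: expand
            cluster += 1
        return labels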
Part IV: Spectral Clustering and Kernel K-means (15 points)
1. Implement your own spectral clustering algorithm (a minimal sketch follows this list).
2. Implement your own Kernel K-means algorithm.
3. Run both algorithms on the eye.csv dataset. Demonstrate empirically whether or not they produce the same clustering.
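As referenced in item 1, a minimal sketch of one spectral clustering variant: Gaussian affinity with an assumed bandwidth sigma, the unnormalized Laplacian, and K-means on the spectral embedding. (Dhillon et al. below treats the normalized-cut form whose relation to Kernel K-means this part probes.)

    import numpy as np
    from scipy.cluster.vq import kmeans2

    def spectral_clustering(X, k, sigma=1.0):
        # Affinity -> Laplacian -> k smallest eigenvectors -> K-means.
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        W = np.exp(-d2 / (2.0 * sigma ** 2))   # Gaussian (RBF) affinity
        np.fill_diagonal(W, 0.0)
        L = np.diag(W.sum(axis=1)) - W         # unnormalized Laplacian L = D - W
        _, vecs = np.linalg.eigh(L)            # eigenvalues in ascending order
        embedding = vecs[:, :k]
        _, labels = kmeans2(embedding, k, minit="++")
        return labels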
References:
• Dhillon, I. S., Guan, Y., & Kulis, B. (2004, August). “Kernel k-means: Spectral clustering and normalized cuts.” Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 551-556). ACM.
Final Notes
• Do not hard-code your work. Expect that it will be run on datasets other than the
ones provided.
• Supplementary libraries such as pandas and numpy may be used. However, do not use
any features directly related to the algorithms in question.
• Submissions will be checked for plagiarism using MOSS. Plagiarized work will result
in a 0 for the assignment and possible disciplinary action.