$30
Assignment 3: Clustering
Four datasets (Iris, YeastGene, Example, Utilities) can be found on Piazza. In each dataset, each row corresponds to an object and each column corresponds to an attribute. The attribute values are comma-separated. You don’t need to do normalization for any of the datasets. The original Utilities dataset (Utilities_Unnormalized.xls) is un-normalized, but it is provided just for you to know the meaning of the clusters. DO NOT use the un-normalized version in this project.
The initial centroids to be used in K-means for Iris and YeastGene datasets are provided. In each file, each row denotes the initial centroid of a cluster.
In this assignment, you are asked to implement K-means algorithm and Agglomerative algorithm (using Min to define inter-cluster distance).
Please take the following steps:
1. Implement K-means algorithm as follows:
Repeat T times:
• For each object xi
Calculate Euclidean distance between xi and each of the K centroids
Assign xi to the cluster whose centroid is the closest to xi
• For each cluster
Calculate its centroid as the mean of all the objects in that cluster
2. Implement Hierarchical clustering algorithm (with Min as inter-cluster distance definition):
• Obtain the distance matrix by computing Euclidean distance between each pair of objects
• Let each object be a cluster (Assign the cluster index as 1 to N, N—the number of objects)
• Set the current index as T=N+1
• Repeat
Find the smallest entry in the distance matrix—suppose the entry is i-th row and j-th column
Merge the clusters that correspond to the i-th row and j-th column of the distance matrix as a new cluster with index T
Remove the rows and columns of the two old clusters and add new row and column for the new cluster to the distance matrix by computing the distance between the new cluster and each of the remaining clusters
T=T+1
• Until only one cluster remains
3. Test your K-means on Iris dataset. Use the provided initial centroids, and set the number of iterations (T) as 12, and the number of clusters (K) as 3. The final cluster centroids should be:
Cluster 1: 5.006,3.418,1.464,0.244
Cluster 2: 6.8538,3.0769,5.7154,2.0538
Cluster 3: 5.8836,2.741,4.3885,1.4344
4. Apply the PCA and plotting functions you used in Assignment 1 to project Iris data to 2 dimensions and draw the scatter plot. Use different colors for different clusters in the scatter plot. The plot should look like the one below:
chr
5. If you get the correct final cluster centroids and the plot, then repeat steps 3 and 4 on the YeastGene dataset. Use the provided initial centroids, and set the number of iterations (T) as 7, and the number of clusters (K) as 6.
6. Apply Hierarchical Clustering algorithm implemented in Step 2 on the Example dataset. The order of the merging should be:
1 2 7
3 7 8
4 5 9
6 9 10
8 10 11
Each row here denotes an operating of merging two clusters to form a new cluster. The first two indices denote the clusters to be merged, and the last one denotes the index of the new cluster.
7. If you get the correct order in Step 6, then apply the hierarchical clustering algorithm on the Utilities dataset.
8. Prepare your submission. Your final submission should be a zip file named as First Name_Last Name.zip (e.g., Jing_Gao.zip). In the zip file, you should include:
• A folder “Code”, which contains all the codes used in this assignment. Inside the folder, please have a file “README” which describes how to run your code.
• Report: A doc or pdf file named as First Name_Last Name.doc or First Name_Last Name.pdf The report should consist of the following parts: 1) The cluster centroids obtained on YeastGene dataset after the T iterations. 2) The scatter plot obtained on YeastGene dataset after applying PCA and coloring points by different colors. 3) The order of merging in the hierarchical clustering on the Utilities dataset. 4) The codes of your K-means and hierarchical clustering algorithm implementation.
9. Log in any CSE department server and submit your zip file as follows:
submit_cse469 Jing_Gao.zip
(replace “Jing_Gao” by your name)
Please refer to Course Syllabus for late submission policy. We will take the submission time recorded by the server as the time of your submission.