EE219 Project 4
Clustering

Introduction:
Clustering algorithms are unsupervised methods for finding groups of data points that have
similar representations in a proper space. Clustering differs from classification in that no a priori
labeling (grouping) of the data points is available. K-means clustering, for example, iteratively
groups data points into regions characterized by a set of cluster centroids, with each data point
assigned to the cluster whose centroid is nearest. In this project the goal is to find proper
representations of the data and evaluate the performance of clustering algorithms.
Dataset:
We work with the “20 Newsgroups” dataset that we already explored in project 2. It is a collection
of approximately 20,000 documents, partitioned (nearly) evenly across 20 different newsgroups,
each corresponding to a different topic. Each topic can be viewed as a “class”.
In order to define the clustering task, we pretend that the class labels are not available and aim
to find groupings of the documents, where documents in each group are more similar to each
other than to those in other groups. These clusters capture the dependencies among the
documents that are otherwise known through the class labels. We then use the class labels as
ground truth to evaluate the performance of the clustering task.
In order to get started with a simple clustering task, we work with a well-separable portion of the
data that we used in project 2, and see if we can retrieve the known classes. Namely, let us take
all the documents in the following classes:
Class 1: Computer technology
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
Class 2: Recreational activity
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
We would like to evaluate how purely the a priori known classes can be reconstructed through
clustering.
Problem Statement:
1) Finding a good representation of the data is fundamental to the task of clustering. Following
the steps in project 2, transform the documents into TF-IDF vectors.
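A minimal sketch of this step is given below, assuming scikit-learn's 20-newsgroups loader and
TF-IDF vectorizer; the min_df value and stop-word setting are illustrative choices, not prescribed
by the assignment:

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer

    # The 8 newsgroups of the two coarse classes defined above.
    categories = ['comp.graphics', 'comp.os.ms-windows.misc',
                  'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware',
                  'rec.autos', 'rec.motorcycles',
                  'rec.sport.baseball', 'rec.sport.hockey']

    dataset = fetch_20newsgroups(subset='all', categories=categories,
                                 shuffle=True, random_state=42)

    vectorizer = TfidfVectorizer(min_df=3, stop_words='english')
    X_tfidf = vectorizer.fit_transform(dataset.data)  # sparse (n_docs x n_terms)

    # Collapse the 8 newsgroup labels into the 2 coarse classes: with the
    # categories sorted alphabetically, targets 0-3 are "comp", 4-7 are "rec".
    labels = [0 if t < 4 else 1 for t in dataset.target]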
2) Apply K-means clustering with k=2. Compare the clustering results with the known class
labels. Inspect the confusion matrix to evaluate how well your clusters match the ground truth
labels. Is there a permutation of the rows that makes the confusion matrix look almost diagonal?
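A sketch of this step, continuing from the TF-IDF sketch above (X_tfidf and labels are assumed
from there); the n_init and random_state values are illustrative:

    from sklearn.cluster import KMeans
    from sklearn.metrics import confusion_matrix

    km = KMeans(n_clusters=2, n_init=30, random_state=0)
    cluster_ids = km.fit_predict(X_tfidf)

    # Rows are true classes, columns are cluster ids; since cluster numbering
    # is arbitrary, also check the column-swapped ordering for a near-diagonal
    # pattern.
    print(confusion_matrix(labels, cluster_ids))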
In order to make a concrete comparison of different clustering results, there are various
measures of the purity of a given partitioning of the data points with respect to the ground truth.
The measures we examine in this project are the homogeneity score, completeness score,
adjusted Rand score and adjusted mutual information score. Homogeneity measures how purely
each cluster contains only data points that belong to a single class. Completeness, on the other
hand, is satisfied if all data points that belong to a given class are assigned to the same cluster.
Both of these scores span between 0 and 1, where 1 stands for perfect clustering. The Rand
Index is similar to an accuracy measure: it computes the similarity between the clustering labels
and the ground truth labels by counting all pairs of points that either fall in the same cluster and
the same class, or in different clusters and different classes. Finally, the adjusted mutual
information score measures the mutual information between the cluster label distribution and
the ground truth label distribution.
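All four measures are available in scikit-learn; a minimal sketch, assuming the labels and
cluster_ids variables from the sketches above:

    from sklearn import metrics

    print('Homogeneity:          %.3f' % metrics.homogeneity_score(labels, cluster_ids))
    print('Completeness:         %.3f' % metrics.completeness_score(labels, cluster_ids))
    print('Adjusted Rand score:  %.3f' % metrics.adjusted_rand_score(labels, cluster_ids))
    print('Adjusted mutual info: %.3f' % metrics.adjusted_mutual_info_score(labels, cluster_ids))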
3) As you observed, high-dimensional sparse TF-IDF vectors do not yield good clustering
performance. In this part we try to find a “better” representation tailored to how the clustering
algorithm works. Since K-means clustering minimizes within-cluster distances in the Euclidean
(l2-norm) sense, it turns out that we need to reduce the dimension of the data properly.
We will use Latent Semantic Indexing (LSI) and Non-negative Matrix Factorization (NMF), which
you are already familiar with, for dimensionality reduction. In order to get a good initial guess for
an appropriate dimensionality to feed into the K-means algorithm, find the effective dimension of
the data through inspection of the top singular values of the TF-IDF matrix, and see how many of
them are significant in reconstructing the matrix with the truncated SVD representation (see the
sketch below). You can then change the dimensionality and choose one that yields better results
in terms of the clustering purity metrics.
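One way to inspect the top singular values, assuming X_tfidf from the step-1 sketch; the upper
bound of 1000 components is an illustrative choice (it must stay below the number of TF-IDF
features):

    import matplotlib.pyplot as plt
    from sklearn.decomposition import TruncatedSVD

    svd = TruncatedSVD(n_components=1000, random_state=0)
    svd.fit(X_tfidf)

    # Look for the "knee" in the decay of the singular values.
    plt.plot(svd.singular_values_)
    plt.xlabel('component index')
    plt.ylabel('singular value')
    plt.show()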
Now, use the following two methods for reducing the dimension of the data by sweeping over
the dimension parameter in each (you should work with values as low as 2-3 and up to the
effective dimension you found above); a sketch of such a sweep follows the list.
• Truncated SVD (LSI) / PCA
• NMF
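An illustrative sweep, reusing X_tfidf and labels from the earlier sketches; the grid of r values is
an assumption to be adjusted toward the effective dimension found above:

    from sklearn.cluster import KMeans
    from sklearn.decomposition import NMF, TruncatedSVD
    from sklearn import metrics

    for r in [2, 3, 5, 10, 20, 50, 100, 300]:
        X_lsi = TruncatedSVD(n_components=r, random_state=0).fit_transform(X_tfidf)
        X_nmf = NMF(n_components=r, random_state=0).fit_transform(X_tfidf)
        for name, X_r in [('LSI', X_lsi), ('NMF', X_nmf)]:
            ids = KMeans(n_clusters=2, n_init=30, random_state=0).fit_predict(X_r)
            print('%s r=%d  homogeneity=%.3f  ARI=%.3f' % (
                name, r,
                metrics.homogeneity_score(labels, ids),
                metrics.adjusted_rand_score(labels, ids)))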
Again, if the clustering purity is not satisfactory, try to find a better representation of the data
(a sketch of both refinements follows the list) by
• Normalizing features
• Applying some non-linear transformation on the data vectors [after reducing
dimensionality].
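A sketch of both refinements, under the assumption that X_nmf holds a reduced NMF
representation from the sweep above; the offset c is a hypothetical choice to keep the logarithm
finite at zero:

    import numpy as np
    from sklearn.preprocessing import scale

    # Normalize features to zero mean and unit variance.
    X_norm = scale(X_nmf)

    # Non-linear transform applied after dimensionality reduction; NMF output
    # is non-negative, so log(x + c) is well defined.
    c = 0.01  # hypothetical offset; tune as needed
    X_log = np.log(X_nmf + c)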
To get a visual sense of the NMF embedding of the data, try applying NMF to the data matrix
with the target dimension set to 2 and plot the resulting points to choose an appropriate non-linear
transformation. Can you justify why the logarithm is a good candidate for your TF-IDF data?
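A sketch of this visual check, reusing X_tfidf and labels from the earlier sketches; each document
is colored by its true class, and the shape of the scatter can guide the choice of transformation:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import NMF

    W = NMF(n_components=2, random_state=0).fit_transform(X_tfidf)
    y = np.asarray(labels)

    plt.scatter(W[y == 0, 0], W[y == 0, 1], s=5, label='comp')
    plt.scatter(W[y == 1, 0], W[y == 1, 1], s=5, label='rec')
    plt.legend()
    plt.show()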
Report the measures of purity introduced in part 2 for the best final data representation you
use.
4) Visualize the performance of your clustering by projecting final data vectors onto 2
dimensions and color-coding the classes. Can you justify why a non-linear transform is useful?
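A minimal sketch, assuming a hypothetical X_final holding your best reduced-and-transformed
representation, plus the labels and cluster_ids variables as before (if X_final already has 2
dimensions, plot it directly instead of projecting):

    import matplotlib.pyplot as plt
    from sklearn.decomposition import TruncatedSVD

    # X_final is a placeholder for your final feature matrix.
    X_2d = TruncatedSVD(n_components=2, random_state=0).fit_transform(X_final)

    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    axes[0].scatter(X_2d[:, 0], X_2d[:, 1], c=labels, s=5)       # ground truth
    axes[0].set_title('true classes')
    axes[1].scatter(X_2d[:, 0], X_2d[:, 1], c=cluster_ids, s=5)  # K-means result
    axes[1].set_title('cluster labels')
    plt.show()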
5) In this part we want to examine how purely we can retrieve all 20 of the original sub-class labels
with clustering. Therefore, we need to include all the documents and the corresponding terms in
the data matrix and find a proper representation through reducing the dimension of the TF-IDF
representation. In doing so, try different effective ambient space dimensions for both the truncated
SVD and NMF dimensionality reduction techniques, and the different transformations of the
obtained feature vectors as outlined above. Then, find an appropriate parameter k to use in
K-means clustering in order to find pure clusters with respect to the class labels (see the sketch
below).
Report all purity measures described earlier for the final representation that you find for your
data.
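An illustrative end-to-end sketch for this part; the reduced dimension r=50 and the grid of k
values are assumptions to be tuned as described above:

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.cluster import KMeans
    from sklearn import metrics

    full = fetch_20newsgroups(subset='all', shuffle=True, random_state=42)
    X_full = TfidfVectorizer(min_df=3, stop_words='english').fit_transform(full.data)
    X_red = TruncatedSVD(n_components=50, random_state=0).fit_transform(X_full)

    for k in [10, 15, 20, 25, 30]:
        ids = KMeans(n_clusters=k, n_init=30, random_state=0).fit_predict(X_red)
        print('k=%d  homogeneity=%.3f  AMI=%.3f' % (
            k,
            metrics.homogeneity_score(full.target, ids),
            metrics.adjusted_mutual_info_score(full.target, ids)))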
6) Evaluate the performance of your clustering in retrieving the topic-wise classes listed below.
Note that, again, you need to find a proper representation of your data through dimensionality
reduction and feature transformation. (A sketch mapping the 20 newsgroups onto these 6 classes
follows the list.)
Class 1: Computer technology
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
Class 2: Recreational activity
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
Class 3: Science
sci.crypt
sci.electronics
sci.med
sci.space
Class 4: Miscellaneous
misc.forsale
Class 5: Politics
talk.politics.misc
talk.politics.guns
talk.politics.mideast
Class 6: Religion
talk.religion.misc
alt.atheism
soc.religion.christian
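As announced above, a plain mapping from the 20 newsgroup names onto these 6 coarse classes,
usable as ground truth for this part (class numbering as in the list; full is the dataset object from
the part-5 sketch):

    class_of = {
        'comp.graphics': 1, 'comp.os.ms-windows.misc': 1,
        'comp.sys.ibm.pc.hardware': 1, 'comp.sys.mac.hardware': 1,
        'comp.windows.x': 1,
        'rec.autos': 2, 'rec.motorcycles': 2,
        'rec.sport.baseball': 2, 'rec.sport.hockey': 2,
        'sci.crypt': 3, 'sci.electronics': 3, 'sci.med': 3, 'sci.space': 3,
        'misc.forsale': 4,
        'talk.politics.misc': 5, 'talk.politics.guns': 5,
        'talk.politics.mideast': 5,
        'talk.religion.misc': 6, 'alt.atheism': 6,
        'soc.religion.christian': 6,
    }

    # Coarse ground-truth labels for the documents loaded in the part-5 sketch.
    coarse_labels = [class_of[full.target_names[t]] for t in full.target]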
Submission: Please submit a zip file containing your report and your code, with a readme
file on how to run your code, to ee219.winter2017@gmail.com. The zip file should be named
"Project4_UID1_UID2_..._UIDn.zip", where UIDx are the student ID numbers of the team members.
If you have any questions, you can send an email to the same address.
