1 Aim
Cluster the words across instances (documents) to find the keyword combinations that
characterize each cluster/topic in the given dataset, using the methods taught in class
up to Chapter Eleven of the textbook. Besides the necessary data preprocessing (scaling,
normalization, etc.), Homework #3 requires you to practice NLP and clustering. Please also
summarize your observations from the clustering results.
You may apply new methods or use new packages to improve the quality of the clustering, but
if you do so, you must give a brief introduction to the key concepts and provide the necessary
citations rather than simply copying, pasting, or importing. In this assignment you are not
allowed to use any neural-network-related models (e.g., a multilayer perceptron); if any
neural-network-related method is applied, you will receive no credit. Whenever an
algorithm package is merged or imported into your code, list the package link in your
references and describe its mathematical concepts in your report, followed by the reason for
adopting it.
2 Dataset Description
Artificial intelligence has become a hot research area in machine learning. Since most of
the research at Google is application oriented, we are interested in which AI topics are
being investigated in Google’s published research. We provide the dataset,
Google AI Published Research, which was crawled from https://ai.google/research/pubs/ [1].
The dataset offers only the ‘title’ and ‘abstract’ of each publication, concatenated together.
3 Submission Format
You have to submit a compressed file hw3_studentID.zip which contains the following
files:
1. hw3_studentID.ipynb: detailed report, Python code, results, discussion and mathematical descriptions;
2. hw3_studentID.tplx: extra LaTeX-related settings, including the bibliography;
3. hw3_studentID.bib: citations in the "bibtex" format;
4. hw3_studentID.pdf: the PDF version of your report, exported from your ipynb
with
(a) %% jupyter nbconvert --to latex --template hw3_studentID.tplx
hw3_studentID.ipynb
(b) %% pdflatex hw3_studentID.tex
(c) %% bibtex hw3_studentID
(d) %% pdflatex hw3_studentID.tex
(e) %% pdflatex hw3_studentID.tex
5. Other files or folders in a workable path hierarchy relative to your Jupyter notebook (ipynb).
4 Coding Guidelines
For the purpose of the individual demonstration with the TA, you are required to create a function in your Jupyter notebook, as specified below, to reduce the data dimensionality,
learn a classification model, and evaluate the performance of the learned model.
• hw3_studentID_demo(in_x, in_label, mode)
– in_x: [string] CSV file for ‘data’.
– mode: [string] mode=‘preprocessing’ transforms the text instances into
a tokenized word-count matrix $M \in \mathbb{R}^{D \times v}$, which describes
the contents of D documents with v words. Each row represents a document
instance and each column stands for a selected word, so the (i, j)-th entry of M
is the number of times the j-th selected word appears in the i-th document.
M can be computed via CountVectorizer. Please set the matrix M
as global and return M when mode=‘preprocessing’. In addition, please
write the v words as a single column, in the same index order as the columns
of M, to HW3_studentID_words.csv with
header ‘words’.
mode=‘clustering’ builds the models and dumps the clustering results and
some clustering parameters.
In mode=‘clustering’, please output the following CSV files with headers. In the
following, ‘avg_silhouette’ ∈ [−1, 1] is the average of the silhouette scores of all v words in
HW3_studentID_words.csv. Most of the methods below are available in subpackages
of ‘sklearn’.
• KMeans: Please transpose the matrix M.
file 1 ‘HW3_studentID_KMeans.csv’ with header
avg_silhouette, n_clusters
∗ n_clusters: is n_clusters
file 2 ‘HW3_studentID_KMeans_output.csv’: For each topic/cluster, please output the 20
words with the highest silhouette values. If a cluster has fewer than 20 words, please
fill the rest with ‘NA’. The header for this file is
word0,word1,word2,word3,word4,word5,word6,word7,word8,word9,
word10,word11,word12,word13,word14,word15,word16,word17,word18,word19.
• KMeans++: Please transpose the matrix M.
file 1 ‘HW3_studentID_KMeanspp.csv’ with header
avg_silhouette, n_clusters
∗ n_clusters: is n_clusters
file 2 ‘HW3_studentID_KMeanspp_output.csv’: For each topic/cluster, please output
the 20 words with the highest silhouette values. If a cluster has fewer than 20 words, please
fill the rest with ‘NA’. The header for this file is
word0,word1,word2,word3,word4,word5,word6,word7,word8,word9,
word10,word11,word12,word13,word14,word15,word16,word17,word18,word19.
• Fuzzy KMeans: Please transpose the matrix M.
file 1 ‘HW3_studentID_FKMeans.csv’ with header
avg_silhouette, n_clusters, fuzzy_coeff, HW3_silhouette_thr
The notations are described later.
file 2 ‘HW3_studentID_FKMeans_output.csv’: For each topic/cluster, please output the 20
words with the highest silhouette values. If a cluster has fewer than 20 words, please
fill the rest with ‘NA’. The header for this file is
word0,word1,word2,word3,word4,word5,word6,word7,word8,word9,
word10,word11,word12,word13,word14,word15,word16,word17,word18,word19.
• Agglomerative: Please transpose the matrix M.
file 1 ‘HW3_studentID_Agglomerative.csv’ with header
avg_silhouette, n_clusters, affinity, linkage
∗ n_clusters: is n_clusters;
∗ affinity: is affinity;
∗ linkage: is linkage;
file 2 ‘HW3_studentID_Agglomerative_output.csv’: For each topic/cluster, please output the 20 words with the highest silhouette values. If a cluster has fewer than 20 words,
please fill the rest with ‘NA’. The header for this file is
word0,word1,word2,word3,word4,word5,word6,word7,word8,word9,
word10,word11,word12,word13,word14,word15,word16,word17,word18,word19.
• LatentDirichletAllocation (LDA): Please do NOT transpose the matrix M.
file 1 ‘HW3_studentID_LDA.csv’ with header
avg_silhouette, n_clusters, learning_method, HW3_silhouette_thr
∗ n_clusters: is n_components;
∗ learning_method: is learning_method;
∗ HW3_silhouette_thr: is the threshold for the new silhouette score designed
for soft clustering labels. It will be explained later.
file 2 ‘HW3_studentID_LDA_output.csv’: For each topic/cluster, please output the 20 words
with the highest silhouette values. If a cluster has fewer than 20 words, please
fill the rest with ‘NA’. The header for this file is
word0,word1,word2,word3,word4,word5,word6,word7,word8,word9,
word10,word11,word12,word13,word14,word15,word16,word17,word18,word19.
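For the hard-clustering methods, the "20 words with the highest silhouette values, padded with ‘NA’" output can be sketched as below. This is a hedged illustration, not a reference solution: the helper name `top_words_per_cluster`, the random toy matrix, and the choice of `init="random"` to stand for plain KMeans (as opposed to `init="k-means++"`) are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples


def top_words_per_cluster(X, words, labels, n_top=20):
    """Hypothetical helper: one row per cluster with the n_top words of
    highest silhouette value, padded with 'NA'; also returns the average."""
    scores = silhouette_samples(X, labels)      # one score per word
    rows = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]          # words assigned to cluster c
        best = idx[np.argsort(-scores[idx])][:n_top]
        row = [words[i] for i in best]
        rows.append(row + ["NA"] * (n_top - len(row)))
    header = [f"word{i}" for i in range(n_top)]
    return pd.DataFrame(rows, columns=header), scores.mean()


# Toy count matrix standing in for M (D = 30 documents, v = 12 words).
rng = np.random.default_rng(0)
M = rng.integers(0, 5, size=(30, 12))
words = [f"w{i}" for i in range(M.shape[1])]

X = M.T                                         # words become the samples
labels = KMeans(n_clusters=3, init="random", n_init=10,
                random_state=0).fit_predict(X)
table, avg_silhouette = top_words_per_cluster(X, words, labels)
```

The resulting `table` could then be written with `table.to_csv(..., index=False)` under the file name required for the method at hand.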
Each method accounts for 20% of the demonstration grade. Please note that every method
should write ‘HW3_studentID_{method}.csv’ using the highest average silhouette value, or the
elbow of the average silhouette value in the plot of avg_silhouette vs. n_clusters, together with the
corresponding ‘n_clusters’. Note that
method ∈ {LDA, Agglomerative, KMeans, KMeanspp, FKMeans}.
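Selecting ‘n_clusters’ by the highest average silhouette value can be sketched as follows. The synthetic three-blob data and the candidate range 2 to 6 are assumptions chosen only so the example is self-contained; the elbow criterion mentioned above would instead require inspecting the plot.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three well-separated blobs as toy "word vectors" (hypothetical data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 5))
               for c in (0.0, 5.0, 10.0)])

# Average silhouette for each candidate number of clusters.
results = []
for n_clusters in range(2, 7):
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(X)
    results.append((silhouette_score(X, labels), n_clusters))

# Keep the candidate with the highest average silhouette.
best_avg_silhouette, best_n_clusters = max(results)

# One-row summary in the layout of the per-method CSV file.
summary = pd.DataFrame([{"avg_silhouette": best_avg_silhouette,
                         "n_clusters": best_n_clusters}])
```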
In this homework assignment and demonstration, we need to install a fuzzy
KMeans package and add an extra function for evaluating the silhouette score for the soft clustering labels
in Fuzzy KMeans and Latent Dirichlet Allocation.
• For Fuzzy KMeans, please install sklearn_extensions [2] with
pip install sklearn_extensions --upgrade
and use it as follows:
from sklearn_extensions.fuzzy_kmeans import FuzzyKMeans
import numpy as np
fuzzy_kmeans = FuzzyKMeans(k=n_clusters, m=fuzzy_coeff)
fuzzy_kmeans_model = fuzzy_kmeans.fit(np.transpose(M))
soft_cluster_label = fuzzy_kmeans_model.fuzzy_labels_
In the fuzzy KMeans method,
– n_clusters: is k;
– fuzzy_coeff: is m;
– HW3_silhouette_thr: is the threshold for the new silhouette score designed for
soft clustering labels. It will be explained later.
• In order to evaluate the quality of LDA and Fuzzy KMeans for different numbers
of topics/clusters, we estimate their silhouette score as follows. Assume there are v
words and k topics/clusters. Let $Y \in \mathbb{R}^{v \times k}$ be the matrix of soft cluster labels, where
each row of Y sums to one. Let s be the threshold used to select elements of Y:
$$\tilde{Y}_{ij} = \begin{cases} Y_{ij}, & \text{if } Y_{ij} \ge s \\ Y_{ij}, & \text{if } \max_{j} Y_{ij} \le s \\ 0, & \text{otherwise.} \end{cases}$$
Note that when $\max_{j} Y_{ij} \le s$, no cluster dominates the soft clustering labels
of the word $w_i$, so we keep its original values. Let $\hat{Y}$ be the normalized
$\tilde{Y}$, in which each row sums to one. Then we can compute the
parameters of the silhouette score for each word $w_i$, $i \in \{1, \dots, v\}$, under the j-th topic/cluster
as
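$$a_{ij} = \frac{\sum_{r=1}^{v} \hat{Y}_{rj}\, d(w_i, w_r)}{\sum_{r=1}^{v} \hat{Y}_{rj}}, \qquad b_{ij} = \min_{\alpha \in \{1, \dots, k\},\ \alpha \neq j} \frac{\sum_{r=1}^{v} \hat{Y}_{r\alpha}\, d(w_i, w_r)}{\sum_{r=1}^{v} \hat{Y}_{r\alpha}},$$
where $d(w_i, w_r)$ is the metric between word $w_i$ and word $w_r$.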
Hence the silhouette score $\xi_i$ for the word $w_i$ is
$$\xi_i = \sum_{j=1}^{k} \hat{Y}_{ij}\, \frac{b_{ij} - a_{ij}}{\max\{b_{ij}, a_{ij}\}},$$
which also lies in the range $[-1, 1]$.
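The soft silhouette score defined above can be sketched in NumPy as follows. This is an illustrative reading of the formulas, not an official implementation: the function name `soft_silhouette` is hypothetical, the soft-label matrix is assumed to be stored with one row per word and one column per cluster, the metric d is assumed to be Euclidean, and every cluster is assumed to keep some mass after thresholding (otherwise the denominators would be zero).

```python
import numpy as np
from scipy.spatial.distance import cdist


def soft_silhouette(X, Y, s):
    """X: (v, n_features) word vectors; Y: (v, k) soft labels whose rows
    sum to one; s: threshold. Returns the score xi_i for every word."""
    Y = np.asarray(Y, dtype=float)

    # Keep entries >= s; keep a whole row if its maximum is <= s; zero the rest.
    keep = (Y >= s) | (Y.max(axis=1, keepdims=True) <= s)
    Y_tilde = np.where(keep, Y, 0.0)

    # Normalise so that each row sums to one again.
    Y_hat = Y_tilde / Y_tilde.sum(axis=1, keepdims=True)

    # Pairwise distances d(w_i, w_r); Euclidean is an assumption here.
    D = cdist(X, X)

    # mean_dist[i, j] = sum_r Y_hat[r, j] * d(w_i, w_r) / sum_r Y_hat[r, j]
    mean_dist = (D @ Y_hat) / Y_hat.sum(axis=0)

    xi = np.zeros(len(Y))
    for j in range(Y.shape[1]):
        a = mean_dist[:, j]                              # a_ij
        b = np.delete(mean_dist, j, axis=1).min(axis=1)  # b_ij
        xi += Y_hat[:, j] * (b - a) / np.maximum(b, a)
    return xi


# Toy example: two tight, well-separated groups with fairly confident labels.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
Y = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
xi = soft_silhouette(X, Y, s=0.5)
avg_silhouette = xi.mean()
```

With s = 0.5, each toy row keeps only its dominant entry, so the normalized labels become crisp and the scores are close to those of an ordinary silhouette.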
5 Report Requirement
• List the names of the packages used in your program;
• Provide a flowchart for the preprocessing;
• Compare the results among the 5 methods;
• Describe your observations and conclusions;
• Describe the mathematical concepts of any new algorithms or models employed as
well as the roles they play in your feature selection/extraction or classification task in
Markdown cells [3].
5.1 Basic Requirement
• Implement the five methods after the preprocessing is finished.
• Based on the average of the silhouette scores, decide the number of clusters. The
grading will also refer to this value.
• Please make sure hw3_studentID_demo is functional and can output the required
files in both mode=‘preprocessing’ and mode=‘clustering’.
• If you apply new methods or use new packages to improve the classification performance, you have to give a brief introduction to the key concepts and provide the necessary
citations/links, rather than simply copying, pasting, or importing.
• Please submit your ‘report’ in English. Be aware that a ‘report’ is much more than a
‘program.’
References
[1] Google AI published research. https://ai.google/research/pubs/. Accessed: 2018-05-17.
[2] Fuzzy kmeans from the third party (open source). http://wdm0006.github.io/sklearn-extensions/fuzzy_k_means.html. Accessed: 2018-05-22.
[3] Markdown. https://daringfireball.net/projects/markdown/basics. Accessed: 2018-03-29.