$35
Page 1 of 3
COSC 4570/5010 Data Mining
Homework #3
Submission guideline You need to submit only one .zip file. Please name the file as “Your Net
id_Homework3.zip”.
1. Problems from the book (Introduction to Data Mining 2nd Edition by Tan,
Steinbach et al.)
Solve the following:
Chapter 4: Problems 16 and 21.
Chapter 7: Problems 7 and 11.
OR
Problems from the book (Introduction to Data Mining 1st Edition by Tan,
Steinbach et al.)
Solve the following:
Chapter 5: Problems 17 and 23.
Chapter 8: Problems 7 and 11.
2. Clustering
a) Given � clusters and their respective cluster sizes �#, �%, … , �', what is the probability
that two random (with replacement) data vectors (from the clustered dataset) belong
to the same cluster?
b) Now assume you are given this probability (you don't have �)'s and �), and the fact
that clusters are equally sized, can you find �? This gives you an idea for predicting
the number of clusters in a dataset.
Page 2 of 3
c) Give an example of a dataset consisting of 4 data vectors where there exist two
different optimal (minimum SSE) 2-means (k-means, k=2) clustering of the dataset.
• Calculate the optimal SSE value for your example.
• In general, how should datasets geometrically look like so that we have more
than one optimal solution?
• What defines the number of optimal solutions?
• This problem provides an example of situations where k-means does not
necessarily con- verge to the same optimal all the time.
3. KDD Cup 2009
A very popular intrusion detection dataset is the KDD Cup 2009 dataset. The dataset was
collected at MIT Lincoln labs for 1998 DARPA Intrusion Detection Evaluation Program. Read
about the dataset and its features (that describe network traffic) here:
http://kdd.ics.uci.edu/databases/kddcup99/task.html
The class attribute value is the network attack type associated with the instance. In this
homework, your task is to perform intrusion detection using classification. You are going to use
the dataset that is uploaded in ARFF format with this homework and Weka to perform the
following:
a) Download the dataset kddcup99.zip here.
b) The dataset has around 500 thousand records. (1) Randomize your dataset and (2) take
a 10% sample of your dataset. Save your sample (“Save” from the “Preprocess” tab).
For this problem to be graded, we need this sample ARFF file, so please submit it with
assignment.
c) Classify your sample using Naive Bayes, Decision Tree Learning (J48 in Weka), and
K-NN (IBk) in Weka. Classify using 10-fold cross validation. Use default parameters
for all, except for IBk (use k=10). Save result buffers (right click on the classifier name
in “Result list”) and submit your three result buffers.
4. Text Clustering (Takes Time!)
Download the fine foods dataset from:
http://snap.stanford.edu/data/web-FineFoods.html
Perform the following:
a) Identify all the unique words that appear in the “review/text” field of the reviews.
Denote the set of such words as �.
b) Remove from � all stop words in “Long Stop word List” from
https://www.ranks.nl/stopwords. Denote the cleaned set as �.
Page 3 of 3
c) Count the number of times each word in � appears among all reviews (“review/text”
field) and identify the top 500 words.
d) Vectorize all reviews (“review/text” field) using these 500 words.
e) Cluster the vectorized reviews into 10 clusters using k-means. You are allowed to use
any program or code for k-means (Weka has k-means too). This will give you 10
centroid vectors.
f) From each centroid, select the top 5 words that represent the centroid (i.e., the words
with the highest feature values)
Submit the following:
1. Top 500 words + counts for these words.
2. The top 5 words representing each cluster and their feature values (50 words + 50
values).
3. IMPORTANT: your code and a step-by-step readme to help reproduce your results. I
should be able to get the same results by running your code and by following your
readme for this problem to get graded.