Page 1 of 3
COSC 4570/5010 Data Mining
Homework #1
Submission guideline: Submit exactly one .zip file, named “YourNetID_Homework1.zip”
(replace YourNetID with your own Net ID).
1. Problems from the book
Solve Problems 2, 5, 13, 15, and 16 from Chapter 2.
2. Sampling
• When is sampling with replacement appropriate, and when is sampling without
replacement preferable? Provide two examples where each is more appropriate.
• Samples obtained uniformly at random can miss anomalies (underrepresented data
points) in datasets. How can we sample more systematically using PCA?
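The mechanical difference between the two sampling schemes can be sketched in a few lines of Python. This is only an illustration, assuming a toy population of ten integer IDs and a fixed seed (both hypothetical); it does not address when each scheme is appropriate.

```python
import random

# Hypothetical toy population: ten record IDs (an assumption for illustration).
population = list(range(10))
rng = random.Random(0)  # fixed seed so the run is repeatable

# With replacement: each draw is independent, so an ID may appear more than
# once and the sample may be larger than the population.
with_replacement = rng.choices(population, k=15)

# Without replacement: every ID appears at most once, so k cannot exceed
# the population size.
without_replacement = rng.sample(population, k=5)

print("with replacement:   ", with_replacement)
print("without replacement:", without_replacement)
```

Drawing 15 items from 10 with replacement necessarily repeats some IDs, while the sample drawn without replacement never does.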
3. Curse of Dimensionality
Due to the curse of dimensionality, distances become less meaningful in high-dimensional spaces. That
is, the minimum distance between random pairs of nodes becomes very close to the maximum
distance between such pairs. In this problem, you are going to verify this phenomenon.
Write a program in Java, C, C++, Python, or MATLAB that
a. Generates n d-dimensional random points.
b. Computes the maximum and minimum distance using Euclidean distance between all $\binom{n}{2}$
pairs of nodes (i.e., $O(n^2)$ distance computations). Denote these values as $\max(d, n)$ and $\min(d, n)$,
respectively.
c. Computes $r(d, n) = \log \frac{\max(d, n) - \min(d, n)}{\min(d, n)}$.
Change n in the range $100 \le n \le 1{,}000$ and d in the range $1 \le d \le 100$, and assume that
feature values are in the range [0,100] (or some other fixed range). Compute $r(d, n)$ using
your program and plot the 3-D surface of $r(d, n)$ in MATLAB or your programming
language of choice.
How does the surface change with respect to n? Perform the same experiment, but this
time using �F norm for computing distances (Book, Page 70) and plot the surface. Submit
your code and your two plots.
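Steps a to c can be sketched for a single pair of n and d as follows. This is a minimal sketch under stated assumptions, not a full solution: the function name `distance_contrast`, the fixed seed, the uniform sampling, and the base-10 logarithm are all assumptions, and the sweep over n and d, the second norm, and the surface plots are left out.

```python
import itertools
import math
import random

def distance_contrast(n, d, low=0.0, high=100.0, seed=0):
    """Return log10((max - min) / min) over all pairwise Euclidean distances."""
    rng = random.Random(seed)
    # Step a: n points with d features drawn uniformly from [low, high].
    points = [[rng.uniform(low, high) for _ in range(d)] for _ in range(n)]
    # Step b: maximum and minimum distance over all n*(n-1)/2 pairs.
    dists = [math.dist(p, q) for p, q in itertools.combinations(points, 2)]
    d_max, d_min = max(dists), min(dists)
    # Step c: relative contrast on a log scale (base 10 is an assumption).
    return math.log10((d_max - d_min) / d_min)

for d in (1, 10, 100):
    print(d, round(distance_contrast(100, d), 3))
```

`math.dist` requires Python 3.8 or later. The pure-Python loop is adequate for n = 100; at n = 1,000 a vectorized distance computation would likely be preferable.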
4. Weka
Download and install Weka from http://www.cs.waikato.ac.nz/ml/weka/ . The following
tutorial is useful:
https://www.cs.auckland.ac.nz/courses/compsci367s1c/tutorials/IntroductionToWeka.pdf
You can also watch Weka tutorials on YouTube. Get used to Weka and learn how to
modify data in Weka. In particular, learn how to generate ARFF files for Weka from CSV
files (open an ARFF file from the Weka data folder in Notepad and visit
http://www.cs.waikato.ac.nz/ml/weka/arff.html). You can use Weka’s own CSV-to-ARFF
converter, or you can write your own header for ARFF files.
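Writing an ARFF header by hand amounts to declaring a relation name, one @attribute line per column, and a @data section. The sketch below shows one way to do this in Python. The function `csv_to_arff` and the three sample rows are hypothetical, and the code assumes that values contain no spaces, commas, or missing entries.

```python
import csv
import io

def is_number(text):
    """Return True if text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False

def csv_to_arff(csv_text, relation):
    """Return ARFF text for CSV text whose first row holds attribute names."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    names, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    for col, name in enumerate(names):
        values = [row[col] for row in data]
        if all(is_number(v) for v in values):
            lines.append("@attribute %s numeric" % name)
        else:
            # Nominal attribute: list the distinct values in braces.
            labels = sorted(set(values))
            lines.append("@attribute %s {%s}" % (name, ",".join(labels)))
    lines.append("")
    lines.append("@data")
    lines.extend(",".join(row) for row in data)
    return "\n".join(lines) + "\n"

# Made-up toy rows, loosely modeled on a weather dataset.
sample = "outlook,temperature,play\nsunny,85,no\novercast,83,yes\nrainy,70,yes\n"
print(csv_to_arff(sample, "weather"))
```

Columns whose values all parse as numbers are declared numeric; every other column is declared nominal with its distinct values listed.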
Load a dataset from the data folder of Weka (e.g., weka-3-6/data/weather.numeric.arff).
You can always see the current state of your dataset using the “edit” button. For each
of the following, determine how it can be done using Weka and submit the proper
command and parameters as part of your homework. For example, if you want to center
your data using Weka, you need to use the Center filter,
weka.filters.unsupervised.attribute.Center.
• Center data (having zero mean): Center
• Removing attributes 2 to 4
• Removing all attributes but the last
• Reordering attributes 1,2,3,4,5 as 5,4,1,2,3
• Removing instances with missing values
• How is a missing value denoted in an ARFF file?
• What does “visualize all” do?
• Removing all instances where the 3rd feature value is equal to ‘x’.
5. PCA
The best-known dataset in data mining/machine learning is the Iris dataset. Learn about
this dataset at the following URL:
https://archive.ics.uci.edu/ml/datasets/iris.
Download the dataset (in Data Folder > iris.data). The dataset contains the information
gathered on three types of iris plant: Iris Setosa, Iris Versicolour, and Iris Virginica. To
see if different types of Iris plants are distinguishable from one another, we can just
visualize our dataset. The Iris dataset is a four-dimensional dataset; therefore, it cannot
be visualized directly in 2D or 3D. Apply PCA to the dataset and reduce the dimensionality to
two. Plot the dimensionality-reduced dataset. Color data points based on their plant
type (e.g., Iris-setosa can be red, etc.). The plant type is given as the fifth column in the
dataset. The figure should show the type(s) that can be easily distinguished from others.
Which Iris type is the easiest to distinguish from the rest? Submit your code and plot.
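The projection step can be sketched with NumPy as follows. This is one possible approach, not the required one: it computes PCA through a singular value decomposition of the centered data. The function name `pca_project` is hypothetical, and the random matrix is only a stand-in for the four measurement columns of iris.data; loading the file and coloring the scatter plot by plant type are left out.

```python
import numpy as np

def pca_project(X, k=2):
    """Project the rows of X onto the first k principal components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)  # PCA works on zero-mean features
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

# Hypothetical stand-in for the 150 x 4 Iris measurements (an assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
Z = pca_project(X, k=2)
print(Z.shape)
```

The two columns of `Z` are uncorrelated, and the first carries at least as much variance as the second, which is what makes them suitable axes for a 2D scatter plot.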