COSI 126A Homework 0

Problem 1 (9 points)
Discuss whether or not each of the following activities is a data mining task.
(A) Dividing the customers of a company according to their gender.
(B) Dividing the customers of a company according to their profitability.
(C) Computing the total sales of a company.
(D) Sorting a student database based on student identification numbers.
(E) Predicting the outcomes of tossing a fair pair of dice.
(F) Predicting the future stock price of a company using historical records.
(G) Monitoring the heart rate of a patient for abnormalities.
(H) Monitoring seismic waves for earthquake activities.
(I) Extracting the frequencies of a sound wave.
Problem 2 (10 points)
Suppose that you are employed as a data mining consultant for an Internet search engine
company. Describe how data mining can help the company by giving specific examples
of how techniques, such as clustering, classification, association rule mining, and anomaly
detection can be applied.
Problem 3 (10 points)
For each of the following data sets, explain whether or not data privacy is an important
issue.
(A) Census data collected from 1900-1950.
(B) IP addresses and visit times of Web users who visit your Website.
(C) Images from Earth-orbiting satellites.
(D) Names and addresses of people from the telephone book.
(E) Names and email addresses collected from the Web.
Problem 4 (15 points)
Given the matrix
$$A = \begin{pmatrix} 1 & 2 & 3 \\ 2 & 4 & 6 \\ 3 & 6 & 9 \end{pmatrix},$$
calculate $A^{-1}$, $A^{+}$, and $A^{100}$.
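One way to sanity-check hand calculations for this problem is sketched below in pure Python. It assumes that A+ denotes the Moore-Penrose pseudoinverse, which the problem does not state explicitly. The helper names (`matmul`, `det3`, `matpow`) are illustrative only. The sketch uses the observation that A = v v^T with v = (1, 2, 3), so A has rank 1 and A^2 = 14 A.

```python
from fractions import Fraction

A = [[1, 2, 3], [2, 4, 6], [3, 6, 9]]

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def det3(M):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = M
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def matpow(M, n):
    """Compute M**n by repeated squaring with exact Python integers."""
    size = len(M)
    result = [[int(r == c) for c in range(size)] for r in range(size)]
    while n:
        if n & 1:
            result = matmul(result, M)
        M = matmul(M, M)
        n >>= 1
    return result

# A = v v^T with v = (1, 2, 3), so v.v = 14.
v = [1, 2, 3]
vv = sum(x * x for x in v)

# det(A) = 0, so the ordinary inverse does not exist.
det_A = det3(A)

# Candidate pseudoinverse (ASSUMPTION: A+ is Moore-Penrose): A / (v.v)^2.
A_pinv = [[Fraction(x, vv * vv) for x in row] for row in A]

# Candidate closed form for the power: A^n = (v.v)^(n-1) * A.
A_100 = [[vv ** 99 * x for x in row] for row in A]

# Check two Penrose conditions and the closed form against brute force.
penrose_1 = matmul(matmul(A, A_pinv), A) == A
penrose_2 = matmul(matmul(A_pinv, A), A_pinv) == A_pinv
power_ok = matpow(A, 100) == A_100
```

Exact `Fraction` and integer arithmetic is used instead of floating point because the entries of A^100 are far too large for a 64-bit float to represent exactly.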
Problem 5 (14 points)
Assume there are three students, X, Y, and Z, and exactly one of them gets an A+. X asks the
teacher whether he got the A+. The teacher refuses to tell X his score and instead says that
Y did not get the A+. Calculate P(Z gets A+).
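The answer depends on how the teacher chooses what to say, which the problem leaves open. The sketch below applies Bayes' rule under one common reading, stated here as an assumption: each student is equally likely to have the A+, the teacher always names one of Y or Z who did not get it, and the teacher picks between them at random when X is the winner. The dictionary names are illustrative only.

```python
from fractions import Fraction

# Prior: each student is equally likely to hold the single A+.
prior = {"X": Fraction(1, 3), "Y": Fraction(1, 3), "Z": Fraction(1, 3)}

# ASSUMPTION: P(teacher says "Y did not get A+" | who holds the A+).
# If X holds it, the teacher names Y or Z at random; if Y holds it, the
# teacher cannot name Y; if Z holds it, the teacher must name Y.
says_y = {"X": Fraction(1, 2), "Y": Fraction(0), "Z": Fraction(1)}

# Bayes' rule: posterior is proportional to prior times likelihood.
evidence = sum(prior[w] * says_y[w] for w in prior)
posterior = {w: prior[w] * says_y[w] / evidence for w in prior}

print(posterior["Z"])
```

Changing the `says_y` likelihoods models a different teacher. For example, setting all three to the probability that a teacher would volunteer this statement regardless of who asked gives a different posterior.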
Problem 6 (14 points)
There are two kinds of products in a warehouse, A and B. 70% of the products are A and 30%
are B. The probability that a product of type A is substandard is P(A = sub) = 2.5%; for B, it is
P(B = sub) = 5%. The warehouse tests 4 products and one of them is substandard. What is
the probability that this product is from A, P(this sub from A)?
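A minimal sketch of the Bayes' rule arithmetic follows. It assumes the four tested products are drawn independently, so the type of the one substandard product depends only on the fact that it is substandard. The variable names are illustrative only.

```python
from fractions import Fraction

# Shares of each product type in the warehouse.
p_a = Fraction(70, 100)
p_b = Fraction(30, 100)

# Probability that a product of each type is substandard.
p_sub_given_a = Fraction(25, 1000)  # 2.5%
p_sub_given_b = Fraction(5, 100)    # 5%

# Total probability that a randomly drawn product is substandard.
p_sub = p_a * p_sub_given_a + p_b * p_sub_given_b

# Bayes' rule: P(from A | substandard).
p_a_given_sub = p_a * p_sub_given_a / p_sub

print(p_a_given_sub, float(p_a_given_sub))
```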
Problem 7 (14 points)
Calculate the similarity matrix between 9 planets. The data of planets is in Table 1.
You can use $s(p_1, p_2) = \sqrt{a_0 (d_1 - d_2)^2 + a_1 (r_1 - r_2)^2 + a_2 (m_1 - m_2)^2}$ as the metric,
where $a_0 = 3.5 \times 10^{-7}$, $a_1 = 1.6 \times 10^{-5}$, $a_2 = 1.1 \times 10^{-27}$.
Set a threshold to separate the 9 planets into different groups. What is the relationship between
the threshold and the groups?
Table 1: Data of Nine Planets
Planet (p)   Distance to Sun (km), d   Radius (km), r   Mass (kg), m
Jupiter      778000                    71492            1.90e27
Saturn       1429000                   60268            5.69e26
Uranus       2870990                   25559            8.69e25
Neptune      4504300                   24764            1.02e26
Earth        149600                    6378             5.98e24
Venus        108200                    6052             4.87e24
Mars         227940                    3398             6.42e23
Mercury      57910                     2439             3.30e23
Pluto        5913520                   1160             1.32e22
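The metric and the thresholding step for Problem 7 can be sketched as follows, using the values from Table 1 as given. Since s is a weighted Euclidean distance, smaller values mean more similar planets. The grouping rule is an assumption, because the problem does not fix one: two planets are linked when s is at most the threshold, and each group is a connected component of those links. The names `PLANETS`, `MATRIX`, and `groups` are illustrative only.

```python
import math

# Table 1 rows: name -> (distance to Sun d, radius r, mass m), units as given.
PLANETS = {
    "Jupiter": (778000, 71492, 1.90e27),
    "Saturn": (1429000, 60268, 5.69e26),
    "Uranus": (2870990, 25559, 8.69e25),
    "Neptune": (4504300, 24764, 1.02e26),
    "Earth": (149600, 6378, 5.98e24),
    "Venus": (108200, 6052, 4.87e24),
    "Mars": (227940, 3398, 6.42e23),
    "Mercury": (57910, 2439, 3.30e23),
    "Pluto": (5913520, 1160, 1.32e22),
}
A0, A1, A2 = 3.5e-7, 1.6e-5, 1.1e-27
NAMES = list(PLANETS)

def s(p1, p2):
    """Weighted Euclidean distance between two planets, as in the problem."""
    d1, r1, m1 = PLANETS[p1]
    d2, r2, m2 = PLANETS[p2]
    return math.sqrt(
        A0 * (d1 - d2) ** 2 + A1 * (r1 - r2) ** 2 + A2 * (m1 - m2) ** 2
    )

# 9 x 9 matrix of pairwise values, rows and columns in Table 1 order.
MATRIX = [[s(p, q) for q in NAMES] for p in NAMES]

def groups(threshold):
    """ASSUMED rule: link planets with s <= threshold, return the components."""
    unvisited = set(NAMES)
    result = []
    for start in NAMES:
        if start not in unvisited:
            continue
        unvisited.discard(start)
        component, stack = [], [start]
        while stack:
            p = stack.pop()
            component.append(p)
            linked = [q for q in NAMES if q in unvisited and s(p, q) <= threshold]
            for q in linked:
                unvisited.discard(q)
            stack.extend(linked)
        result.append(component)
    return result
```

Calling `groups(t)` for increasing values of `t` shows the trend under this linking rule: raising the threshold can only merge groups, so the number of groups never increases.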
Problem 8 (14 points)
Given N documents, write a Python program to find the most frequent
1. <word>
2. <word1, word2>
3. <word1, word2, word3>
For example, with D1 = {aa aa a aaa}, D2 = {aa aa aaa}, D3 = {aaa}, the most frequent <word> is
<aaa>, whose frequency is 3; the most frequent <word1, word2> is <aa, aaa>, whose frequency is 2;
and the most frequent <word1, word2, word3> is <a, aa, aaa>, whose frequency is 1.
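A minimal sketch of one possible approach is shown below. It relies on an interpretation inferred from the example rather than stated in the problem: the frequency of a word set is the number of documents that contain every word in it, with repeats inside a document and word order ignored. The function name `most_frequent` and the alphabetical tie-breaking rule are assumptions.

```python
from collections import Counter
from itertools import combinations

def most_frequent(documents, k):
    """Return (word_tuple, frequency) for the most frequent set of k words.

    Frequency is counted as the number of documents containing all k words
    (ASSUMPTION inferred from the example in the problem statement).
    Ties are broken alphabetically, which is an arbitrary choice.
    """
    counts = Counter()
    for doc in documents:
        # Unique words of this document, sorted so each set has one tuple form.
        words = sorted(set(doc.split()))
        counts.update(combinations(words, k))
    if not counts:
        return None, 0
    best = max(counts.values())
    return min(c for c, n in counts.items() if n == best), best

docs = ["aa aa a aaa", "aa aa aaa", "aaa"]
for k in (1, 2, 3):
    print(k, most_frequent(docs, k))
```

Enumerating all combinations is fine for k up to 3 and documents with modest vocabularies. For large vocabularies, a pruning strategy such as the one used in association rule mining would likely be needed.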