Starting from:

$30

Assignment 4-Dimension Reduction & Clustering

CS534 — Implementation Assignment 4 — 
General instruction.
1. This is a bonus assignment. If submitted, this can count for up to 5% of the final grade.
2. The following languages are acceptable: Java, C/C++, Matlab, Python and R.
3. You can work in team of up to 3 people. Each team will only need to submit one copy of the source code
and report.
4. Your source code and report will be submitted through the TEACH site https://secure.engr.oregonstate.
edu:8000/teach.php?type=want_auth
Please clearly indicate your team members’ information.
5. Be sure to answer all the questions in your report. You will be graded based on your code as well as the
report. So please write your report in clear and concise manner. Clearly label your figures, legends, and
tables.
6. In your report, the results should always be accompanied by discussions of the results. Do the results
follow your expectation? Any surprises? What kind of explanation can you provide?
Dimension Reduction & Clustering
In this assignment you will implement 1) Principle Component Analysis (PCA); 2) Linear Discriminant
Analysis (LDA); and 3) Kmeans clustering. You will run your implementation on the provided data set,
which are sensor readings of a smart phone while the user is doing various activities (walking, standing,
sitting, etc). For our purposes, we have filtered this data down to only two categories: walking up stairs and
walking down stairs1
. The dimensionality is 477 different sensor readings.2
You will apply PCA and LDA on the “walking” data set to reduce the dimensions, and then K-Means
clustering to the resulting reduced data, comparing the results. You will use the purity measure to measure
the resulting clustering performance. Specifically, to measure purity using ground truth class labels, you
first assign each cluster (and all of its instances) a class label based on the majority class it contains.
Purity simply measures the accuracy of the resulting assignment. You need to submit source code of your
implementations (PCA, LDA Kmean, and the experimental evaluation of them). Be sure that your code is
properly commented to enhance its readability.
Specifically you need to conduct the following experiments with your implementation.
1. (15 pts) Fix k = 2, apply kmeans to the original data (477 dimensions). Measure and report the class
purity of the resulting clustering solution. Specifically, you will need to measure purity using ground
truth class labels. First assign each cluster (and all of its instances) a class label based on the majority
class it contains. Purity simply measures the accuracy of the resulting assignment. Because kmeans is
sensitive to random initialization, you should randomly restart kmeans 10 times, then pick the solution
with the best Sum Squared Error objective, and then measure its class purity.
2. (10 pts) Compute the principal components of the data. To do this, you need to first compute the
covariance matrix of your data using the equation on the dimension reduction slides. Then compute
the eigen vectors of the covariance matrix and sort them according to the eigen values (in decreasing
order). Note that you don’t need to implement your own eigen decomposition function. Feel free to
use any numerical package for this purpose. For example, in matlab, function eig can be used. Use the
results to answer the following question: what is the smallest dimension we can reduce to if we wish
to retain at least 80% and 90% of the total variance respectively?
3. (10 pts) Use the principal components computed in (2) to reduce the data from 477 to 1, 2 and 3
dimensions respectively. For each choice, apply kmeans with k = 2 to the resulting reduced data and
report their purity measures. How do they compare to the results reported in (1)?
1For our labels, we have denoted 0 as walking down and 1 as walking up. So yi = 0 means yi = down, and yi = 1 means
yi = up.
2The original data set was downloaded from https://www.kaggle.com/uciml/human-activity-recognition-with-smartphones.
We have processed this data and then filtered it down.
1
4. (15 pts) Compute the project direction that best separates walking upstairs from walking downstairs by
applying Linear discriminant analysis. For this part, you will use the class label as LDA is a supervised
dimension reduction technique. First you compute the mean for each walking activity separately (say
m1 and m2). You will then compute the within-class scatter matrix, assuming xi
is a column vector
of 477 dimensions representing the i-th activity in the data, using the following equation:
S =
X
yi=down
(xi − m1) ∗ (xi − m1)
T +
X
yi=up
(xi − m2) ∗ (xi − m2)
T
The projection vector is then computed as w = S
−1 ∗ (m1 −m2). Note that similarly you don’t need to
implement the inversion function. Use existing numerical package (e.g., inv for matlab) for this purpose
is fine. Project the data onto this direction, and then apply kmeans to the resulting 1-d data to find
2 clusters. Measure and report its class purity. How does this compare to the results you obtained in
(3)?
5. (10 pts) Provide a discussion on the suitability of PCA and LDA for this particular dataset.
2

More products