Starting from:

$30

Assignment #7 K-Means Clustering

Machine Learning
COMP 5630/ COMP 6630/ COMP 6630 

Assignment #7
K-Means Clustering
Submission Instructions
 Please submit your solutions via Canvas (https://auburn.instructure.com/). You should submit your assignment
as a PDF file. Please do not include blurry scanned/photographed equations as they are
difficult for us to grade.
Late Submission Policy
The late submission policy for assignments will be as follows unless otherwise specified:
1. 75% credit within 0-48 hours after the submission deadline.
2. 50% credit within 48-96 hours after the submission deadline.
3. 0% credit beyond 96 hours after the submission deadline.
1
Tasks
1 K-Means Implementation [50 pts]
In this problem, you will implement Lloyd’s method for the k-means clustering problem and
answer several questions about the k-means objective, Lloyd’s method, and k-means++.
Recall that given a set S = x1, ..., xn → Rd of n points in d-dimensional space, the goal of
the k-means clustering is to find a set of centers c1, ..., ck ∈ R
d
that minimize the k-means
objective:
Xn
j=1
min
i∈{1,...,k}
||xj − ci
||2
which measures the sum of squared distances from each point xj to its nearest center.
Consider the following simple brute-force algorithm for minimizing the k-means objective: enumerate all the possible partitionings of the n points into k clusters. For each
possible partitioning, compute the optimal centers c1, ..., ck by taking the mean of the points
in each cluster and computing the corresponding k-means objective value. Output the best
clustering found. This algorithm is guaranteed to output the optimal set of centers, but
unfortunately, its running time is exponential in the number of data points.
a) [10 pts]: For the case k = 2, argue that the running time of the brute-force algorithm
above is exponential in the number of data points n.
In class, we discussed that finding the optimal centers for the k-means objective is NPhard, which means that there is likely no algorithm that can efficiently compute the
optimal centers. Instead, we often use Lloyd’s method, which is a heuristic algorithm
for minimizing the k-means objective that is efficient in practice and often outputs
reasonably good clusterings. Lloyd’s method maintains a set of centers c1, ..., ck and a
partitioning of the data S into k clusters, C1, ..., Ck. The algorithm alternates between
two steps: (i) improving the partitioning C1, ..., Ck by reassigning each point to the
cluster with the nearest center, and (ii) improving the centers c1, ..., ck by setting ci
to be the mean of those points in the set Ci
for i = 1, ..., k. Typically, these two steps
are repeated until the clustering converges (i.e, the partitioning C1, ..., Ck remains
unchanged after an update). Pseudocode is given below:
(a) Initialize the centers c1, ..., ck and the partition C1, ..., Ck arbitrarily.
(b) Do the following until the partitioning C1,..., Ck does not change:
i. For each cluster-index i, let Ci = {x ∈ S : x is closer to ci than any other center},
breaking ties arbitrarily but consistently.
ii. For each cluster index i, let ci =
1
|Ci|
P
x∈Ci
x.
In the remainder of this problem, you will implement and experiment with of Lloyd’s kmeans clustering algorithm for image segmentation. Specifically, You will work on the
2
“k-means.py” python file along with the image dataset and template package provided
to you. Using PIL, this program will load a selected image, represent each pixel using
its RGB values and analyze pixel-by-pixel RGB values to find the centroid values of the
image. K represents the number of centroids that are initialized randomly within the
min and max RGB values of the image. Once the centroid values have been optimized
using k-means, the program will produce and display the segmented image with the
found RGB centroid values.
(a) [10 pts]: Complete the function assignP ixels(centroids). The input centroids
is a list of current centroids. The function assigns each pixel to the current
centroids for the algorithm. Specifically, this method finds the closest centroid
to the given pixel, then assigns that centroid to the pixel. The function should
return a dictionary clusters where the keys are the unique centroids, and values
are the pixels assigned to each centroid. Note that, this process can reduce the
number of clusters if there are duplicate centroids.
(b) [10 pts]: Complete the function adjustCentroids(clusters) that is used to recenter the centroids according to the pixels assigned to each. A mean average is
applied to each cluster’s RGB values, which are then set as the new centroids.
Output is the list of new centroids.
(c) [10 pts]: Complete the function initializeKmeans(someK) which creates a list
of K centroids and initializes them with the RGB values of randomly selected K
different pixels. That means, first sample K sample pixels, i.e. (x,y) locations,
read their RGB values and used those RGB values for initializing the K centroids.
Finally, return the list of K centroids.
(d) [10 pts]: Complete the function iterateKmeans(centroids) that iterates the kmeans clustering steps for maximum 20 iterations. However, you can stop early if
converged(centroids, old centroids) returns True. Converged is a function provided to you that will help determine if the centroids have converged or not.
2 Analyze the Effect of k on Lloyd’s method [20 pts]
In this part, you will investigate the effect of the number of clusters k on Lloyd’s method
and a simple heuristic for choosing a good value for k. Your dataset will still be the 12
different 2D images provided in the “img” folder.
(a) [10 pts]: Write a short script to plot the k-means objective value obtained for each
value of k in 1, ..., 20. The x-axis of your plot should be the value of k used, and
the y-axis should be the k-means objective. Include both the plot and script in your
written report.
(b) [10 pts]: One heuristic for choosing the value of k is to pick the value at the “elbow”
or bend of the plot produced in part (a), since increasing k beyond this point results
in a relatively little gain. Based on your plot, what value of k would you choose? Does
this agree with your intuition when looking at the data? Why or why not?
3
3 Analyze the Effect of Initialization on Lloyd’s method
[30 pts]
Now, you will investigate the effect of centroid initialization on Lloyd’s method. In particular, you will compare random initialization with Lloyd’s method. Note that the provided
k-means script has the randomized initialization already implemented. In this problem, you
just need to run the code and answer questions.
a) [5 pts]: For test image 5, set k = 2 and run Lloyd’s method 10 times (each time
with random initialization) and save the 10 output plots. What do you see? Can you
explain the output?
b) [5 pts]: For test image 5, set k = 10 and run Lloyd’s method 10 times (each time
with random initialization) and save the 10 output plots. What do you see? Can you
explain the output?
c) [10 pts]: For test image 8, set k = 2 and run Lloyd’s method 10 times (each time
with random initialization) and save the 10 output plots. Pick an image that shows
the highest contrast. Repeat this process for k = {3, 4, 5}. Again, pick the highest
contrast image for each k. Add the highest contrast images to your report. You should
have 4 images.
d) [10 pts]: Based on all these experiments, propose some heuristics for initialization
of centroids (apart from random initialization) which can help achieve meaningful
segmentation results.

More products