$29.99
ISYE 6740, Homework 2
100 points + 15 bonus points
1. PCA: Food consumption in European countries [50 points]
The data food-consumption.csv contains 16 countries in Europe and their consumption for 20
food items, such as tea, jam, coffee, yogurt, and others. We will perform principal component
analysis to explore the data. In this question, please implement PCA by writing your own
code (you can use any basic packages, such as numerical linear algebra, reading data, in your
file).
First, we will perform PCA analysis on the data by treating each country’s food consumption as their “feature” vectors. In other words, we will find weight vectors to combine
20 food-item consumptions for each country.
1. (10 points) For this problem of performing PCA on countries by treating each country’s
food consumption as their “feature” vectors, explain how the data matrix is set-up in
this case (e.g., the columns and the rows of the matrix correspond to what).
2. (10 points) Suppose we aim to find top k principal components. Write down the
mathematical optimization problem for solving this problem (i.e., PCA optimization
problem). Show why the first principal component is obtained by using a weight vector
corresponding to the eigenvectors associated with the largest eigenvalue. Explain how
to find the rest of the principal components.
3. (10 points) Now assume k = 2, i.e., we will find the first two principal components
for each data point. Find the weight vectors w1 and w2 to extract these two principal
components. Plot these two weight vectors, respectively (e.g., in MATLAB, you can
use stem(w) to plot the entries of a vector w; similar things can be done in Python).
Explain if you find any interesting patterns in the weight vectors.
4. (10 points) Now extract the first two principal components for each data point (thus,
this means we will represent each data point using a two-dimensional vector). Draw a
scatter plot of two-dimensional representations of the countries using their two principal
components. Mark the countries on the lot (you can do this by hand if you want).
Please explain any pattern you observe in the scatter plot.
1
Now, we will perform PCA analysis on the data by treating country consumptions as
“feature” vectors for each food item. In other words, we will now find weight vectors to
combine country consumptions for each food item to perform PCA another way.
5. (10 points) Project data to obtain their two principle components (thus, again each
data point – for each food item – can be represented using a two-dimensional vector).
Draw a scatter plot of food items. Mark the food items on the plot (you can do this by
hand if you do not want). Please explain any pattern you observe in the scatter plot.
2. Order of faces using ISOMAP [50 points]
This question aims to reproduce the ISOMAP algorithm results in the original paper for
ISOMAP, J.B. Tenenbaum, V. de Silva, and J.C. Langford, Science 290 (2000) 2319-2323
that we have also seen in the lecture as an exercise (isn’t this exciting to go through the
process of generating results for a high-impact research paper!)
The file isomap.mat (or isomap.dat) contains 698 images, corresponding to different poses
of the same face. Each image is given as a 64 × 64 luminosity map, hence represented as a
vector in R
4096. This vector is stored as a row in the file. [This is one of the datasets used in
the original paper] In this question, you are expected to implement the ISOMAP algorithm
by coding it up yourself. You may use the provided functions in ShortestPath.zip to find the
shortest path as required by one step of the algorithm.
Choose the Euclidean distance (i.e., in this case, a distance in R
4096) to construct the
nearest neighbor graph—vertices corresponding to the images. Construct a similarity graph
with vertices corresponding to the images, and tune the threshold so that each node has
at least 100 neighbors (this approach corresponds to the so-called -Isomap).
(a) (10 points) Visualize the similarity graph (you can either show the adjacency matrix,
or similar to the lecture slides, visualize the graph using graph visualization packages
such as Gephi (https://gephi.org) and illustrate a few images corresponds to nodes
at different parts of the graph, e.g., mark them by hand or use software packages).
(b) (20 points) Implement the ISOMAP algorithm yourself to obtain a k = 2-dimensional
embedding. This means, each picture is represented by a two-dimensional vector (Z
in the lecture), which we called “embedding” of pictures. Plot the embeddings using a
scatter plot, similar to the plots in lecture slides. Find a few images in the embedding
space and show what these images look like. Comment on do you see any visual
similarity among them and their arrangement, similar to what you seen in the paper?
(c) (10 points) Now choose `1 distance (or Manhattan distance) between images (recall
the definition from “Clustering” lecture)). Repeat the steps above. Use -ISOMAP to
obtain a k = 2 dimensional embedding. Present a plot of this embedding. Do you see
any difference by choosing a different similarity measure by comparing results in Part
(b) and Part (c)?
2
(d) (10 points) Perform PCA (you can now use your implementation written in Question
1) on the images and project them into the top 2 principal components. Again show
them on a scatter plot. Explain whether or you see a more meaningful projection using
ISOMAP than PCA.
3. (Bonus) Eigenfaces and simple face recognition [15 points]
This question is a simplified illustration of using PCA for face recognition. We will use a
subset of data from the famous Yale Face dataset. Remark: You will have to perform
downsampling of the image by a factor of 4 to turn them into a lower resolution image as
a preprocessing (e.g., reduce a picture of size 16-by-16 to 4-by-4). In this question, you can
implement your own code or call packages.
First, given a set of images for each person, we generate the eigenface using these images.
You will treat one picture from the same person as one data point for that person. Note that
you will first vectorize each image, which was originally a matrix. Thus, the data matrix
(for each person) is a matrix; each row is a vectorized picture. You will find weight vectors
to combine the pictures to extract different “eigenfaces” that correspond to that person’s
pictures’ first few principal components.
1. (10 points) Perform analysis on the Yale face dataset for Subject 1 and Subject 2, respectively, using all the images EXCEPT for the two pictures named subject01-test.gif
and subject02-test.gif. Plot the first 6 eigenfaces for each subject. When visualizing, please reshape the eigenvectors into proper images. Please explain can you see
any patterns in the top 6 eigenfaces?
2. (5 points) Now we will perform a simple face recognition task.
Face recognition through PCA is proceeded as follows. Given the test image subject01-
test.gif and subject02-test.gif, first downsize by a factor of 4 (as before), and vectorize
each image. Take the top eigenfaces of Subject 1 and Subject 2, respectively. Then we
calculate the normalized inner product score of the 2 vectorized test images with the
vectorized eigenfaces:
sij =
(eigenface)
T
i
(test image)j
k(eigenfacei
)k · k(test image)j
k
Report all four scores: sij , i = 1, 2, j = 1, 2. Explain how to recognize the faces of the
test images using these scores. Explain if face recognition can work well and discuss
how we can improve it, possibly.
3