Problem Set 1: Dimensionality Reduction via PCA and Diffusion Maps
Unsupervised Learning for Big Data
CPSC/AMTH 453 / CBB 555 / CPSC/AMTH 553 / GENE 555 / NSCI 453
1 Introduction
Real-world data sets - such as those obtained by high-throughput drug screening or gene sequencing - are massive, containing hundreds of thousands of measured variables for tens of thousands of observations.¹ These data sets often require exploratory analysis. Common questions include
1. Which measured variables are important in understanding my data set (e.g. explain the variance or
other statistics of observed data points) and which variables are not so important?
2. Do my observed data points form natural clusters and, if so, which observed data points cluster together
and which measured variables best describe these clusters?
3. How can I visualize my observed data points to better understand these relationships and develop a
deeper intuition for my data set (whether it be a measured phenomenon or controlled experiment)?
Somewhat ironically, as we develop technologies to measure a larger number of variables (i.e. when p is large and especially when p ≫ n), analyzing this high-dimensional data and answering the questions above becomes much more complicated, both statistically and algorithmically. This is known as the curse of dimensionality.
In class, we’ve discussed dimensionality reduction techniques to reduce the dimension of data sets from p to d ≪ p while (approximately) preserving various structural properties of the data set. Such techniques include
Principal Component Analysis (PCA), t-SNE, and Diffusion Maps. Aside from their obvious algorithmic
differences, each of these techniques seeks to preserve different properties of the data set and has different
mathematical motivations and assumptions. Once we develop a dimensionality reduction technique, how
do we - as data scientists - experimentally validate these methods? Generally speaking, we first test our
method on artificially constructed data sets that we understand either geometrically or statistically (e.g.
a low-dimensional point cloud, or points drawn from a known distribution) and then see if our method
confirms the known structure in our data set. The next step is to test our method on a real data set that
we understand biologically and see if our method confirms the biological understanding.
2 Understanding the Data Set
Use the provided function read_json_files() to read the two JSON files containing the data, swiss_roll_points.json and swiss_roll_labels.json. The file swiss_roll_points.json contains 2000 points, each with 3 features, and swiss_roll_labels.json contains labels for each of the 2000 points, ranging from 0 to 1. Pre-process the data by centering it (i.e. so that the mean of the data points is zero). Next, visualize this data set using a scatter plot where the provided features are the x, y, and z coordinates and each point is colored by its label.
¹In this class, we adopt the convention of denoting the number of observed data points by n and denoting the number of measured variables by p. Thus, a data matrix X whose rows are observations and whose columns are variables has size n × p.
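For concreteness, here is a minimal sketch of this loading, centering, and plotting step. It assumes read_json_files() takes the two file paths and returns the points as an (n × 3) numpy array together with a length-n label array; check its docstring in ps1_functions.py, since the actual signature may differ.

    import numpy as np
    import matplotlib.pyplot as plt

    from ps1_functions import read_json_files

    # Assumed return values: (n x 3) array of points, length-n array of labels.
    points, labels = read_json_files('swiss_roll_points.json', 'swiss_roll_labels.json')
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels, dtype=float)

    # Center the data so that the mean of the data points is the zero vector.
    points = points - points.mean(axis=0)

    # 3D scatter plot of the swiss roll, colored by label.
    fig = plt.figure()
    ax = fig.add_subplot(111, projection='3d')
    sc = ax.scatter(points[:, 0], points[:, 1], points[:, 2], c=labels, cmap='viridis', s=5)
    fig.colorbar(sc, ax=ax, label='label')
    ax.set_xlabel('x'); ax.set_ylabel('y'); ax.set_zlabel('z')
    plt.show()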
Question 2.1. What does this visualization of the swiss roll data set look like? What properties do you
notice about the data set? Are the provided labels meaningful, and if so, in what way? What would you expect a “good” dimensionality reduction of the swiss roll data set to look like?
2.1 Visualizing Data with PCA
Now that we have an understanding of the swiss roll data set and how we expect a successful dimensionality
reduction to look, it’s time to experiment. We’ll start with PCA.
1. Run PCA on the swiss roll data set, obtaining the principal components, projections, and singular values.
2. Plot the swiss roll in two dimensions, using the principal components. You might wish to try different
combinations, since there are few of them (i.e. the first and second components, the first and third
components, the second and third components). Color the points with their corresponding labels.
3. Plot the singular values in whatever way you see fit. This could be a plot of the singular values themselves, their cumulative sum, or the “explained variance” S^2/(n − 1), where S are the singular values and n is the number of points.
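A minimal sketch of these steps is given below, re-loading and centering the swiss roll data as in Section 2; the variable names are illustrative, not part of the skeleton file.

    import numpy as np
    import matplotlib.pyplot as plt
    from ps1_functions import read_json_files

    points, labels = read_json_files('swiss_roll_points.json', 'swiss_roll_labels.json')
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels, dtype=float)
    points = points - points.mean(axis=0)
    n = points.shape[0]

    # PCA via the SVD of the centered data: rows of Vt are the principal components
    # (loadings); U * S gives the projections of the points onto those components.
    U, S, Vt = np.linalg.svd(points, full_matrices=False)
    projections = U * S
    explained_variance = S**2 / (n - 1)

    # Projection onto the first two principal components, colored by label.
    plt.figure()
    plt.scatter(projections[:, 0], projections[:, 1], c=labels, cmap='viridis', s=5)
    plt.xlabel('PC 1'); plt.ylabel('PC 2'); plt.colorbar(label='label')

    # Explained variance of each component.
    plt.figure()
    plt.plot(np.arange(1, S.size + 1), explained_variance, 'o-')
    plt.xlabel('component'); plt.ylabel('explained variance')
    plt.show()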
Question 2.2. As a dimensionality reduction technique, to what extent does PCA retain properties of the
swiss roll data set? Can you explain why the visualizations look like this, given how the algorithm works?
What can you learn about the intrinsic dimensionality from the singular values?
3 Implementing Diffusion Maps
In this assignment, you will implement a Diffusion Map and test this dimensionality reduction method on
artificial and real data sets. All code will be written in Python. You are expected to submit a detailed
write-up of your experiments, as well as your code, to the Canvas site by the due date of October 6,
11:59pm. The layout of the assignment is as follows: Section 3 describes the code specifications that you
will write for implementing the Diffusion Map technique, as well as coding tips. Section 4 outlines the first
experiment on the “Swiss Roll”, a popular artificial data set, and Section 6 outlines the second experiment
on a single-cell mass cytometry data set for iPSC Reprogramming. Section 7 contains the grading rubric
and Section 8 contains the submission instructions.
3.1 Algorithm Review
The 2006 paper “Diffusion Maps” by Ronald Coifman and Stéphane Lafon [1] is the most widely cited paper on the Diffusion Maps algorithm, probably because it provides a very extensive analysis. However, the 2005 NeurIPS paper “Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators” [2] is more user-friendly, and I recommend reading that paper for further clarification. Let’s quickly review the
algorithm now. Recall that for data points X = {x1, . . . , xn}, a kernel matrix W with entries defined by the kernel

$$\kappa(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{\sigma^2}\right)$$

and a diffusion parameter t, the Diffusion Maps algorithm creates an embedding of the points Ψt : X → R^n based on the right eigenvectors of the powered Markov transition matrix M^t, where M = D^{-1}W and D is the diagonal matrix of row sums. More specifically, let 1 = λ1 ≥ λ2 ≥ · · · ≥ λn ≥ 0 be the eigenvalues of M with right eigenvectors ψ1, ψ2, . . . , ψn.²
Then the diffusion map is given by

$$\Psi_t(x_j) = \left(\lambda_2^t\,\psi_2(j),\ \lambda_3^t\,\psi_3(j),\ \ldots,\ \lambda_n^t\,\psi_n(j)\right).$$

²It shouldn’t be immediately obvious that this non-symmetric matrix has positive real eigenvalues, with the largest one being equal to 1. This can be shown in a few ways, but the most popular way is through the Perron–Frobenius theorem. Proving this would be a good exercise in linear algebra.

Recall that the eigenvector with eigenvalue 1 is the constant vector; for this reason, we call it the trivial
eigenvector and we discard it from our diffusion map since it reveals no information about our data. As it
turns out, computing the eigenpairs (λi, ψi) can be done by first computing the eigenpairs of the symmetric matrix Ms = D^{-1/2} W D^{-1/2}, denoted (λ̃i, vi), and then taking λi = λ̃i and ψi = D^{-1/2} vi. For somewhat technical reasons, this method is more efficient and numerically stable because Ms is a real symmetric matrix. It is a good exercise to show that, indeed, the eigenvalues of M and Ms are the same and that D^{-1/2} vi is a corresponding eigenvector ψi of M, as advertised. We now summarize the Diffusion Maps algorithm below.
Algorithm 1: Diffusion Maps Algorithm
Input: data set X = {x1, . . . , xn}, kernel function κ(·, ·), diffusion parameter t
Output: diffusion mapping Ψt : X → R^n
1. Construct the kernel matrix W, with Wi,j = κ(xi, xj);
2. Construct the symmetric matrix Ms = D^{-1/2} W D^{-1/2};
3. Compute the eigenpairs (λ1, v1), . . . , (λn, vn) of Ms;
4. Compute the (normalized) eigenvectors of M as ψi = D^{-1/2} vi / ‖D^{-1/2} vi‖;
5. Construct the diffusion map Ψt(xj) = (λ2^t ψ2(j), λ3^t ψ3(j), . . . , λn^t ψn(j));
6. return Ψt
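For reference, a possible numpy sketch of Algorithm 1 is below. The function names and signatures here are illustrative only; your submission must follow the signatures given in ps1_functions.py, which may differ.

    import numpy as np

    def gaussian_kernel(X, sigma):
        """Kernel matrix W with W[i, j] = exp(-||x_i - x_j||^2 / sigma^2)."""
        sq_norms = np.sum(X**2, axis=1)
        sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
        sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative values
        return np.exp(-sq_dists / sigma**2)

    def diffusion_map(W, t):
        """Diffusion map from a kernel (affinity) matrix W.

        Returns an (n, n-1) array whose j-th row is
        (lambda_2^t psi_2(j), ..., lambda_n^t psi_n(j)), together with all
        eigenvalues of the Markov matrix M = D^{-1} W in descending order.
        """
        d = W.sum(axis=1)                       # row sums (the diagonal of D)
        d_inv_sqrt = 1.0 / np.sqrt(d)
        Ms = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]   # D^{-1/2} W D^{-1/2}

        # eigh returns eigenvalues in ascending order; flip to descending.
        eigvals, eigvecs = np.linalg.eigh(Ms)
        eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

        # Normalized right eigenvectors of M: psi_i = D^{-1/2} v_i / ||D^{-1/2} v_i||.
        psi = d_inv_sqrt[:, None] * eigvecs
        psi /= np.linalg.norm(psi, axis=0, keepdims=True)

        # Drop the trivial first eigenpair and weight by lambda_i^t.
        return psi[:, 1:] * (eigvals[1:] ** t)[None, :], eigvals

For the swiss roll with a fixed-width kernel, this would be called as, for example, W = gaussian_kernel(points, 3.0) followed by Psi, eigvals = diffusion_map(W, 1).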
3.2 Implementation Details
This assignment will be written using Python. For array computations, you should use the numpy package.
You may find the following commands particularly useful (some of these are in the numpy.linalg subpackage): sum, std, exp, abs, zeros, dot, norm, power, svd, and eigh. See the numpy documentation online
for detailed function descriptions. For plotting, please use the matplotlib package, as it is easy to use and
interfaces well with numpy arrays.
I have provided you with a skeleton file, ps1_functions.py, that contains the following empty functions which you should complete. The purpose of each function is briefly stated in the function header, but you should review the Diffusion Maps method to remind yourself how these functions should fit together. The distance matrix calculation and the eigendecomposition are the most computationally expensive components of the algorithm; we have designed the functions in the skeleton file accordingly. The skeleton file ps1_functions.py contains a function read_json_files(), which can be used for loading data. You may add more functions to ps1_functions.py if you feel that they will help you in your experiments, but these additional functions will not be tested by the grader. You should create new Python files for running your experiments, outlined in Sections 4 and 6. In these experiment files, you may call functions defined in ps1_functions.py using the import command in Python. All code should be sufficiently commented so that someone who understands diffusion maps³ can easily understand your code.

³For instance, the course grader.
In lecture, we introduced PCA as providing p linear combinations of features that best explain the variance. To obtain the projected data points, we can evaluate these linear combinations on the features of each data point; however, this is not the most efficient computational method. We mentioned that the Singular Value Decomposition (SVD), which decomposes a data matrix as X = USV^T, is an appropriate way to obtain both the linear combinations of features, or loadings (which are given in V), and the projections of the data points (which are given by US). When writing your code, you should always center the data and compute its SVD rather than constructing the covariance matrix and computing its eigendecomposition. For computing diffusion maps, it is best to use eigh, for reasons described above.
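If you want to convince yourself of this equivalence numerically, a quick check along these lines (with a random matrix standing in for your data) may be helpful:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    Xc = X - X.mean(axis=0)                 # center the data

    # Route 1: SVD of the centered data matrix.
    S = np.linalg.svd(Xc, compute_uv=False)
    var_svd = S**2 / (Xc.shape[0] - 1)

    # Route 2: eigendecomposition of the covariance matrix.
    cov = Xc.T @ Xc / (Xc.shape[0] - 1)
    var_eig = np.linalg.eigh(cov)[0][::-1]  # eigh is ascending; flip to descending

    print(np.allclose(var_svd, var_eig))    # True: the two routes agree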
4 Experiment 1: Swiss Roll Dataset
The “swiss roll” data set is a commonly used synthetic data set to test embedding algorithms. The swiss
roll is a good test case because it consists of points sampled from a 2-dimensional manifold. Without
getting overly technical, a set X ⊂ R^n is a 2-dimensional manifold if at every point x ∈ X, a neighborhood around x looks just like a section of R^2. Geometrically, you can think about this as a 2-dimensional sheet in 3-dimensional space (or higher). Because the swiss roll has this kind of structure, it’s easy for us to understand what a “good” 2-dimensional embedding looks like. A picture would certainly be illuminating here and in
fact, this will be your first task.
4.1 Visualizing Data with Diffusion Maps
Now we will look at diffusion maps as a dimensionality reduction technique. Recall that for each point x in our data set, Ψt(x) is an n-dimensional vector⁴, so when we visualize it we can only look at two- or three-dimensional slices.

⁴Although in practice, it is common to compute only the first k dimensions; this is easily done with standard eigensolvers.

Your tasks will be to
1. Construct the diffusion map Ψt of the swiss roll data set using euclidean distance, Gaussian kernel
with width σ = 3.0, and diffusion parameter t = 1.
2. Create two-dimensional scatter plots of the diffusion mapping using different coordinates.
3. Plot the eigenvalues λi of the Markov matrix M
4. Repeat for σ = 1.0 and σ = 6.0
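Using the sketch from Section 3.1 (or your own functions from ps1_functions.py), these tasks might be wired together roughly as follows; gaussian_kernel and diffusion_map are the illustrative helpers defined earlier, and points and labels are the centered swiss roll data and labels from Section 2.

    import matplotlib.pyplot as plt

    for sigma in (1.0, 3.0, 6.0):
        W = gaussian_kernel(points, sigma)
        Psi, eigvals = diffusion_map(W, t=1)

        # First two diffusion coordinates, colored by the swiss roll labels.
        plt.figure()
        plt.scatter(Psi[:, 0], Psi[:, 1], c=labels, cmap='viridis', s=5)
        plt.title(f'Diffusion map, sigma = {sigma}')
        plt.xlabel('diffusion coordinate 1'); plt.ylabel('diffusion coordinate 2')

        # Eigenvalues of the Markov matrix M.
        plt.figure()
        plt.plot(eigvals, 'o-')
        plt.title(f'Eigenvalues of M, sigma = {sigma}')
    plt.show()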
Question 4.1. As a dimensionality reduction technique, to what extent does diffusion mapping retain properties of the swiss roll data set? Can you explain why the visualizations look like this, given how the algorithm
works? What can you learn about the intrinsic dimensionality of the data set from the eigenvalues of M?
How does the choice of Gaussian kernel width σ change the embedding and why?
5 Understanding the First Eigenvector of the Markov Matrix
Now let’s see a curious property of the first left eigenvector of the Markov matrix. Your task will be to
1. Construct the affinity matrix of the swiss roll data set using euclidean distance and a Gaussian kernel with width σ = 1.0.
2. Compute the left eigenvector φ1 of M = D^{-1}W corresponding to the largest eigenvalue, which may be obtained as φ1 = D^{1/2} v1.
3. Plot the swiss roll in 3 dimensions using the original coordinates and color the points using the corresponding values in φ1.
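A sketch of this computation, again assuming the centered swiss roll array points and the illustrative gaussian_kernel helper from earlier; v_1 below is the top eigenvector of the symmetric matrix M_s.

    import numpy as np
    import matplotlib.pyplot as plt

    W = gaussian_kernel(points, 1.0)
    d = W.sum(axis=1)

    # Top eigenvector v_1 of the symmetric matrix M_s = D^{-1/2} W D^{-1/2}.
    Ms = W / np.sqrt(np.outer(d, d))
    eigvals, eigvecs = np.linalg.eigh(Ms)
    v1 = eigvecs[:, -1]                 # eigh is ascending, so the last column is v_1

    # First left eigenvector of M = D^{-1} W: phi_1 = D^{1/2} v_1.
    phi1 = np.sqrt(d) * v1
    if phi1.sum() < 0:                  # fix the arbitrary sign of the eigenvector
        phi1 = -phi1

    fig = plt.figure()
    ax = fig.add_subplot(111, projection='3d')
    sc = ax.scatter(points[:, 0], points[:, 1], points[:, 2], c=phi1, cmap='viridis', s=5)
    fig.colorbar(sc, ax=ax, label='phi_1')
    plt.show()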
Question 5.1. How do the values in φ1 correspond to the structure of the swiss roll? Can you explain what
you are seeing in terms of diffusion?
5.1 Using an Adaptive Gaussian Kernel
We have seen the advantages of diffusion maps over PCA in the last portion of the experiment; however, the
Gaussian kernel width σ was fixed for all data points. In this portion of Experiment 1, we’ll see the effect of
using an adaptive Gaussian kernel. To this end, we will use an adaptive k-nearest neighbors kernel, defined
in the following way. Let the k-nearest neighbors kernel κk-nn be given by
$$\kappa_{k\text{-nn}}(x_i, x_j) = \frac{1}{2}\left[\exp\left(-\frac{\|x_i - x_j\|^2}{\sigma_k(x_i)^2}\right) + \exp\left(-\frac{\|x_i - x_j\|^2}{\sigma_k(x_j)^2}\right)\right]$$
where σk(x) is the distance from x to its kth nearest neighbor. Note that κk-nn satisfies the properties of a
kernel. Let σk-nn = {σ1, . . . , σn} denote these choices of adaptively chosen kernel parameters, for a fixed k.
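A possible numpy sketch of this adaptive kernel is below (distances computed by brute force, function name illustrative):

    import numpy as np

    def adaptive_kernel(X, k):
        """Adaptive k-nearest-neighbor Gaussian kernel.

        sigma_k(x_i) is the distance from x_i to its k-th nearest neighbor
        (excluding x_i itself), and
        W[i, j] = 0.5 * (exp(-d_ij^2 / sigma_k(x_i)^2) + exp(-d_ij^2 / sigma_k(x_j)^2)).
        """
        sq_norms = np.sum(X**2, axis=1)
        sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
        sq_dists = np.maximum(sq_dists, 0.0)
        dists = np.sqrt(sq_dists)

        # Sorted row i has the self-distance 0 in column 0, so column k holds the
        # distance from x_i to its k-th nearest neighbor.
        sigma = np.sort(dists, axis=1)[:, k]

        return 0.5 * (np.exp(-sq_dists / sigma[:, None]**2)
                      + np.exp(-sq_dists / sigma[None, :]**2))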
Now that we have defined this new adaptive kernel, let’s see how it performs differently than a fixed-width
kernel. Your tasks for this portion of Experiment 1 are
1. Construct the diffusion map Ψt of the swiss roll data set using euclidean distance, adaptive k-nearest
neighbor Gaussian kernel with k = 5, and diffusion parameter t = 1.
2. Create two-dimensional scatter plots of the diffusion mapping using different coordinates.
3. Plot the eigenvalues λi of the Markov matrix M
4. Repeat for k = 10
Question 5.2. What are the differences between fixed and adaptive choices of kernel parameters σ? Explain
this difference. What can you learn about the intrinsic dimensionality of the data set from the eigenvalues of
M? How does the choice of nearest neighbor parameter k change the diffusion map? Which kernel method
would you recommend using for the swiss roll data set and why?
5.2 Changing the Diffusion Parameter t
So far, we have kept the diffusion parameter set to t = 1. In this portion of Experiment 1, we will observe the effects of changing t. Note that for observing different t, you can re-use the affinity matrix and
eigendecomposition computations. To this end, your tasks are
1. Construct the diffusion map Ψt of the swiss roll data set using euclidean distance, adaptive k-nearest
neighbor Gaussian kernel with k = 10, and diffusion parameter t = 1.
2. Create two-dimensional scatter plots of the diffusion mapping using different coordinates.
3. Plot the eigenvalues λi of the Markov matrix M
4. Repeat for t = 10, 20, 50
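Because only the weights λi^t depend on t, you can compute the kernel and the eigendecomposition once and then re-scale for each t, for example as below (with psi the matrix of right eigenvectors ψi of M and eigvals the corresponding eigenvalues, both computed once as in Algorithm 1, and labels the swiss roll labels):

    import matplotlib.pyplot as plt

    # psi, eigvals: eigenvectors/eigenvalues of M, computed once with k = 10.
    for t in (1, 10, 20, 50):
        embedding = psi[:, 1:] * (eigvals[1:] ** t)[None, :]
        plt.figure()
        plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap='viridis', s=5)
        plt.title(f'Diffusion map, t = {t}')
    plt.show()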
Question 5.3. In terms of diffusion processes, what is the interpretation of increasing t? How do the
diffusion embeddings visually change as t increases? How do the eigenvalues change and how does this help
explain what you see in the embeddings? What can you learn about the intrinsic dimensionality of the data
set from the eigenvalues of M as t increases? Is there a specific value of t that you find most informative
for the Swiss roll data set?
5.3 Final Thoughts
That’s all of the computation we need for understanding these methods on synthetic data. Fortunately, you
should be able to use all the code you’ve written for this experiment for the next experiment with real data.
Before doing that, let’s reflect on the data science aspect of our tests.
Question 5.4. Suppose your boss is very delighted by your implementation of diffusion mappings and the
initial tests on the swiss roll dataset; however, she would like to see how the method works on other artificial
“control” data sets. Describe two other data sets that you might generate and what you would expect to see
from these tests. You do not need to construct these two data sets.
6 Experiment 2: iPSC Reprogramming Dataset
This is a mass cytometry dataset taken from [3] that shows cellular reprogramming from mouse embryonic
fibroblasts (MEFs) to induced pluripotent stem cells (iPSCs) via the introduction of a transcription factor
known as Oct4. This data set measures 33 proteins at single-cell resolution. The protein markers measure
1. pluripotency or stem-ness, e.g. sox4, oct4
2. differentiation: e.g. nanog and lin28 which are embryonic stem cell (ESC) markers
3. cell-cycle: e.g. pRB which is a cell proliferation marker
4. signaling status: e.g. pakt which is a growth inducing signal and p53 which signals DNA damage and
could induce cell cycle arrest. The “p” preceding some of these markers indicates that the measurement
is for the phosphorylated (usually active) conformation of the protein
5. apoptosis: e.g. ccasp3 which is a caspase marker involved in programmed cell death
Additionally, a “timepoint” marker indicates the day (after stimulation with the reprogramming factors) on which the measurement was taken. Here you have a mix of timepoints. These are low-granularity time points (on the order of days), while the actual transition is much finer grained. These protein markers enable the measurement of the reprogramming status. For instance, a subset of cells shows evidence of being early reprogramming intermediates with the “correct” set of reprogramming factors Sox2+Oct4+Klf4+Nanog+, while another subset appears successfully reprogrammed into ESC-like lineages expressing markers such as Nanog, Oct4, Lin28, Ssea1, and Epcam, which are associated with the transition to pluripotency.
As before, use the provided read_json_files() for reading the data set ipsc_data_set.json and the channel names channel_names.json. The original data set contained 220,450 observed time points; however, for computational purposes, we have subsampled the data set to contain 2,000 time points. The timepoint of a data point is simply its row number: the first timepoint is row one, the second timepoint is row two, and so on. A variety of other pre-processing steps have already been done for you.
6.1 Visualizing Data with PCA
Let’s first explore how PCA can help us understand the processed data set. Your tasks for implementing
PCA here are
1. Run PCA on the processed iPSC data set, obtaining the principal components, projections, and singular
values.
2. Plot the iPSC data set in two dimensions, using the principal components. You might wish to try a
few combinations of the top k = 5 principal components. When plotting the projections, color them
using the time steps.
3. Plot the singular values in whatever way you see fit.
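To read off the top channels for Question 6.1, one option is to sort the PCA loadings by absolute value. Here Vt is the loadings matrix from the SVD of the centered iPSC data and channel_names is the list loaded from channel_names.json (illustrative variable names):

    import numpy as np

    # Rows of Vt are the principal directions; each has one loading per channel.
    for pc in (0, 1):
        top = np.argsort(np.abs(Vt[pc]))[::-1][:5]
        print(f'Top channels for PC {pc + 1}:', [channel_names[i] for i in top])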
Question 6.1. How do the PCA visualizations look? Are they capturing the time progression of the data
set? Do you see any clusters forming? What is the intrinsic dimensionality given by PCA? What are the
top few channels in the first and second principal directions?
6.2 Visualizing Data with Diffusion Maps
Now let’s do a similar analysis using Diffusion Maps and see what changes. Here, your task will be to
1. Construct the diffusion map Ψt of the iPSC data set using euclidean distance and the adaptive k-nearest neighbors Gaussian kernel with k = 2, and diffusion parameter t = 1.
2. Create two-dimensional scatter plots of the diffusion mapping using different coordinates.
3. Create a three-dimensional scatter plot of the diffusion mapping using the first three coordinates for
plotting. Color the points with their corresponding timepoint values.
4. Plot the associated eigenvalues of the Markov matrix.
5. Compute the top 5 channels that have the highest absolute correlation with the first, second, and third
diffusion components.
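One way to compute these correlations, assuming data is the (n × 33) pre-processed iPSC matrix, Psi holds the diffusion coordinates (first column = first non-trivial component), and channel_names is the channel list:

    import numpy as np

    for comp in range(3):
        corrs = np.array([np.corrcoef(data[:, ch], Psi[:, comp])[0, 1]
                          for ch in range(data.shape[1])])
        top = np.argsort(np.abs(corrs))[::-1][:5]
        print(f'Diffusion component {comp + 1}:',
              [(channel_names[i], round(corrs[i], 2)) for i in top])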
Question 6.2. How do the Diffusion Map visualizations look? Are they capturing the time progression of
the data set? Do you see any clusters forming? What is the intrinsic dimensionality given by Diffusion
Maps? Which channels are most highly correlated with the diffusion components? Any guesses as to the biological
interpretation? What are the main differences between Diffusion Maps and PCA methods on the iPSC data
set?
6.3 Changing Parameters
In this section, we’ll examine how changing parameters affects the iPSC data set. Your tasks will be
1. Try two different values of diffusion parameter t and visualize the embeddings and the eigenvalues.
2. Try two different values of k for the adaptive kernel and visualize the embeddings and the eigenvalues.
3. For every choice of parameters above, compute the top 5 channels that have the highest absolute
correlation with the first, second, and third diffusion components.
Question 6.3. Did the choice of t affect the embeddings and, if so, how? Do the corresponding eigenvalues support your findings? Did the choice of k affect the embeddings and, if so, how? Did previously observed trends (i.e. clusters, time progression) in the diffusion dimensions change dramatically? Did the correlated channels change with σ? How would you interpret this in terms of the data?
7 Grading Rubric
7.1 Implementation – 20 points
Your implementation will be graded by running your submitted ps1_functions.py on several test cases. It
is very important that you do not change the function input / outputs provided in the skeleton file. If
you do, then your functions will not properly handle the test cases and you may receive no credit for the
implementation portion of the assignment. Please adequately comment your code so that partial credit may
be assigned in the event that your code does not pass all test cases.
7.2 Report – 80 points
You must write a detailed report about the experiments that you have run for this homework, including
all plots and visualizations. Your report should address the questions raised here, but does not need to
follow any particular structure or outline. Feel free to add any interesting insights that you had or extra
experiments / validations you ran.
7.3 Cheating Policy
Plagiarism of any kind will not be tolerated in this course. Students are encouraged to discuss general ideas
with each other, but must write code and carry out experiments individually. However, the course staff
(professor, TA, etc.) is happy to provide assistance in all aspects of the assignment, from coding difficulties
to interpretation of experiments. Any plagiarism - whether copying code or sharing experimental analyses -
will not be tolerated.
8 Submission Instructions
Each student should submit a zip file titled [last name]_[first name]_ps1.zip to Canvas by October 6, 11:59pm, containing
1. A detailed write-up in pdf form, titled [last name]_[first name]_ps1_report.pdf
2. A subdirectory titled code containing all code used in the assignment
3. A subdirectory titled figures containing all figures used in the write-up
References
[1] Ronald R Coifman and Stéphane Lafon. “Diffusion maps”. In: Applied and Computational Harmonic Analysis 21.1 (2006), pp. 5–30.
[2] Boaz Nadler et al. “Diffusion maps, spectral clustering and eigenfunctions of Fokker-Planck operators”. In: Advances in Neural Information Processing Systems. 2006, pp. 955–962.
[3] Eli R Zunder et al. “A continuous molecular roadmap to iPSC reprogramming through progression
analysis of single-cell mass cytometry”. In: Cell Stem Cell 16.3 (2015), pp. 323–337.