$30
A Mathematical Introduction to Data Science
Homework 2. Random Matrix Theory and PCA
The problem below marked by ∗
is optional with bonus credits. For the experimental problem,
include the source codes which are runnable under standard settings. Since there is NO grader
assigned for this class, homework will not be graded. But if you would like to submit your exercise,
please send your homework to the address (datascience.hw@gmail.com) with a title “CSIC5011:
Homework #”. I’ll read them and give you bonus credits.
1. Phase transition in PCA “spike” model: Consider a finite sample of n i.i.d vectors x1, x2, . . . , xn
drawn from the p-dimensional Gaussian distribution N (0, σ2
Ip×p + λ0uuT
), where λ0/σ2
is
the signal-to-noise ratio (SNR) and u ∈ R
p
. In class we showed that the largest eigenvalue λ
of the sample covariance matrix Sn
Sn =
1
n
Xn
i=1
xix
T
i
pops outside the support of the Marcenko-Pastur distribution if
λ0
σ
2
>
√
γ,
or equivalently, if
SNR >
r
p
n
.
(Notice that √γ < (1 + √γ)
2
, that is, λ0 can be “buried” well inside the support MarcenkoPastur distribution and still the largest eigenvalue pops outside its support). All the following
questions refer to the limit n → ∞ and to almost surely values:
(a) Find λ given SNR >
√γ.
(b) Use your previous answer to explain how the SNR can be estimated from the eigenvalues
of the sample covariance matrix.
(c) Find the squared correlation between the eigenvector v of the sample covariance matrix
(corresponding to the largest eigenvalue λ) and the “true” signal component u, as a
function of the SNR, p and n. That is, find |hu, vi|2
.
(d) Confirm your result using MATLAB, Python, or R simulations (e.g. set u = e; and
choose σ = 1 and λ0 in different levels. Compute the largest eigenvalue and its associated
eigenvector, with a comparison to the true ones.)
1
Homework 2. Random Matrix Theory and PCA 2
2. Exploring S&P500 Stock Prices: Take the Standard & Poor’s 500 data:
https://github.com/yao-lab/yao-lab.github.io/blob/master/data/snp452-data.mat
which contains the data matrix X ∈ R
p×n of n = 1258 consecutive observation days and
p = 452 daily closing stock prices, and the cell variable “stock” collects the names, codes, and
the affiliated industrial sectors of the 452 stocks. Use Matlab, Python, or R for the following
exploration.
(a) Take the logarithmic prices Y = log X;
(b) For each observation time t ∈ {1, . . . , 1257}, calculate logarithmic price jumps
∆Yi,t = Yi,t − Yi,t−1, i ∈ {1, . . . , 452};
(c) Construct the realized covariance matrix Σˆ ∈ R
452×452 by,
Σˆ
i,j =
1
1257
1257
X
τ=1
∆Yi,τ∆Yj,τ ;
(d) Compute the eigenvalues (and eigenvectors) of Σ and store them in a descending order ˆ
by {λˆ
k, k = 1, . . . , p}.
(e) Horn’s Parallel Analysis: the following procedure describes a so-called Parallel Analysis
of PCA using random permutations on data. Given the matrix [∆Yi,t], apply random
permutations πi
: {1, . . . , t} → {1, . . . , t} on each of its rows: ∆Y˜
i,πi(j)
such that
[∆Y˜
π(i),t] =
∆Y1,1 ∆Y1,2 ∆Y1,3 . . . ∆Y1,t
∆Y2,π2(1) ∆Y2,π2(2) ∆Y2,π2(3) . . . ∆Y2,π2(t)
∆Y3,π3(1) ∆Y3,π3(2) ∆Y3,π3(3) . . . ∆Y3,π3(t)
. . . . . . . . . . . . . . .
∆Yn,πn(1) ∆Yn,πn(2) ∆Yn,πn(3) . . . ∆Yn,πn(t)
.
Define Σ = ˜ 1
t ∆Y˜ · ∆Y˜ T as the null covariance matrix. Repeat this for R times and
compute the eigenvalues of Σ˜
r for each 1 ≤ r ≤ R. Evaluate the p-value for each
estimated eigenvalue λˆ
k by (Nk+1)/(R+1) where Nk is the counts that λˆ
k is less than the
k-th largest eigenvalue of Σ˜
r over 1 ≤ r ≤ R. Eigenvalues with small p-values indicate
that they are less likely arising from the spectrum of a randomly permuted matrix
and thus considered to be signal. Draw your own conclusion with your observations
and analysis on this data. A reference is: Buja and Eyuboglu, ”Remarks on Parallel
Analysis”, Multivariate Behavioral Research, 27(4): 509-540, 1992.
3. *Finite rank perturbations of random symmetric matrices: Wigner’s semi-circle law (proved
by Eugene Wigner in 1951) concerns the limiting distribution of the eigenvalues of random
symmetric matrices. It states, for example, that the limiting eigenvalue distribution of n × n
symmetric matrices whose entries wij on and above the diagonal (i ≤ j) are i.i.d Gaussians
N (0,
1
4n
) (and the entries below the diagonal are determined by symmetrization, i.e., wji =
wij ) is the semi-circle:
p(t) = 2
π
p
1 − t
2, −1 ≤ t ≤ 1,
where the distribution is supported in the interval [−1, 1].
Homework 2. Random Matrix Theory and PCA 3
(a) Confirm Wigner’s semi-circle law using MATLAB, Python, or R simulations (take, e.g.,
n = 400).
(b) Find the largest eigenvalue of a rank-1 perturbation of a Wigner matrix. That is, find
the largest eigenvalue of the matrix
W + λ0uuT
,
where W is an n × n random symmetric matrix as above, and u is some deterministic
unit-norm vector. Determine the value of λ0 for which a phase transition occurs. What
is the correlation between the top eigenvector of W + λ0uuT and the vector u as a
function of λ0? Use techniques similar to the ones we used in class for analyzing finite
rank perturbations of sample covariance matrices.
[Some Hints about homework] For Wigner Matrix W = [wij ]n×n, wij = wji, wij ∼ N(0, √σ
n
),
the answer is
eigenvalue is λ = R +
1
R
eigenvector satisfies (u
T vˆ)
2 = 1 −
1
R2