Stat 437 Project 2
Your Name (Your student ID)
General rules and information
Due by 11:59 PM, April 30, 2021. You must show your work in order to get points. Please prepare
your report according to the rubrics on projects that are given in the syllabus. If a project report
contains only code and its output and the project has a total of 100 points, a maximum of 25
points can be taken off. Please note that you need to submit the code that you used for
your data analysis. Your report can be in .doc, .docx, .html or .pdf format.
The project will assess your skills in support vector machines and dimension reduction. You will use
the visualization techniques you have learnt to illustrate your findings. This project gives
you more freedom to use your knowledge and skills in data analysis.
Task A: Analysis of gene expression data
For this task, you need to use PCA and Sparse PCA.
Data set and its description
Please download the data set “TCGA-PANCAN-HiSeq-801x20531.tar.gz” from the website
https://archive.ics.uci.edu/ml/machine-learning-databases/00401/. A brief description of the data set is
given at https://archive.ics.uci.edu/ml/datasets/gene+expression+cancer+RNA-Seq.
You need to decompress the data file since it is a .tar.gz file. Once uncompressed, the data files
are “labels.csv” that contains the cancer type for each sample, and “data.csv” that contains the
“gene expression profile” (i.e., expression measurements of a set of genes) for each sample. Here each
sample is for a subject and is stored in a row of “data.csv”. In fact, the data set contains the gene
expression profiles for 801 subjects, each with a cancer type, where each gene expression profile
contains the gene expressions for the same set of 20531 genes. The cancer types are: “BRCA”,
“COAD”, “KIRC”, “LUAD” and “PRAD”. In both files “labels.csv” and “data.csv”, each row name
records which sample a label or observation is for.
Data processing
Please use set.seed(123) for random sampling via the command sample.
• Filter out genes (from “data.csv”) whose expressions are zero for at least 300 subjects, and
save the filtered data as R object “gexp2”.
• Use the command sample to randomly select 1000 genes and their expressions from “gexp2”,
and save the resulting data as R object “gexp3”.
• Use the command scale to standardize the gene expressions for each gene in “gexp3”. Save
the standardized data as R object “stdgexpProj2”.
You will analyze the standardized data.
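The three processing steps above can be sketched in R as follows. This is only a minimal illustration on toy data, not a reference solution: the toy matrix, the names `zero_cutoff` and `n_keep`, and the scaled-down thresholds are assumptions made for the example, and the layout (samples in rows, genes in columns) follows the description of “data.csv”.

```r
# Toy stand-in for data.csv (assumption): rows are samples, columns are genes.
# With the real file one might instead use: gexp <- read.csv("data.csv", row.names = 1)
set.seed(1)
n_samples <- 50
n_genes <- 200
gexp <- matrix(rpois(n_samples * n_genes, lambda = 0.5), nrow = n_samples,
               dimnames = list(paste0("sample_", seq_len(n_samples)),
                               paste0("gene_", seq_len(n_genes))))

zero_cutoff <- 30  # 300 in the assignment; scaled down for the toy data
n_keep <- 20       # 1000 in the assignment; scaled down for the toy data

# Drop genes whose expression is zero for at least `zero_cutoff` samples
gexp2 <- gexp[, colSums(gexp == 0) < zero_cutoff, drop = FALSE]

# Randomly select `n_keep` of the remaining genes (columns)
set.seed(123)
gexp3 <- gexp2[, sample(ncol(gexp2), n_keep), drop = FALSE]

# Standardize each gene (column) to mean 0 and standard deviation 1
stdgexpProj2 <- scale(gexp3)
dim(stdgexpProj2)
```

Since `scale` works column by column, it standardizes genes only if the genes are stored in columns, as they are here.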
Questions to answer when doing data analysis
Please also investigate and address the following when doing data analysis:
(1.a) Are there genes for which linear combinations of their expressions explain a significant proportion
of the variation of gene expressions in the data set? Note that each gene corresponds to a feature, and
a principal component computed from the data (the data version of PCA) is a linear combination of the
expression measurements for several genes.
(1.b) Ideally, a type of cancer should have its “signature”, i.e., a pattern in the gene expressions that
is specific to this cancer type. From the “labels.csv”, you will know which expression measurements
belong to which cancer type. Identify the signature of each cancer type (if any) and visualize it. For
this, you need to be creative and should try both PCA and Sparse PCA.
(1.c) There are 5 cancer types. Would 5 principal components, obtained either from PCA or Sparse
PCA, explain a dominant proportion of variability in the data set, and serve as the signatures of
the 5 cancer types? Note that the same set of genes were measured for each cancer type.
Identify patterns and low-dimensional structures
Please implement the following:
(2.a) Apply PCA, determine the number of principal components, provide visualizations of
low-dimensional structures, and report your findings. Note that you need to use “labels.csv” for the task
of discovering patterns, such as whether different cancer types have distinct transformed gene expressions
(that are represented by principal components). For PCA or Sparse PCA, low-dimensional structures
are usually represented by the linear space spanned by some principal components.
(2.b) Apply Sparse PCA, provide visualizations of low-dimensional structures, and report your
findings. Note that you need to use “labels.csv” for the task of discovering patterns. Your
laptop may not have sufficient computational power to implement Sparse PCA with many principal
components. So, please pick a value for the sparsity controlling parameter and a value for the
number of principal components to be computed that suit your computational capabilities.
(2.c) Do PCA and Sparse PCA reveal different low-dimensional structures for the gene expressions
for different cancer types?
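As a rough illustration of the PCA part of (2.a), the sketch below runs `prcomp` from base R on toy data with three artificial groups, computes the proportion of variance explained by each component, and plots the first two component scores colored by group. The toy data and the group labels are assumptions made for the example; Sparse PCA is not shown here because it needs an add-on package.

```r
set.seed(123)

# Toy data (assumption): 3 groups of 20 samples, 30 features;
# groups B and C are shifted in different subsets of features
groups <- factor(rep(c("A", "B", "C"), each = 20))
x <- matrix(rnorm(60 * 30), nrow = 60)
x[groups == "B", 1:5] <- x[groups == "B", 1:5] + 3
x[groups == "C", 6:10] <- x[groups == "C", 6:10] + 3

# PCA on centered and scaled features
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each principal component
pve <- pca$sdev^2 / sum(pca$sdev^2)
round(cumsum(pve)[1:5], 3)

# Scree plot: one common aid for choosing the number of components
plot(pve, type = "b", xlab = "Principal component",
     ylab = "Proportion of variance explained")

# Scores on the first two components, colored by group label
plot(pca$x[, 1:2], col = as.integer(groups), pch = 19,
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(groups),
       col = seq_along(levels(groups)), pch = 19)
```

The loadings in `pca$rotation` show which features contribute to each component, which is one possible starting point when looking for a signature.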
Task B: Analysis of SPAM emails data set
For this task, you need to use PCA and SVM.
Dataset and its description
The spam data set “SPAM.csv” is attached and can also be downloaded from
https://web.stanford.edu/~hastie/CASI_files/DATA/SPAM.html. More information on this data set can be
found at https://archive.ics.uci.edu/ml/datasets/Spambase. The column “testid” in “SPAM.csv” was used
to train a model when the data set was used by other analysts and hence should not be used as
a feature or the response. The column “spam” contains the true status for each email, and the
rest contain measurements of features. Here each email is represented by a row of features in the
.csv file, and a “feature” can be regarded as a “predictor”. Also note that the first 1813 rows, i.e.,
observations, of the data set are for spam emails, and the rest are for non-spam emails.
Data processing
Please do the following:
• Remove rows that have missing values. For a .csv file, usually a blank cell is treated as a
missing value.
• Check for highly correlated features using the absolute value of sample correlation. Think
about whether you should include all or only some of the highly correlated features in an SVM model.
For example, “crl.ave” (average length of uninterrupted sequences of capital letters), “crl.long”
(length of longest uninterrupted sequence of capital letters) and “crl.tot” (total number of
capital letters in the e-mail) may be highly correlated. Whether you choose to remove some
highly correlated features from subsequent analysis or not, you need to provide a justification
for your choice.
Note that each feature is stored in a column of the original data set and each observation in a row.
You will analyze the processed data set.
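The two processing steps for this task can be sketched as follows. The toy data frame, its column names and the 0.9 threshold are assumptions made for the example; the assignment does not prescribe a cutoff for "highly correlated".

```r
set.seed(123)

# Toy stand-in for the feature columns of SPAM.csv (assumption)
spam_toy <- data.frame(f1 = rnorm(100), f2 = rnorm(100))
spam_toy$f3 <- spam_toy$f1 + rnorm(100, sd = 0.1)  # nearly a copy of f1
spam_toy$f1[c(3, 17)] <- NA                        # two missing values

# Remove rows that contain missing values
clean <- na.omit(spam_toy)

# Absolute sample correlations; zero out the diagonal and upper triangle
# so that each pair of features is reported once
cor_abs <- abs(cor(clean))
cor_abs[upper.tri(cor_abs, diag = TRUE)] <- 0

threshold <- 0.9  # illustrative cutoff (assumption)
high <- which(cor_abs > threshold, arr.ind = TRUE)
data.frame(feature_1 = rownames(cor_abs)[high[, "row"]],
           feature_2 = colnames(cor_abs)[high[, "col"]],
           abs_cor = cor_abs[high])
```

Whether to drop one feature from each reported pair remains a modelling decision that the report has to justify.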
Classification via SVM
Please do the following:
(3.a) Use set.seed(123) wherever the command sample is used or cross-validation is implemented.
Randomly select without replacement 300 observations from the data set and save them as training
set “train.RData”, and then randomly select without replacement 100 observations from the
remaining observations and save them as test set “test.RData”. You need to check if the training set
contains observations from both classes; otherwise, no model can be trained.
(3.b) Apply PCA to the training data “train.RData” and see if you find any pattern that can be
used to approximately tell a spam email from a non-spam email.
(3.c) Use “train.RData” to build an SVM model with linear kernel, whose cost parameter is
determined by 10-fold cross-validation, for which the features are predictors, the status of email is
the response, and cost ranges in c(0.01,0.1,1,5,10,50). Apply the obtained optimal model to
“test.RData”, and report via a 2-by-2 table on spams that are classified as spams or non-spams and
on non-spams that are classified as non-spams or spams.
(3.d) Use “train.RData” to build an SVM model with radial kernel, whose “cost” parameter is
determined by 10-fold cross-validation, for which the features are predictors, the status of email is
the response, cost ranges in c(0.01,0.1,1,5,10,50), and gamma ranges in c(0.5,1,2,3,4). Report the
number of support vectors. Apply the obtained optimal model to “test.RData”, and report via a
2-by-2 table on spams that are classified as spams or non-spams and on non-spams that are classified
as non-spams or spams.
(3.e) Compare and comment on the classification results obtained by (3.c) and (3.d).
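A minimal sketch of the workflow in (3.a) and (3.c) on toy data is given below. It assumes that the add-on package e1071 is installed and uses its `svm` and `tune` functions. The toy data frame and its column names are assumptions made for the example, and this is one possible way to set up the steps rather than the required one.

```r
library(e1071)  # provides svm() and tune(); assumed to be installed

set.seed(123)

# Toy two-class data (assumption): 400 observations, 2 features
n <- 400
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$spam <- factor(ifelse(toy$x1 + toy$x2 + rnorm(n, sd = 0.5) > 0,
                          "spam", "nonspam"))

# 300 training rows, then 100 test rows drawn from the remaining rows,
# both without replacement
train_id <- sample(n, 300)
test_id <- sample(setdiff(seq_len(n), train_id), 100)
train <- toy[train_id, ]
test <- toy[test_id, ]

# Check that both classes are present in the training set
table(train$spam)

# Linear-kernel SVM with cost chosen by 10-fold cross-validation
tuned <- tune(svm, spam ~ ., data = train, kernel = "linear",
              ranges = list(cost = c(0.01, 0.1, 1, 5, 10, 50)),
              tunecontrol = tune.control(cross = 10))
best <- tuned$best.model

# 2-by-2 table of true status versus predicted status on the test set
table(truth = test$spam, predicted = predict(best, newdata = test))
```

For the radial kernel in (3.d), the same call can be used with kernel = "radial" and a gamma entry added to `ranges`; the number of support vectors of a fitted model is stored in its `tot.nSV` component.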