$30
AMS 578
Regression Analysis
Multiple Regression Computing Project
Introduction
The final report is due on Thursday, May 7, 2020, the last day of class. This
project is worth up to 150 points. A preliminary report on the data is due on Thursday,
April 7. The data for the project is in three separate files. Each file name ends with four
numeric characters. Your files are the ones whose last four digits are the same as the last
four digits of your Stony Brook ID Number. Each student must analyze the correct data
set. Failure to use the correct dataset will lead to a grade of zero.
One file contains the patient identifier and the dependent variable value. The
second file contains the patient identifier and values of six environment variables called
E1 to E6. The third file contains the patient identifier and the twenty independent
indicator variables called G1 to G20. The records may not be in correct order in each file,
and cases may be missing in one or more of the files. You can process the data with
VMLOOKUP or other data merging software.
Preliminary Report
Your preliminary report (due April 7) should contain summary statistics on each
of your variables. These summary statistics for a variable before imputation should
include at least the number of observations for that variable, the mean, median, standard
deviation, lower quartile point, upper quartile point, minimum, maximum, and the
number of missing values. The report should include your choice of methodology for
dealing with missing data. You may not use listwise deletion, mean imputation, median
imputation (or any other related technique). You may not delete “outliers.”
Background
The class blackboard has a pdf file of a paper by Caspi et al. that reports a finding
of a gene-environment interaction. This paper used multiple regression techniques as the
methodology for its findings. You should read it for background, as it is the genesis of the
models that you will be given. The data that you are analyzing is synthetic. That is, the
TA used a model to generate the data. Your task is to find the model that the TA used for
your data. For example, one possible model is
2
1 2 8 4 5 6 15 20 (500 5 25 50 100 2 ) Yi
= + E i + G i + E iG i + G iG iG iG i + Zi
.
The class blackboard also contains a paper by Risch et al. that uses a larger
collection of data to assess the findings in Caspi et al. These researchers confirmed that
Caspi et al. calculated their results correctly but that no other dataset had the relation
reported in Caspi et al. That is, Caspi et al. seem to have reported a false positive (Type I
error). The class blackboard contains a recent paper about the genetics of mental illness
and a technical appendix giving the specifics. Together these papers are an example of
the response of the research community to studying the genetics of mental illness, which
is a notoriously difficult research area.
Final Report
Your report should be in standard scientific report format and should be less than
2,500 words. It should contain an introduction, methods section, results section, and a
section with conclusions and discussion. You may add whatever other material you wish
in a technical appendix. The introduction should contain the statement of your problem
(namely estimating the function that the TA used to generate your data). It should discuss
the context of finding GxE interactions, as given by Caspi et al. and others. The methods
section should discuss how you performed your statistical calculations, what independent
variables that you considered, and other methodological issues such as how you chose the
model validation settings and what your model validation procedure was. The results
section should contain an objective statement of your findings. That is, it should contain
the statement of the model that your group proposes for the data, the analysis of variance
table for this model, and other key summary results. The discussion and conclusion
section should include the limitations of your procedures. The class blackboard has an
editorial (by Cummings) that discusses reporting statistical information. The report that
your group submits should be no more than 2500 words with no more than 3 tables and 2
figures. It should include references (which do not count in the 2500 words). The report
may have a technical appendix. It should include your computer programs or describe
your procedures for computation. Your group should include whatever additional
material it feels is necessary to report your results. There are no length restrictions on the
appendix. A submission of only computer output without a report is not sufficient and
will receive a grade of zero. dummy
Analyses that report an incorrect number of observations will also receive a grade
of zero.
Guidelines for analysis
The first task for this problem is to use the statistical package of your choice to
find the correlations between the independent variables and the dependent variable.
Transformations of variables may be necessary. The Box-Cox transformation may find
potentially nonlinear transformations of a dependent variable. After selecting the
transformations of the dependent variable, use model building methods such as stepwise
regression to select the important independent variables. The TA will use at most fourway interactions of the independent variables (that is, terms like
E1E2G2G17
or
G3G4G10G19
) in generating your data. There may also be non-linear environmental
variables, such as
2 E3
or
0.5 E4
.
Hints
Remember to consider multiple testing issues. The p-value for the variables that
you select should be much smaller than 0.01. Remember that you have 6 environmental
variables, 20 genes, 120 gene-environment variables, 190 gene-gene interaction variables,
and so on.
End of Project Assignment