DATA 201: Thinking with Data
Group Project: Data Analysis Pipeline
Goal
This group project helps you learn about the whole data analysis pipeline. You must form a
group by March 20, 2021. All group members must be enrolled in the same lab. If you do not
have a group by that time, your TA will put you in a group. There are three parts in the project.
Technology
Same as Assignments 1-4
Submit
Submit as PDF files in D2L->Assessments->Dropbox.
Contributions
Everyone in the same group will receive the same mark by default. However, if certain
members of a group do not contribute as much as they should have, their marks will be
adjusted according to how much they contributed to the project.
Part 1. Topic Approval (5 marks)
As a group, you need to decide the topic and main goal of your project. It should be something
that interests you, so you will have fun working on the project and learn something about the
area. For example, a topic can be climate change and the main goal can be finding out which
countries are most affected by climate change.
You should plan ahead and think about where to gather data and the size of your dataset.
Public data, including government records, health data, census data, and environmental data,
are good candidates. You are also encouraged to collect your own data via surveys. Do NOT
begin gathering data before receiving approval from your TA.
Deliverables
Submit one PDF document that succinctly states the following points (one or two sentences
would be enough):
1. Topic
2. Main goal
3. Method to obtain data
Your TA will let you know whether your topic is approved. Note that you can change your topic
after the due date. However, you will need your TA’s approval again. You will receive all 5 marks
once your topic is approved, provided that you submit the above points on time.
Part 2. Written Report (165 marks)
You will need to obtain data and transform/clean the data so that the dataset will be ready for
analysis; create a set of analysis questions that help you solve the main problem; and answer
the analysis questions with appropriate visualizations.
Deliverables
You need to submit a written report that includes Sections A, B, C, D, E, F, and G in one PDF file
via D2L.
There is no hard limit for how long your report should be, but you should provide a clear and
comprehensive discussion of your analysis and findings. You should present your analysis
succinctly, in a way that an outsider can understand. Your job is to convince us that you know
how to work with the dataset and have a clear understanding of each step in the data
analysis pipeline.
Section A: Introduction (20 marks)
Write an introduction to your project. Tell us what your topic and main goal/problem are and
explain why you picked them. Provide the detailed background information we need to know in
order to understand your project. Aim for around one page for this section.
Section B: Obtaining Data (10 or 25 marks)
Obtain a dataset that can help you solve the main problem or gain an understanding of the
specific topic (from Part 1). You can choose either Option 1 (download a dataset) or Option 2
(deploy a survey to obtain your data). The data you collect must have at least 50 rows and 10
attributes. In addition, your dataset must be different from other datasets you used in any
other submissions in the course.
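If your dataset is stored as a CSV file, a quick check like the one below can confirm the size requirement before you submit. This is only a minimal sketch in Python: the function name check_size and the generated sample data are hypothetical, and it assumes the first line of the file is a header row of attribute names.

```python
import csv
import io


def check_size(csv_text, min_rows=50, min_attributes=10):
    """Return (row_count, attribute_count, meets_requirement) for CSV text.

    Assumes the first line is a header row naming the attributes.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Count only non-empty data rows; the header is not a data row.
    row_count = sum(1 for row in reader if any(cell.strip() for cell in row))
    attribute_count = len(header)
    meets = row_count >= min_rows and attribute_count >= min_attributes
    return row_count, attribute_count, meets


# Hypothetical sample: 10 attributes and 60 data rows, generated for illustration.
header = ",".join(f"attr{i}" for i in range(1, 11))
rows = "\n".join(",".join(str(i * j) for j in range(1, 11)) for i in range(60))
result = check_size(header + "\n" + rows)

# A dataset that is too small fails the check.
too_small = check_size("a,b\n1,2\n")
```

To check a real file, read its contents into a string first and pass that string to the function.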
Option 1 — download a dataset (10 marks)
You must include:
• The reasons for downloading the particular dataset as opposed to other datasets
• Explanation of how the dataset can help you solve the main problem
• A table of your dataset. If you have more than 50 rows of data, you only need to include
the first 50 rows. If you have more than 10 attributes, you only need to include the 10
most relevant attributes. All content must be legible.
• A link where you downloaded the data so that we can verify it
Option 2 — deploy a survey to collect data (25 marks)
Note that the project is marked out of 200, a total that counts Section B at the 10 marks of
Option 1. Because Option 2 is worth 25 marks, you could earn up to 15 extra marks, for a
possible 215 out of 200 for your project. The extra marks will go towards your overall grade.
You must include:
• Explanation of how the data collected can help you solve the main problem
• A table of your dataset. If you have more than 50 rows of data, you only need to include
the first 50 rows. If you have more than 10 attributes, you only need to include the 10
most relevant attributes. All content must be legible.
• Your recruitment message -- quality of your message matters
• The consent form
• Screenshots of the whole survey. We need to see what the participants see. This will be
marked based on the quality of your survey.
• Descriptions of your target participants
Important: If you deploy a survey, you must follow the points below. All of these points are part
of the ethics protocol and therefore must be followed. If any of the points below are not followed,
we will NOT mark this section of your project (Section B) and you will NOT receive any marks
for this section.
1. Use Qualtrics as the survey tool.
2. Use a consent form. You can download the template of the form on D2L. You will need
to customize it with your information (the highlighted parts in the template should be
customized). Do not modify the rest of the consent form.
3. Your recruitment message should contain a link to the consent form or include the form
in the recruitment message.
4. If the participant consents to participate in the survey, they can click a link in the consent
form that directs them to the actual survey; otherwise, they are directed somewhere
else.
5. You must not explicitly collect identifying information in the survey; you must
not collect individual names or contact information.
6. You must store survey data on password-protected drives.
7. You must delete the survey data after the grade appeal deadline has passed.
8. You must not share the survey data with anyone other than your group members, TAs,
and instructor.
Make sure you use your dataset ethically and do not violate any copyright laws.
Section C: Analysis Questions (10 marks)
Create 5 analysis questions (phrased as questions) that you will answer. By answering
these questions, you should have a better understanding of your main goal. For example, to
address which countries are most affected by climate change, an analysis question could be
“Which countries had the greatest increase in temperature?”
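A well-formed analysis question can be answered directly from your data. As a rough illustration, the Python sketch below answers the example question by ranking countries by temperature increase. The function name greatest_increase, the country labels, and all temperature values are made up for illustration; they are not real measurements.

```python
def greatest_increase(records, top_n=3):
    """Rank countries by temperature increase between two periods, largest first.

    Each record is (country, earlier_average_c, later_average_c).
    """
    increases = [
        (country, round(later - earlier, 2))
        for country, earlier, later in records
    ]
    return sorted(increases, key=lambda pair: pair[1], reverse=True)[:top_n]


# Made-up values, for illustration only.
records = [
    ("Country A", 10.0, 11.5),
    ("Country B", 25.0, 25.4),
    ("Country C", -2.0, 0.5),
    ("Country D", 18.0, 18.9),
]
ranking = greatest_increase(records, top_n=2)
```

If you cannot imagine a computation or visualization like this for one of your questions, the question may be too vague to answer with your dataset.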
Section D: Cleaning Data (25 marks)
Perform any data cleaning necessary — reformatting data, removing duplicates, removing
erroneous data points, etc. You can use any programs of your choice to clean the dataset.
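For groups that choose to clean their data with code, the three kinds of cleaning named above could look roughly like the following Python sketch. It is one possible approach, not a required method. The function name clean_rows, the column names country and temp_change_c, and the sample values are all hypothetical.

```python
def clean_rows(rows):
    """Return cleaned copies of rows: reformatted, de-duplicated, and
    with non-numeric temperature values removed."""
    cleaned = []
    seen = set()
    for row in rows:
        # Reformat: trim stray whitespace and use consistent capitalization.
        country = row["country"].strip().title()
        raw_change = row["temp_change_c"].strip()
        # Remove erroneous data points: skip values that are not numbers.
        try:
            change = float(raw_change)
        except ValueError:
            continue
        # Remove duplicates: keep only the first occurrence of each record.
        key = (country, change)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"country": country, "temp_change_c": change})
    return cleaned


# Made-up dirty data, for illustration only.
dirty = [
    {"country": " canada ", "temp_change_c": "1.7"},
    {"country": "Canada", "temp_change_c": "1.7"},  # duplicate after reformatting
    {"country": "BRAZIL", "temp_change_c": "n/a"},  # not a number
    {"country": "kenya", "temp_change_c": " 1.1"},
]
clean = clean_rows(dirty)
```

Whatever tool you use, keep the original dirty data so that you can take the before and after screenshots required below.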
You should indicate which attributes are dirty and which are clean. For each dirty attribute, you
must include:
• Explanation of why you think the data needs to be cleaned (e.g., How does it help you in
the later stages of the data analysis pipeline? How can it help you answer the analysis
questions? etc.)
• A screenshot showing the dirty data (before cleaning)
• A screenshot showing the clean data (after cleaning)
For the clean attributes, you must include:
• Explanation of why you think they are clean
The screenshots must be legible and clearly show the differences between dirty and clean data.
If you think your data is clean to start with and does not need any cleaning, you must:
• Clearly explain why you think the data is clean
• Describe any procedures that you followed to make sure you obtained clean data
Section E: Visualizations and Findings (100 marks)
You need to analyze your dataset with the help of digital visualizations. For each question, you
must provide:
• A proper visualization created with a visualization tool (e.g., Excel or Tableau). You
are allowed to use visualization tools other than Excel or Tableau; however, we will
not be able to help you if you run into problems with other tools. Note: programs such
as Paint and Inkscape are digital drawing tools; they are not visualization tools.
• The reasons why you chose the specific visualization to help you answer the question
• A discussion of your findings based on the visualization (You can use extra images,
diagrams, annotations to help you explain your answer.)
Section F: Contributions (Maximum of 10 marks can be deducted if this section is incomplete)
List the work done for this project by each group member. You must include all group members
here.
This section is not your formal peer evaluation, but information in this section can be used to
support your claims in your evaluation. A formal peer evaluation can optionally be submitted
separately. Instructions on how to submit it will be provided in a separate document on D2L.
Section G: Reflection (Maximum of 10 marks can be deducted for individual students if this
section is incomplete. Possible mark deduction for individual students would not affect the
group)
This section will be evaluated on an individual basis.
Each group member must work on this section on their own. Reflect on what you learned from
assignment 2 (obtaining data), 3 (cleaning data), and 4 (digital visualization). Each member
must answer the questions below:
• Obtaining data
o What could be improved from your assignment 2 submission?
o What did you do better in this project compared to your assignment 2?
• Cleaning data
o What could be improved from your assignment 3 submission?
o What did you do better in this project compared to your assignment 3?
• Digital visualization
o What could be improved from your assignment 4 submission?
o What did you do better in this project compared to your assignment 4?
You should present all answers from one individual student and then all answers from the next
student and so on. Be sure to clearly indicate the students’ names and their answers;
otherwise, we cannot identify who has completed this section.
Note: The way you format and organize your document and answers is a form of presentation
(an important part of the whole data analysis pipeline). It shows how well you can convey your
ideas to your audience. You should make sure your submission is easy to follow and
understand. Up to 10 marks can be deducted for poor organization, formatting, and clarity of
your project.
Part 3. Presentation (30 marks)
Your group will need to give a presentation. The presentation must be between 8 and 10
minutes long.
Note that you may not have finished all sections of your project at the time of presentation and
that is fine. The presentation can be about the parts that you have done and what you are
planning to do. The goal of the presentation is to let your classmates know what your project
is about and to show the instructor your progress. You will need to decide what to include
in your presentation. It will be marked based on the content, the tools you use, and your
presentation skills.
Rubrics
For Each Written Question
80% – 100%: Answers are excellent or only have minor mistakes. Detailed explanations are
provided. Analysis (if applicable or required) is clear and well-expressed. Adequate
visual aids are included if applicable.
60% – 79%: Answers are clear but with obvious mistakes. A decent job overall.
40% – 59%: Missing some important parts of the answers.
1% – 39%: Sloppy or incomplete.
0%: No answer.
You can find the mark breakdown on D2L (DATA201W21Project-Feedback.xlsx).