COMP3430 / COMP8430 – Data Wrangling –
Assignment 1
Worth 10% of the final grade for COMP3430 / COMP8430
Overview and Objectives
This assignment covers the topics of data quality, data exploration, and data profiling as presented in the first few weeks of
the course. It also includes questions about what data wrangling is, why it is important, and how it fits into the broader field
of data analytics. One task refers to the required readings from week 1 of the course while others ask you about practical
aspects of data exploration.
Important
The answers to this assignment must be submitted online in Wattle; see the link Assignment 1 Submission in week
6 (30 August to 3 September).
Follow instructions given for maximum text length in free format answers. If your answer is too long it will attract a
penalty (for details see the individual questions below and the corresponding answer submission forms in Wattle).
You can edit your answers many times and they will be saved by Wattle.
Make sure you submit the final version of your assignment answers before the submission deadline.
Note that Wattle does not allow us to access any earlier edited versions of your answers, so check very
carefully what you submit as the final version!
IMPORTANT: You can only submit your assignment once!
Make sure you do not forget to submit your assignment!
Penalties
Textual questions have maximum line and maximum word limits. If you exceed these limits, we will apply an
over-word-limit penalty. For details of the limits, see the individual questions below and the corresponding
pages in the assignment submission in Wattle.
Deadlines, Extensions and Late Submissions
The assignment is due 11:55 pm on Friday 3 September 2021.
Students will only be granted an extension on the submission deadline in extenuating circumstances, as defined by ANU
policy (http://www.anu.edu.au/students/program-administration/assessments-exams/deferred-examinations).
If you think you have grounds for an extension, you must notify the course convener as soon as possible and
provide written evidence in support of your case (such as a medical certificate). The course convener will then decide
whether to grant an extension and inform you as soon as practical.
In accordance with the CECS and ANU late submission policy, no late submissions will be accepted, except where an
extension has been approved by the course convener.
Assignment Structure
The assignment consists of four (4) tasks as described below which can be worth different numbers of marks. Make sure you
answer all aspects of each task.
If you have any questions on the assignment, please post them on Wattle. However, do not post any partial solutions,
program code, URLs, etc., or any hints on how to solve any of the assignment tasks.
Marking
This assignment will be marked out of 10, and it will contribute 10% of your final course mark.
Note that not all tasks and questions are equally difficult. For some of the tasks there is no single right or wrong answer.
Marks will be awarded based on your reasoning and the justification of your decisions and explanations, as well as clarity
and correctness of writing.
We will endeavour to release your marks and feedback within two teaching weeks after the submission deadline. If you feel
we have made an error in marking, you have two weeks following the release of marks to raise any issues with the course
convener, after which time your mark will be considered final. If you request that we re-mark your assignment, we
will re-mark the entire assignment and your mark may go up or down as a result.
Data Set Generation for this Assignment and for Assignment 2
For this assignment and the upcoming Assignment 2 each of you will work on an individual data set that will be based on a
master data set we will provide, and a data generation program we will also provide.
Note that we have generated the master data set based on real data (such as lookup tables of names, addresses, etc.), and
we have then corrupted and modified certain aspects of that data set. We have intentionally tried to include the types of
relationships, features, errors, and other data quality issues that you might find in real data sets. Any similarity to real
persons or places is entirely coincidental.
Download the master data set from Wattle (to be made available in week 2) named dw assignment master.csv.gz, and
the data generation program named generate-student-dataset.py. Copy both these files into one folder / directory, and
run the code using Python 3 in the following way:
python3 generate-student-dataset.py your ANU ID dw assignment master.csv.gz
The program will generate an output data set named data wrangling medical 2021 your ANU ID.csv.gz, and print
some output which contains the following important lines (for the example ANU ID u1234567):
>>> python3 generate-student-dataset.py u1234567 dw assignment master.csv.gz
Your student data set for the data wrangling 2021 assignments has been generated and written into file:
data wrangling medical 2021 u1234567.csv
Your ANU ID check code is: d76225bc
Your student data set check code is: 216b3fef9401
*** Check this pair of numbers is in the list provided on Wattle, if not contact the course convenor.
Important
Write down your two check codes because you must provide them with the assignment submission. This
will allow us to validate that you have generated and used the correct data set.
Check that the pair of check codes you get (such as d76225bc and 216b3fef9401 in the example above) is in the list of
check codes we will provide on Wattle (in week 2 under the assignment 1 document). This will allow you to check
that you have generated the correct data set.
You must use your individual generated data set for task 4 of this assignment (and the tasks on data
cleaning in Assignment 2).
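The generated data set is a gzip-compressed CSV file, so it can be read without unpacking it first. The following is a minimal sketch using only the Python standard library; it is one possible approach, not a required one. The helper names read_gzipped_csv and completeness, the sample file sample.csv.gz, and its sample values are hypothetical and only serve the illustration. The sketch assumes that a missing value appears as an empty field; your data set may encode missing values differently, so check this during exploration.

```python
import csv
import gzip
import os
import tempfile


def read_gzipped_csv(path):
    """Read a gzip-compressed CSV file into a list of dictionaries (one per row)."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def completeness(rows, attribute, missing=("",)):
    """Fraction of rows with a non-missing value for the given attribute.

    Assumption: a value is missing if, after stripping whitespace, it is
    one of the strings in `missing` (by default only the empty string).
    """
    if not rows:
        return None
    present = sum(
        1 for row in rows if (row.get(attribute) or "").strip() not in missing
    )
    return present / len(rows)


# Tiny made-up sample (not taken from the assignment data set).
sample = "postcode,phone\n2600,0400 000 000\n,\n2601,\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv.gz")
    with gzip.open(path, "wt", newline="", encoding="utf-8") as out:
        out.write(sample)
    rows = read_gzipped_csv(path)

for name in ("postcode", "phone"):
    print(name, completeness(rows, name))
```

To use this on your own data, pass the path of your generated file to read_gzipped_csv and extend the missing tuple with any other missing-value markers you discover.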
Assignment Tasks
Task 1 (2 marks):
According to the paper (from week 1) by Rahm and Do (Data Cleaning: Problems and Current Approaches), data cleaning
generally deals with detecting and removing errors and inconsistencies from data to improve the quality of data. As
mentioned in this paper, there are many issues and problems related to data cleaning.
Answer the following two questions each in 10 or fewer lines of text (a maximum of 250 words each), where one text entry
will be provided in Wattle per question.
(1) Do you think the problems and issues related to data cleaning raised in this paper (in the year 2000) are still relevant
today? Justify why or why not.
(2) Imagine you are hired by the Australian Federal Department of Health as a data wrangler to deal with incoming
data sets about COVID-19 cases (details of patients who were diagnosed with the virus) from the eight Australian states
and territories. Your task is to clean and integrate these data sets to support the decision making by the Australian
government.
Briefly describe three (3) data wrangling aspects you will have to consider when dealing with such data sets.
Task 2 (1 mark): The following is a list L of age values (in years) of a group of people:
L = [74, 14, 20, 32, 42, 55, 91, 56, 84, 42, 13, 7]
First, split your ANU ID (excluding the first character ‘u’) into four number segments (three pairs and a single number)
and then add these four number segments to L. For example if your ANU ID is u1204067 then split it into: 12, 04, 06, 7
and add these numbers to L, so the final list becomes: L = [74, 14, 20, 32, 42, 55, 91, 56, 84, 42, 13, 7, 12, 4, 6, 7].
Now calculate and enter into the corresponding answer fields on Wattle:
1. the mean and standard deviation of L,
2. the median and median absolute deviation of L, and
3. the mode of L.
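The list construction and the requested statistics can be sketched with the Python statistics module, using the example ANU ID u1204067 from above. This is an illustration under stated assumptions, not a reference solution: the helper names split_anu_id and median_absolute_deviation are hypothetical, the median absolute deviation is computed as the unscaled median of the absolute deviations from the median, and both the population and the sample standard deviation are shown because the task text does not say which one is expected (check the lectures and the Wattle answer fields).

```python
import statistics


def split_anu_id(anu_id):
    """Split e.g. 'u1204067' into three pairs and a single digit: [12, 4, 6, 7]."""
    digits = anu_id[1:]  # drop the leading 'u'
    return [
        int(digits[0:2]),
        int(digits[2:4]),
        int(digits[4:6]),
        int(digits[6:]),
    ]


def median_absolute_deviation(values):
    """Unscaled MAD: the median of |x - median(values)| over all x."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])


base = [74, 14, 20, 32, 42, 55, 91, 56, 84, 42, 13, 7]
L = base + split_anu_id("u1204067")

mean = statistics.mean(L)
pop_std = statistics.pstdev(L)    # population standard deviation (divides by n)
sample_std = statistics.stdev(L)  # sample standard deviation (divides by n - 1)
median = statistics.median(L)
mad = median_absolute_deviation(L)
modes = statistics.multimode(L)   # all most frequent values; there may be several

print("mean:", mean)
print("standard deviation (population / sample):", pop_std, "/", sample_std)
print("median:", median, "MAD:", mad)
print("mode(s):", modes)
```

Note that statistics.multimode returns every value that ties for the highest frequency, which matters when a list has more than one mode.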
Task 3 (2 marks): Apply binning as covered in the lectures to the numbers in the list L as generated in the previous
task (i.e. L including the number segments based on your ANU ID appended).
Calculate and enter into the corresponding answer fields on Wattle the results when binning L using:
1. equal depth with two bins and smoothed by bin median,
2. equal width with three bins and smoothed by bin mean,
3. equal width with four bins and smoothed by bin boundaries, and
4. equal depth with four bins and smoothed by bin boundaries.
Clearly show the bins you generated when you enter your answers into Wattle answer fields by showing one bin per line,
for example (assume we have binned [1,2,3,4,5,6,7,8,9] into three bins with smoothing by bin medians):
Bin 1: [2, 2, 2]
Bin 2: [5, 5, 5]
Bin 3: [8, 8, 8]
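The binning and smoothing steps can be sketched in Python as follows, using the example list [1,2,3,4,5,6,7,8,9] from above. All function names are hypothetical, and several conventions are assumptions that may differ from the lectures: how values are distributed when the list length is not divisible by the number of bins, that equal-width bins are half-open intervals with the maximum placed in the last bin, and that a value exactly halfway between two bin boundaries is moved to the lower one. Follow the conventions given in the lectures where they differ.

```python
import statistics


def equal_depth_bins(values, k):
    """Sort the values and cut them into k bins of (nearly) equal size."""
    data = sorted(values)
    n = len(data)
    # Assumption: bin i covers the sorted positions [i*n//k, (i+1)*n//k).
    return [data[i * n // k:(i + 1) * n // k] for i in range(k)]


def equal_width_bins(values, k):
    """Split the value range [min, max] into k intervals of equal width."""
    data = sorted(values)
    low, high = data[0], data[-1]
    width = (high - low) / k
    bins = [[] for _ in range(k)]
    for v in data:
        if width == 0:
            idx = 0  # all values are identical
        else:
            # Assumption: half-open intervals; the maximum goes into the last bin.
            idx = min(int((v - low) / width), k - 1)
        bins[idx].append(v)
    return bins


def smooth_by_mean(bins):
    """Replace every value by the mean of its bin."""
    return [[statistics.mean(b)] * len(b) if b else [] for b in bins]


def smooth_by_median(bins):
    """Replace every value by the median of its bin."""
    return [[statistics.median(b)] * len(b) if b else [] for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        if not b:
            smoothed.append([])
            continue
        low, high = b[0], b[-1]  # each bin is sorted
        # Assumption: ties go to the lower boundary.
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


example = [1, 2, 3, 4, 5, 6, 7, 8, 9]
result = smooth_by_median(equal_depth_bins(example, 3))
for number, b in enumerate(result, start=1):
    print(f"Bin {number}: {b}")
```

With these assumptions, the printed bins match the three example lines shown above.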
Task 4 (5 marks):
For the last task of this assignment you must use the data set you generated as per the instructions above. We ask
you to explore this data set using tools of your choice (Rattle, R, Python, Pandas, etc.) and answer the specific questions
about this data set given below.
Make sure to follow the instructions on the individual Wattle answer fields with regard to rounding, the
number of digits to provide after the decimal point, etc.
1. Provide the missingness patterns of values (as we discussed in the labs) for the three attributes: postcode, phone,
and email.
2. Calculate the correlation between the attributes (a) BMI and age at consultation, (b) BMI and height, and (c)
state and valid marital status. In your answers you need to provide the numerical correlation value, the name
of the correlation method you used, and a brief (one sentence) explanation why you used that specific correlation
function for each pair of attributes.
3. For the following attributes, calculate numerical values for the following data quality dimensions:
(a) Completeness for postcode and phone.
(b) Validity for weight.
(c) Uniqueness for last name.
(d) Consistency between age at consultation and birth date (for valid age values).
4. Calculate the distributions of the first digits (Benford’s law) for the attributes (a) cholesterol level, (b) blood pressure
and (c) medicare number. Describe for each in one or two sentences whether it follows Benford’s law or not, and
why you think it does or does not follow this law.
5. Describe in a few sentences three other unusual characteristics you can identify in this data set using data exploration
and profiling.
You will receive up to one mark for correctly answering each of these questions, where both correct numerical values as
well as correct and clearly written justifications of your answers will be considered.
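As background for the first-digit question, Benford's law predicts that the leading digit d (1 to 9) occurs with probability log10(1 + 1/d). The following minimal sketch compares an observed first-digit distribution with that expectation. The function names are hypothetical, and the input used here (the powers of two from 2^1 to 2^200) is made-up illustration data that is known to follow the law closely; it is not taken from the assignment data set. The sketch assumes that the first digit is the first non-zero digit of a value's text representation and that values without such a digit are skipped.

```python
import math
from collections import Counter


def benford_expected():
    """Expected first-digit probabilities under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(value):
    """Return the first non-zero digit of a number or string, or None if absent."""
    for ch in str(value):
        if ch in "123456789":
            return int(ch)
    return None


def first_digit_distribution(values):
    """Relative frequency of each first digit 1..9 among the usable values."""
    digits = [d for d in map(first_digit, values) if d is not None]
    if not digits:
        return {}
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}


# Illustration data only: powers of two spread over many orders of magnitude.
powers = [2 ** n for n in range(1, 201)]
observed = first_digit_distribution(powers)
expected = benford_expected()

for d in range(1, 10):
    print(f"digit {d}: observed {observed[d]:.3f}, expected {expected[d]:.3f}")
```

Attributes whose values cover only a narrow range, or that are assigned identifiers rather than measurements, often deviate from this expected distribution, which is worth keeping in mind when you interpret your own results.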
Other Aspects
For all textual answers in this assignment, English writing mistakes and typographical errors will attract small penalties.