COMP 309 — Machine Learning Tools and Techniques
Assignment 3: Kaggle Competition

1 Objectives
The goal of this assignment is to help you tie together all the concepts you have learnt in the first half of this course
in the lectures and assignments. To aid you in completing this assignment, you should review the major aspects of
the course that have been explored so far, such as:
• Data understanding, cleansing, and pre-processing,
• Machine learning concepts,
• CRISP-DM and pipelines in general,
• Feature manipulation, including feature selection, feature construction and imputation,
• Statistical design and analysis of results.
These topics are (to be) covered in lectures 01–12. You are encouraged to research online resources for AI; the
rabbit-hole [1] provides useful jumping-off points for further exploration.
2 Question Description
“The footprints of past travellers have marked out a network of scenic trails that attract trampers from around the
world.” [https://teara.govt.nz/en/walking-tracks]
Trampers often use guides to walking times on these tracks. “If you’ve spent any amount of time in New Zealand’s
conservation estate track network, it’s likely you’ve experienced inconsistencies in the accuracy of [walking] times ...
the estimated times aren’t pulled out of thin air so much as the mish-mash variety of different methods that various
conservancies use.” [http://www.windy.gen.nz/index.php/archives/725]. As a COMP309 student, you have been
tasked with showcasing how ML tools can be used to see if there are relationships between the data and estimated
times.
The overall aim of this assignment is to develop the best possible machine learning system to predict the completion times of given tracks. The hope is that officials will be able to use your model to better understand the
factors behind track time estimation.
We have set up a Kaggle InClass Competition [2] to facilitate finding the best machine learning system for officials to
use. You will be expected to analyse the provided data, design and improve your own machine learning pipeline, and
consider the consequences of applying your pipeline to this data.
Note that the data is real. You could therefore get a stopwatch and walk every track to estimate the times, or find the
original dataset and create a look-up table. Neither of these extremes is permitted, as they miss the point of the course.
We want to see a model built from an understanding of the patterns in the data, rather than the times themselves.
[1] https://ecs.victoria.ac.nz/Courses/COMP309_2019T2/RabbitHole
[2] https://www.kaggle.com/about/inclass/overview
COMP309-T2, 2019 1 Ass3
2.1 Preliminary: Accessing the Kaggle InClass Competition
To access the class competitions, you must use the URLs below. Please do not share them publicly, as they will allow
anybody to access our competitions, which will make the experience less enjoyable for your classmates. Deliberate
cheating is a disciplinary matter, so please don’t go there either.
CORE Competition link: https://www.kaggle.com/t/8484016fd0014e7fbfa497c90a4eba89
COMPLETION Competition link: https://www.kaggle.com/t/374ed145405f482181aa50f375e3b809
You will need to register a Kaggle account. It is perfectly fine (and expected) to use a pseudonym as your Kaggle
username so your classmates do not know your real-life identity. However, you will need to fill out the following
form so that the lecturers and tutors can link your Kaggle result to your ECS account. No other people will have
access to this information! Each time you change your Username, please update the form (it would make sense to
do this only once, but past experience suggests that students will change their Username a few times).
Please fill out the following form: https://goo.gl/forms/Hvj2AQqf6o3zRDfQ2
Please submit as part of your report.
Once you have completed the above steps, please verify that you can access the following page:
https://www.kaggle.com/c/comp309-2019-core/overview (when logged in).
Once successful, you may proceed to the rest of the assignment!
We have created two competitions: a tutorial competition (held within the Core component) and the actual competition (held within the Completion component). The tutorial competition exists to help people gain familiarity with
the Kaggle process. Please ask as many questions as you need regarding this Core process in helpdesks or on the
forum. The actual competition in the Completion component exists to test how well people have learnt, understood
and can apply the ML tool processes presented in the first half of the course. Only limited assistance will be
available for it in the helpdesks.
2.2 Core: Exploring and understanding the Kaggle process [50 marks]
We have created a pre-processed version of the tramping data, to be used in a classification competition. We
have split the data into a training set and a test set.
CORE Competition link: https://www.kaggle.com/t/8484016fd0014e7fbfa497c90a4eba89
The training set is to be used to create your model. You can use any machine learning tool, e.g. WEKA or
scikit-learn. Your model will need to be able to predict the class of future test data.
Part of the test data is to be used by Kaggle to test your model in order to create the public leaderboard. The
remaining part of the test set is used by Kaggle to test your model for the private leaderboard. You will submit
your complete answer to the test set but will not know which instances have been used for the public or private
leaderboard (this helps prevent over-fitting and gaming the system).
Your file will need to consist of two columns only: the ID and the estimated time class (0, 1 or 2). The csv should
include your predictions for all 123 instances in the test set, plus a header line (124 lines). For example:
ID, Time
1, 0
2, 2
3, 1
...
123, 2
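As a rough illustration of the file format described above, the following Python sketch writes a Core submission using only the standard library. The function name write_core_submission and the placeholder predictions are hypothetical, and the exact header spelling ("ID,Time" without a space) is an assumption that you should check against the sample submission on the Kaggle Data tab.

```python
import csv
import io

ALLOWED_CLASSES = {0, 1, 2}   # time classes named in this handout
EXPECTED_ROWS = 123           # number of test instances named in this handout


def write_core_submission(predictions, out):
    """Write {id: predicted_class} as a two-column CSV to a file-like object."""
    if len(predictions) != EXPECTED_ROWS:
        raise ValueError(f"expected {EXPECTED_ROWS} predictions, got {len(predictions)}")
    bad = {k: v for k, v in predictions.items() if v not in ALLOWED_CLASSES}
    if bad:
        raise ValueError(f"classes outside {sorted(ALLOWED_CLASSES)}: {bad}")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["ID", "Time"])  # assumed header; verify against Kaggle's sample file
    for instance_id in sorted(predictions):
        writer.writerow([instance_id, predictions[instance_id]])


# Placeholder predictions (not real model output): ID i gets class i % 3.
demo = {i: i % 3 for i in range(1, EXPECTED_ROWS + 1)}
buffer = io.StringIO()
write_core_submission(demo, buffer)
lines = buffer.getvalue().splitlines()
```

To write a real file, pass an open file handle instead of the in-memory buffer, e.g. open("submission.csv", "w", newline="").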
Requirements
Using any tools you find useful, you should explore and analyse the dataset. You should draw upon your previous
experiences and what you have learnt in this course to find a number of interesting patterns. You may wish to start
by examining the quality, completeness and representation of individual features.
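One possible starting point for checking completeness is to count, per feature, how many values look missing. The sketch below uses only the Python standard library; the column names in the sample and the set of tokens treated as "missing" are assumptions made for illustration, not properties of the real dataset.

```python
import csv
import io

# Assumed spellings of "missing"; adjust after looking at the real file.
MISSING_TOKENS = {"", "na", "n/a", "nan", "null", "?"}


def missing_counts(rows):
    """Return {column: number of rows whose value looks missing}."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            if column is None:  # extra cells beyond the header; ignored here
                continue
            is_missing = value is None or value.strip().lower() in MISSING_TOKENS
            counts[column] = counts.get(column, 0) + int(is_missing)
    return counts


# Tiny made-up sample; these column names are hypothetical.
sample = io.StringIO(
    "ID,length_km,difficulty\n"
    "1,4.5,easy\n"
    "2,,intermediate\n"
    "3,12.0,?\n"
    "4,NA,easy\n"
)
report = missing_counts(csv.DictReader(sample))
```

A feature with many missing values may need imputation or removal, and whichever you choose is worth discussing in your report.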
In your report (to be submitted electronically), you should spend approximately two pages describing the following
regarding the Core part:
• (25 marks) Highlight the findings of your dataset exploration. You should identify each pattern clearly using
examples, and discuss the potential consequence this may have on your results. To achieve a high mark, you
should consider more complicated patterns, such as feature interactions.
• (25 marks) Visualisation is an important aspect of this task. Please illustrate at least one important finding of
your work.
2.3 Completion: Developing and testing your machine learning system [40 marks]
It is often much more effective to first learn about the properties of a dataset (business and data understanding)
before applying machine learning to it. You should begin by familiarising yourself with the dataset by reading the
“Overview” and “Data” tabs of the Kaggle competition. Please download the dataset from the Data tab (in .csv
format). You should now spend some time examining the data and taking notes of any interesting patterns you find.
Two more datasets are provided, where merging will assist in the knowledge extraction. You will need to clean the
data. Additional datasets can be merged to assist your predictions provided they do not contain the ground-truth
answers for the test set. Now that you have some initial understanding of the tracks dataset, you should design an
initial system (model) that you consider has the potential to accurately predict the time taken to navigate a track.
You may use any ML tools you wish, but a good solution will consider a number of factors, such as: pre-processing
steps, the properties of the dataset and generalisation/over-fitting. How you split your labelled data
into training/testing/cross-validation set/s is your choice; these decisions are important and should be explained.
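To make the splitting decision concrete, here is a minimal sketch of k-fold cross-validation indices written with the Python standard library. It is one option among several (a single hold-out split is another), and the function name k_fold_indices, the choice of k=5 and the fixed seed are all illustrative assumptions.

```python
import random


def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs; each row is in exactly one validation fold."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)       # fixed seed keeps the split repeatable
    folds = [indices[i::k] for i in range(k)]  # k roughly equal-sized folds
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx


# 20 labelled rows split into 5 folds of 4 validation rows each.
splits = list(k_fold_indices(n_rows=20, k=5))
```

Averaging your error over the folds gives a steadier estimate than a single split, which matters when the labelled dataset is small.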
COMPLETION Competition link: https://www.kaggle.com/t/374ed145405f482181aa50f375e3b809
The final output of your system should be a single csv file containing two columns that represent an instance’s unique
ID and your predicted time. The csv should include your predictions for all 100 instances in the dataset, plus a
header line (101 lines). For example:
ID, Time
1, 30
2, 480
3, 90
...
100, 200
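Before uploading, it can help to read your file back and check it against the format described above. The following standard-library sketch is one way to do this; the function name check_completion_submission is hypothetical, and the rule that a predicted time must be a non-negative number is an assumption rather than a stated competition rule.

```python
import csv
import io


def check_completion_submission(text, expected_rows=100):
    """Return a list of problems found in the CSV text (empty list means none found)."""
    rows = list(csv.reader(io.StringIO(text), skipinitialspace=True))
    problems = []
    if not rows or [c.strip() for c in rows[0]] != ["ID", "Time"]:
        problems.append("header should be ID,Time")
    body = rows[1:]
    if len(body) != expected_rows:
        problems.append(f"expected {expected_rows} data rows, found {len(body)}")
    ids = [r[0] for r in body if r]
    if len(set(ids)) != len(ids):
        problems.append("duplicate IDs")
    for r in body:
        try:
            ok = len(r) == 2 and float(r[1]) >= 0  # assumed: times are non-negative numbers
        except ValueError:
            ok = False
        if not ok:
            problems.append(f"bad row: {r}")
    return problems


# Made-up file contents for illustration only.
good_text = "ID,Time\n" + "".join(f"{i},{30 * i}\n" for i in range(1, 101))
bad_text = "ID,Time\n1,30\n1,abc\n"
good_problems = check_completion_submission(good_text)
bad_problems = check_completion_submission(bad_text)
```

For a real file, pass its contents, e.g. check_completion_submission(open("submission.csv").read()).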
Once you are satisfied with your initial attempt (do not spend too long on it!), you should upload your output csv to
the submissions page of the competition: https://www.kaggle.com/c/comp309-2019-completion/submissions.
Once your submission has been processed, you will be able to see your score on the public leaderboard. You should use this feedback to further improve your system. For example, if your leaderboard performance is
much lower than on your own test set, you have probably over-fitted your model. You should use your judgment to decide how
extensively to change your system. This may only be tweaking parameters, or you may decide to try a completely
different algorithm. Note that you are limited to 4 submissions per day, but submitting this many may be to your
detriment as you may “over-train” on the public leaderboard!
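To compare your own held-out performance with the leaderboard, you need a local error measure. This handout does not state the competition metric, so the sketch below assumes two common regression measures (RMSE and MAE) and compares a model against a baseline that always predicts the mean training time. All numbers are made up for illustration.

```python
import math
from statistics import mean


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(mean((t - p) ** 2 for t, p in zip(y_true, y_pred)))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


# Made-up values, not taken from the real dataset.
train_times = [60, 120, 300, 240]
held_out_times = [30, 480, 90, 200]
model_predictions = [45, 420, 100, 230]

# Baseline: predict the mean training time for every held-out track.
baseline_predictions = [mean(train_times)] * len(held_out_times)

model_mae = mae(held_out_times, model_predictions)
model_rmse = rmse(held_out_times, model_predictions)
baseline_mae = mae(held_out_times, baseline_predictions)
```

If your model barely beats the baseline locally, or its leaderboard score is far worse than its local score, that is a sign to revisit your features or validation scheme before tuning further.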
Do not be discouraged if your performance appears low on the leaderboard: we are interested in novel/interesting
solutions even if they have lower performance, and you may come out on top on the private leaderboard anyway.
Requirements
You should refine your machine learning system a number of times (at least 3, including the initial system) based
on the performance you achieve on the public leaderboard. Your submitted report should contain up to 4 pages
regarding the Completion component:
• (10 marks) Discuss the initial design of your system, i.e. before you have submitted any predictions to the
Kaggle competition. Justify each decision you made in its design, e.g. reference insight you gained in the Core
part.
• (20 marks) Discuss the design of one or more of your intermediary systems. Justify the changes you made to
the previous design based on its performance on the leaderboard, and from any other additional investigation
you performed.
• (10 marks) Use your judgement to choose the best system you have developed; this may not necessarily
be the most accurate system on the leaderboard. Make sure you select this submission as your final
one on the competition page before the deadline. Explain why you chose this system, and note any
particularly novel/interesting parts of it. You should submit the source and executable code required to run
your chosen submission so that the tutors can verify its authenticity. We will select a subset of students to
demonstrate in person in an ECS computer laboratory, so please make sure your work does not just run on a
private desktop machine.
2.4 Challenge: Reflecting on your findings [10 marks]
Until now, we have been focusing on achieving the best performance possible, but is this all that
ML tool users should consider? [3]
The dataset has a number of features that could be used to produce a machine learning system that, while accurate,
has a number of biases towards certain population groups. Should the officials be worried that using your model in
their analyses could be harmful to society?
Requirements
You should consider the interpretability of your final chosen model from the Completion part, and analyse any ethical
concerns associated with its structure.
Your report (1-2 pages on the Challenge component) should address the following questions:
• (10 marks) How easy is it to interpret your chosen machine learning model? Discuss any ethical consequences
of how it uses the chosen features to make a prediction. For example, consider how your model could assist a
tramper who uses a wheelchair.
3 Relevant Data Files and Kaggle Information
The datasets and additional information about the Kaggle competitions can be found online:
https://www.kaggle.com/c/comp309-2019-core/
and https://www.kaggle.com/c/comp309-2019-completion/
[3] https://towardsdatascience.com/can-a-machine-be-racist-5809b18e5a91
4 Assessment
We will endeavour to mark your work and return it to you as soon as possible, hopefully within 2 weeks. Your position on
the private leaderboard will help improve the final grade for the top twenty students in the Completion competition
only, but will not be the main consideration. The tutor(s) will run a number of helpdesks to provide assistance with
the assignment and to answer any questions regarding what is required.
5 Submission Guidelines
5.1 Submission Requirements
1. Programs for all individual parts. To avoid confusion, all the individual parts should use directories Core/,
Completion/, ... and all programs should be stored in their corresponding directories. Within each directory,
please provide a readme file that specifies how to compile and run your programs on the ECS School machines.
A script file called sampleoutput.txt should also be provided to show that your programs run properly. If your
programs cannot run properly, you should provide a buglist file. Ensure you submit your chosen solution in
a form that the tutors can understand and run easily.
2. A document consisting of the report for all the individual parts. The document should mark each part
clearly. The document can be written in PDF, text or DOC format.
5.2 Submission Method
The programs and the PDF version of the document should be submitted through the web submission system from
the COMP309 course web site by the due time.
There is NO required hard copy of the documents.
KEEP a backup and receipt of submission.
Submission should be completed on School machines, i.e. problems with personal PCs, internet connections and lost
files, although eliciting sympathy, will not result in extensions for missed deadlines.
5.3 Late Penalties
The assignment must be handed in on time unless you have made a prior arrangement with the lecturer or have a
valid medical excuse (for minor illnesses it is sufficient to discuss this with the lecturer). The penalty for assignments
that are handed in late without prior arrangement is one grade reduction per day. Assignments that are more than
one week late will not be marked.
