Introduction to Machine Learning Homework #2

1 Aim
Please classify the image patches using the methods taught in class, up to Chapter
Seven of the textbook. Besides the necessary data preprocessing (scaling, normalization, etc.),
Homework #2 requires you to practice cross-validation and ensemble methods. You may apply
new methods or use new packages to improve classification performance, but if you do so,
you must briefly introduce the key concepts and provide the necessary citations rather than
simply copying, pasting, or importing code. However, you are not allowed to use any
neural-network-related models in this assignment (e.g., multilayer perceptron, CNN). If any
neural-network-related method is applied, you will receive no credit. Whenever an algorithm
package is merged or imported into your code, list the package link in your references and
describe its mathematical concepts in your report, followed by the reason for adopting it.
2 Dataset Description
The DeepSat (SAT-6) Airborne Dataset is downloaded from
https://www.kaggle.com/crawford/deepsat-sat6 [1][2]. To save storage space and speed up the
learning process, only a portion of the original dataset, labeled ‘building’, ‘grassland’, or
‘road’, is provided in this assignment. Each picture is a 28*28 pixel, 4-band (red, green,
blue, and near infrared) image. The whole dataset is saved as CSV files. The dataset format
is as follows:
• X_*.csv: 4-band (‘R’ed, ‘G’reen, ‘B’lue and near ‘I’nfrared) image data.
  - Each cell holds one pixel value, from 0 to 255, in ‘R’ed, ‘G’reen, ‘B’lue or near
    ‘I’nfrared.
  - Each row is a separate 28*28 pixel 4-band image, stored as a 1-D array in the format
    {color}_{rowIdx}_{colIdx} =
    [R_0_0, R_0_1, ..., R_27_27, G_0_0, ..., G_27_27, B_0_0, ..., B_27_27, I_0_0, ...,
    I_27_27].
  The CSV files contain no header row.
• y_*.csv: label data, where the row indexing matches that in X_*.csv. Each label is a
  1x3 one-hot encoded vector standing for ‘building’, ‘grassland’ and ‘road’.
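The row layout described above can be unpacked as in the following minimal sketch. It assumes the band-major ordering given in the format line (band, then row index, then column index) and that the one-hot columns follow the order in which the classes are listed here; the helper names `rows_to_images` and `one_hot_to_index` are hypothetical, and the demo row is synthetic rather than taken from the dataset.

```python
import numpy as np

N_BANDS, HEIGHT, WIDTH = 4, 28, 28  # R, G, B, near infrared
# Assumption: one-hot columns follow the order listed in the handout.
CLASS_NAMES = ["building", "grassland", "road"]


def rows_to_images(flat_rows):
    """Reshape (n, 3136) band-major rows into (n, 28, 28, 4) images."""
    flat_rows = np.asarray(flat_rows)
    # Each row is [R_0_0..R_27_27, G_..., B_..., I_...]: band, then row, then column.
    bands_first = flat_rows.reshape(-1, N_BANDS, HEIGHT, WIDTH)
    # Move the band axis last so images[i, r, c] is the (R, G, B, I) vector of a pixel.
    return bands_first.transpose(0, 2, 3, 1)


def one_hot_to_index(one_hot_rows):
    """Convert (n, 3) one-hot label rows into integer class indices."""
    return np.asarray(one_hot_rows).argmax(axis=1)


# Synthetic stand-in for one row of an X file: the value k sits at flat position k.
demo_row = np.arange(N_BANDS * HEIGHT * WIDTH).reshape(1, -1)
demo_images = rows_to_images(demo_row)
demo_labels = one_hot_to_index([[0, 1, 0], [0, 0, 1]])
```

With the real files, the rows could be read with something like `pandas.read_csv(path, header=None)`, since the CSV files carry no header row.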
You may refer to others’ code on Kaggle [3]. If you are interested, you may also adapt
your code or ipynb to the full dataset and submit it to Kaggle.
3 Submission Format
You have to submit a compressed file hw2_studentID.zip which contains the following
files:
1. hw2_studentID.ipynb: detailed report, Python code, results, discussion and mathematical
descriptions;
2. hw2_studentID.tplx: extra LaTeX-related settings, including the bibliography;
3. hw2_studentID.bib: citations in the “bibtex” format;
4. hw2_studentID.pdf: the PDF version of your report, exported from your ipynb
with
(a) %% jupyter nbconvert --to latex --template hw2_studentID.tplx
hw2_studentID.ipynb
(b) %% pdflatex hw2_studentID.tex
(c) %% bibtex hw2_studentID
(d) %% pdflatex hw2_studentID.tex
(e) %% pdflatex hw2_studentID.tex
5. Other files or folders, in a path hierarchy that works with your Jupyter notebook (ipynb).
4 Coding Guidelines
For the individual demonstration with the TA, you are required to create a function in
your Jupyter notebook, as specified below, that reduces the data dimensionality, learns a
classification model, and evaluates the performance of the learned model.
• PipelineModel = hw2_studentID_demo(in_x, in_label, mode)
  - in_x: [string] CSV file for ‘data’.
  - in_label: [string] None or CSV file for ‘label’, which contains the labels of the
    corresponding instances in in_x.
  - mode: [string] ‘train’ for building models; ‘test’ for using the built model to evaluate
    performance.
This function should return the best model trained with cross-validation in your program.
Also, set this pipeline model as a global variable. Please note that the HW2 demonstration
will be graded based on the final ranking of accuracy. Every demonstration must be
completed within the selected time slot.
If mode=‘train’, return a PipelineModel trained via cross-validation in your
program. If mode=‘test’, dump the results to the following files:
1. hw2_studentID_results.csv: the predicted labels, in the same format as the file
passed as in_label when mode=‘train’.
2. hw2_studentID_performance.csv: the ‘accuracy’ in ‘%’ as a “float”, without
any extra ‘string’ characters.
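One possible shape for this function is sketched below. It is only an illustration under stated assumptions: the pipeline steps (StandardScaler, PCA, logistic regression), the parameter grid, and the 3-fold setting are placeholders rather than required choices, and the label file is assumed to hold 1x3 one-hot rows as described in the dataset section.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

PipelineModel = None  # global slot for the best model, as the guidelines ask


def hw2_studentID_demo(in_x, in_label, mode):
    """Train ('train') or evaluate ('test') a pipeline on header-less CSV files."""
    global PipelineModel
    X = pd.read_csv(in_x, header=None).to_numpy(dtype=float)
    y = None
    if in_label is not None:
        # One-hot rows -> integer class indices.
        y = pd.read_csv(in_label, header=None).to_numpy().argmax(axis=1)

    if mode == "train":
        pipe = Pipeline([
            ("scale", StandardScaler()),
            ("reduce", PCA(n_components=5)),             # placeholder reducer
            ("clf", LogisticRegression(max_iter=1000)),  # placeholder classifier
        ])
        # Placeholder grid; cross-validation picks the best C.
        search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=3)
        search.fit(X, y)
        PipelineModel = search.best_estimator_
    elif mode == "test":
        pred = PipelineModel.predict(X)
        one_hot = np.eye(3, dtype=int)[pred]  # back to the 1x3 one-hot layout
        pd.DataFrame(one_hot).to_csv(
            "hw2_studentID_results.csv", header=False, index=False)
        if y is not None:
            acc = 100.0 * accuracy_score(y, pred)  # accuracy in percent
            with open("hw2_studentID_performance.csv", "w") as fh:
                fh.write(f"{acc}\n")  # a bare float, no extra characters
    else:
        raise ValueError("mode must be 'train' or 'test'")
    return PipelineModel
```

A call with mode=‘train’ followed by a call with mode=‘test’ on held-out files would then produce the two output files in the working directory.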
5 Report Requirement
• List names of packages used in your program;
• Describe the pipeline combinations in your program;
• Describe the cross-validation methods in your program;
• For better explanation, draw flowcharts of the methods or procedures used in the
program;
• Describe the mathematical concepts of any new algorithms or models employed as
well as the roles they play in your feature selection/extraction or classification task in
Markdown cells [4];
• Discuss the performance among different classifiers with/without feature selection/extraction.
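For the last item, one way to gather comparable numbers is to score each classifier with and without a feature extraction step under the same cross-validation splits. The sketch below assumes PCA as the extraction step and two arbitrary classifiers, and it runs on synthetic data; the function name `compare_with_without_reduction` is hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def compare_with_without_reduction(X, y, n_components=5, seed=0):
    """Mean cross-validated accuracy per classifier, with and without PCA."""
    # The same shuffled, stratified folds are reused for every pipeline.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    classifiers = {
        "knn": KNeighborsClassifier(n_neighbors=5),
        "tree": DecisionTreeClassifier(random_state=seed),
    }
    scores = {}
    for name, clf in classifiers.items():
        plain = make_pipeline(StandardScaler(), clf)
        reduced = make_pipeline(StandardScaler(), PCA(n_components=n_components), clf)
        scores[name] = {
            "without": cross_val_score(plain, X, y, cv=cv).mean(),
            "with_pca": cross_val_score(reduced, X, y, cv=cv).mean(),
        }
    return scores


# Synthetic three-class data standing in for the image features.
rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1, 2], 30)
X_demo = rng.normal(size=(90, 12)) + y_demo[:, None]
demo_scores = compare_with_without_reduction(X_demo, y_demo)
```

The resulting dictionary can be turned into a table in the report, one row per classifier.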
5.1 Basic Requirement
• Combine feature engineering and classifiers into pipelines [5]. The pipeline combinations
in your program should cover at least 3 different feature engineering methods and 6
different classifiers, including bagging and AdaBoost, so your program will contain more
than one pipeline. Some classifiers can also serve as feature engineering steps; in that
case, you may need SelectFromModel [6] to make them part of the feature engineering in
your pipeline.
• Apply cross-validation to find better parameter combinations for each pipeline.
If you apply GridSearchCV and the program halts for a long time, please remove the
n_jobs setting. In addition, you can set verbose to make sure your cross-validation is
still running.
• Please make sure hw2_studentID_demo is functional and returns the trained pipeline
model with the highest accuracy when mode=‘train’.
• If you apply new methods or use new packages to improve classification performance,
you must briefly introduce the key concepts and provide the necessary
citations/links rather than simply copying, pasting, or importing code.
• Please submit your ‘report’ in English. Be aware that a ‘report’ is much more than a
‘program.’
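The first two requirements can be sketched as follows: a pipeline whose selection step wraps a classifier in SelectFromModel, searched with GridSearchCV over both a bagging and an AdaBoost classifier. The random forest used for selection, the grid values, and the 3-fold setting are illustrative assumptions, the function name `build_search` is hypothetical, and the data are synthetic.

```python
import numpy as np
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def build_search(verbose=0):
    """Grid search over one pipeline with two interchangeable ensemble classifiers."""
    pipe = Pipeline([
        ("scale", StandardScaler()),
        # The forest's feature importances decide which features are kept.
        ("select", SelectFromModel(
            RandomForestClassifier(n_estimators=25, random_state=0))),
        ("clf", BaggingClassifier(random_state=0)),
    ])
    param_grid = [
        {"clf": [BaggingClassifier(random_state=0)],
         "clf__n_estimators": [5, 10]},
        {"clf": [AdaBoostClassifier(random_state=0)],
         "clf__n_estimators": [25, 50],
         "select__threshold": ["mean", "median"]},
    ]
    # n_jobs is left at its default; raise verbose to watch the search progress.
    return GridSearchCV(pipe, param_grid, cv=3, verbose=verbose)


# Synthetic three-class data standing in for the image features.
rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1, 2], 30)
X_demo = rng.normal(size=(90, 10)) + y_demo[:, None] * 2.0
search = build_search()
search.fit(X_demo, y_demo)
best_pipeline = search.best_estimator_
```

After fitting, `search.best_params_` shows which classifier and settings were chosen, which is the kind of detail the report is expected to describe.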
References
[1] DeepSat (SAT-6) airborne dataset on Kaggle. https://www.kaggle.com/crawford/deepsat-sat6.
Accessed: 2018-05-01.
[2] SAT-4 and SAT-6 airborne datasets. http://csc.lsu.edu/~saikat/deepsat/. Accessed:
2018-05-01.
[3] Others’ code on Kaggle. https://www.kaggle.com/crawford/deepsat-sat6/kernels.
Accessed: 2018-05-01.
[4] Markdown. https://daringfireball.net/projects/markdown/basics. Accessed:
2018-03-29.
[5] Pipeline and FeatureUnion: combining estimators.
http://scikit-learn.org/stable/modules/pipeline.html. Accessed: 2018-05-07.
[6] sklearn.feature_selection.SelectFromModel.
http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html.
Accessed: 2018-05-07.