1 Aim
Please identify each of a large number of black-and-white rectangular pixel displays as
one of the 26 capital letters in the English alphabet. You have to use the various
classification models taught in class up to Chapter Five of the textbook. Beyond the
necessary data preprocessing, such as partitioning the dataset into separate training and
test sets and scaling, you are required to investigate the effectiveness of feature
selection/extraction. You may apply new methods or use new packages to improve the
classification performance, but if you do so, you have to give a brief introduction to the
key concepts and provide the necessary citations, rather than simply copying and pasting
or importing them. However, in this assignment you are not allowed to use any
neural-network-related models (e.g., multilayer perceptron, LeNet, etc.). If any
neural-network-related method is applied, you will receive no credit. Whenever an
algorithm package is merged or imported into your code, please list a link to the package
in your references and describe its mathematical concepts in your report, followed by the
reason for its adoption.
2 Dataset Description
The dataset can be downloaded from the UCI Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets/Letter+Recognition.
There are 20,000 instances in the dataset. Each instance has 16 features and one class label.
You can find the dataset information on the web page above.
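For reference, the data can be loaded as follows (a minimal sketch in Python, assuming the standard letter-recognition.data file from the UCI page, whose first column is the class letter followed by the 16 integer features; the 80/20 split is just an example choice):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # First column is the class letter; the remaining 16 are integer features.
    cols = ["label"] + [f"f{i}" for i in range(16)]
    df = pd.read_csv("letter-recognition.data", header=None, names=cols)
    X, y = df[cols[1:]].to_numpy(), df["label"].to_numpy()

    # Hold out a stratified test set, as required by the preprocessing step.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)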
3 Submission Format
You have to submit a compressed file hw1_studentID.zip which contains the following
files:
1. hw1_studentID.ipynb: the detailed report, including Python code, results, discussion, and mathematical descriptions;
2. hw1_studentID.tplx: extra LaTeX-related settings, including the bibliography;
3. hw1_studentID.bib: citations in "BibTeX" format;
4. hw1_studentID.pdf: the PDF version of your report, exported from your ipynb with
(a) %% jupyter nbconvert --to latex --template hw1_studentID.tplx hw1_studentID.ipynb
(b) %% pdflatex hw1_studentID.tex
(c) %% bibtex hw1_studentID
(d) %% pdflatex hw1_studentID.tex
(e) %% pdflatex hw1_studentID.tex
5. Other files or folders, in a path hierarchy that works with your Jupyter notebook (ipynb).
4 Coding Guidelines
For the purpose of an individual demonstration with the TA, you are required to create a
function in your Jupyter notebook, as specified below, that reduces the data dimensionality,
learns a classification model, and evaluates the performance of the learned model.
• hw1_studentID_handwritten(in_x, in_label, mode, feature_engr, f_para, classification, c_para, other_para)
– in_x: [string] a csv file or a folder path for the handwritten letter image data.
– in_label: [string] a csv file or a folder path containing the labels for the corresponding instances in in_x.
– mode: [string] ‘featengr’ for reducing the data dimensionality by feature engineering; ‘training’ for building models; ‘test’ for using a built model to evaluate performance.
– feature_engr: [None or string] described in the Report Requirement.
– f_para: [None or numpy array] default None, declaring the necessary parameter(s) for feature selection/extraction.
– classification: [None or string] described in the Report Requirement.
– c_para: [None or numpy array] default None, declaring the necessary parameter(s) for classification.
– other_para: [None or numpy array] default None, declaring the necessary parameter(s) for your program other than those for feature_engr and classification.
When mode = ‘test’, please dump the results to the following files:
∗ hw1_studentID_results.csv: one column with header ‘label’;
∗ hw1_studentID_performance.txt: the performance (accuracy) in ‘%’. Output only a single
number of type “float”, without any extra “string” words.
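A minimal skeleton of this function is sketched below. It only illustrates the mode dispatch and the two test-mode output files; it assumes in_x and in_label are headerless csv files, uses a KNN classifier as a stand-in for the full feature_engr/classification lookup, and keeps the fitted model in module-level state so the ‘training’ and ‘test’ calls can share it:

    import pandas as pd
    from sklearn.neighbors import KNeighborsClassifier

    _state = {}  # shared between the 'training' and 'test' calls

    def hw1_studentID_handwritten(in_x, in_label, mode, feature_engr=None,
                                  f_para=None, classification=None,
                                  c_para=None, other_para=None):
        X = pd.read_csv(in_x, header=None).to_numpy()
        y = pd.read_csv(in_label, header=None).to_numpy().ravel()
        if mode == "featengr":
            pass  # apply the method named by feature_engr, using f_para
        elif mode == "training":
            # Stand-in for the classification/c_para dispatch table.
            _state["clf"] = KNeighborsClassifier().fit(X, y)
        elif mode == "test":
            pred = _state["clf"].predict(X)
            acc = 100.0 * (pred == y).mean()
            # Required outputs: a 'label' column, and one float accuracy in %.
            pd.DataFrame({"label": pred}).to_csv("hw1_studentID_results.csv",
                                                 index=False)
            with open("hw1_studentID_performance.txt", "w") as f:
                f.write(str(float(acc)))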
5 Report Requirement
• List names of packages used in your program;
• Describe the keywords in the arguments of your function hw1_studentID_handwritten(in_x,
in_label, mode, feature_engr, f_para, classification, c_para, other_para):
– a list of feature_engr methods, for example:
∗ None: (default) no feature engineering (selection/extraction)
∗ ‘L1’: L1-regularization feature selection
∗ ‘SFS’: sequential feature selection
∗ ‘Forest’: assessing feature importance with random forest
∗ ‘PCA’: principal component analysis
∗ ‘GKPCA’: Gaussian kernel principal component analysis
∗ ‘LDA’: linear discriminant analysis
∗ and so on;
– a list of classification methods, for example:
∗ None: used when mode = ‘featengr’
∗ ‘SVM’: Support vector machine
∗ ‘GKSVM’: Gaussian kernel support vector machine
∗ ‘logReg’: Logistic regression
∗ ‘Perceptron’: Perceptron
∗ ‘KNN’: K-nearest neighbors
∗ ‘Decision’: Decision tree
∗ ‘Forest’: Random forest
∗ and so on;
• To explain your approach more clearly, draw flowcharts of the methods or procedures used
in the program;
• Describe the mathematical concepts of any new algorithms or models employed, as well as
the roles they play in your feature selection/extraction or classification task, in
Markdown cells [?];
• Discuss the performance of the different classifiers with and without feature selection/extraction.
5.1 Basic Requirement
• Use the original grayscale image data, without any feature selection/extraction, to
perform classification. Then compare the results after feature selection (such as L1
regularization, sequential feature selection, or feature-importance assessment with a
random forest) or feature extraction (such as PCA, kernel PCA, or LDA) is applied; see
the first sketch after this list.
• All the classifiers taught in class should be investigated and their performance
compared. For SVM, you should investigate both linear SVM and kernel SVM. For the
perceptron, logistic regression, and SVM classifiers, you should also investigate their
stochastic gradient descent (SGD) versions provided in scikit-learn to handle large
datasets [?][?]; see the second sketch after this list.
• If you apply new methods or use new packages to improve the classification performance,
you have to give a brief introduction to the key concepts and provide the necessary
citations/links, rather than simply copying and pasting or importing them.
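As an illustration of the first requirement, the sketch below compares a baseline pipeline against the same pipeline with a PCA step inserted. It assumes the X_train/X_test/y_train/y_test split from the earlier loading sketch; logistic regression and n_components=10 are arbitrary example choices, not prescribed settings:

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression

    # Same scaler and classifier; the only difference is the PCA step.
    pipelines = {
        "raw": make_pipeline(StandardScaler(),
                             LogisticRegression(max_iter=1000)),
        "PCA": make_pipeline(StandardScaler(), PCA(n_components=10),
                             LogisticRegression(max_iter=1000)),
    }
    for name, pipe in pipelines.items():
        pipe.fit(X_train, y_train)
        print(name, pipe.score(X_test, y_test))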
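For the SGD versions mentioned in the second requirement, scikit-learn's SGDClassifier covers all three linear models through its loss parameter. A sketch, again assuming the earlier split (hyperparameters are left at illustrative defaults; note that scikit-learn versions before 1.1 spell the logistic loss ‘log’ rather than ‘log_loss’):

    from sklearn.linear_model import SGDClassifier

    # loss='perceptron' -> perceptron, loss='log_loss' -> logistic regression,
    # loss='hinge' -> linear SVM, each trained by stochastic gradient descent.
    for loss in ("perceptron", "log_loss", "hinge"):
        clf = SGDClassifier(loss=loss, max_iter=1000, random_state=0)
        clf.fit(X_train, y_train)
        print(loss, clf.score(X_test, y_test))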