605.649 — Introduction to Machine Learning
Programming Project #4
The purpose of this assignment is to give you a firm foundation in comparing a variety of linear classifiers.
In this project, you will compare two different algorithms, one of which you have already implemented: Adaline
and Logistic Regression. You will also use the same five datasets from the UCI Machine Learning Repository
that you used in Project 1, namely:
1. Breast Cancer — https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29
This breast cancer database was obtained from the University of Wisconsin Hospitals, Madison, from
Dr. William H. Wolberg.
2. Glass — https://archive.ics.uci.edu/ml/datasets/Glass+Identification
The study of classification of types of glass was motivated by criminological investigation.
3. Iris — https://archive.ics.uci.edu/ml/datasets/Iris
The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant.
4. Soybean (small) — https://archive.ics.uci.edu/ml/datasets/Soybean+%28Small%29
A small subset of the original soybean database.
5. Vote — https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records
This data set includes votes for each of the U.S. House of Representatives Congressmen on the 16 key
votes identified by the Congressional Quarterly Almanac.
When using these data sets, be mindful of the following issues.
1. Not all of these data sets correspond to 2-class classification problems. A method for handling multi-class
classification was described for Logistic Regression. For Adaline, it is suggested that you use what
is called a “multi-net,” where you train a single network with multiple outputs. Note that if
you wish to apply a one-vs-one or one-vs-all strategy for the neural network, that is acceptable. Just
be sure to explain your strategy in your report.
2. Some of the data sets have missing attribute values. When these occur in low numbers, you may simply
edit the corresponding examples out of the data sets. For more occurrences, you should do some kind of
“data imputation” where, basically, you generate a value of some kind. This can be purely random, or
it can be sampled according to the conditional probability of the values occurring, given the underlying
class for that example. The choice is yours, but be sure to document your choice. (One imputation
approach is sketched just after this list.)
3. Most of the attributes in the various data sets are either multi-value discrete (categorical) or real-valued.
You will need to deal with this in some way. For the multi-value situation, you can apply what is called
“one-hot coding,” where you create a separate Boolean attribute for each value. For the continuous
attributes, you may use one-hot coding if you wish, but there is actually a better way: it is recommended
that you normalize them first to be in the range −1 to +1 and apply the inputs directly. (If you want to
normalize to be in the range 0 to 1, that’s fine. Just be consistent.) Both transformations are illustrated
in the sketch just after this list.
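To make items 2 and 3 concrete, here is a minimal sketch in Python with NumPy. The function names, the
toy data, and the choice of class-conditional sampling for imputation are all illustrative assumptions, not
required implementations; any documented, consistent approach is acceptable.

import numpy as np

def impute_conditional(values, labels, rng):
    """Fill missing entries (None) by sampling from the values observed
    on other examples of the same class (class-conditional imputation).
    Assumes every class has at least one observed value."""
    values = list(values)
    for i, v in enumerate(values):
        if v is None:
            pool = [u for u, y in zip(values, labels)
                    if u is not None and y == labels[i]]
            values[i] = rng.choice(pool)
    return values

def one_hot(column):
    """Map a categorical column to one Boolean attribute per distinct value."""
    categories = sorted(set(column))
    codes = np.array([[1.0 if v == c else 0.0 for c in categories]
                      for v in column])
    return codes, categories

def normalize(column, lo=-1.0, hi=1.0):
    """Min-max normalize a real-valued column into [lo, hi]."""
    x = np.asarray(column, dtype=float)
    xmin, xmax = x.min(), x.max()
    if xmax == xmin:                        # constant column: map to midpoint
        return np.full_like(x, (lo + hi) / 2.0)
    return lo + (hi - lo) * (x - xmin) / (xmax - xmin)

# Toy data (hypothetical, not one of the UCI sets):
rng = np.random.default_rng(0)
labels = ["a", "a", "b", "b"]
filled = impute_conditional(["red", None, "blue", "blue"], labels, rng)
codes, categories = one_hot(filled)
scaled = normalize([2.0, 4.0, 6.0, 8.0])
print(categories, codes.tolist(), scaled.tolist())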
For this project, the following steps are required:
• Download the five (5) data sets from the UCI Machine Learning repository. You can find this repository
at http://archive.ics.uci.edu/ml/. All of the specific URLs are also provided above.
• Pre-process each data set as necessary to handle missing data and non-Boolean data (both classes and
attributes).
• Implement Adaline and Logistic Regression. (A sketch of the core weight updates appears after this list.)
• Run your algorithms on each of the data sets. These runs should be done with 5-fold cross-validation
so you can compare your results statistically. You can use classification error, cross-entropy loss, or
mean squared error (as appropriate) for your loss function. (A cross-validation sketch also appears
after this list.)
• Your runs should also output the learned models in a way that can be interpreted by a human, and
they should output the classifications on all of the test examples. If you are doing cross-validation,
just output the classifications for one fold each.
• Write a very brief paper that incorporates the following elements, summarizing the results of your
experiments. Your paper is required to be at least 5 pages and no more than 10 pages using the JMLR
format. You can find templates for this format at http://www.jmlr.org/format/format.html. The
format is also available within Overleaf.
1. Title and author name
2. Problem statement, including your hypothesis projecting how you expect each algorithm to perform
3. Brief description of your experimental approach, including any assumptions made with your algorithms
4. Presentation of the results of your experiments
5. A discussion of the behavior of your algorithms, combined with any conclusions you can draw
6. Summary
7. References (Only required if you use a resource other than the course content.)
• Submit your fully documented code, the video demonstrating the running of your programs, and your
paper.
• For the video, the following constitute minimal requirements that must be satisfied:
– The video is to be no longer than 5 minutes.
– The video should be provided in mp4 format. Alternatively, it can be uploaded to a streaming
service such as YouTube with a link provided.
– Fast forwarding is permitted through long computational cycles. Fast forwarding is not permitted
whenever there is a voice-over or when results are being presented.
– Provide sample outputs from one test set showing the classification performance of Adaline and
Logistic Regression
– Show a sample trained Adaline model and Logistic Regression model
– Demonstrate the weight updates for Adaline and Logistic Regression. For Logistic Regression,
show the multi-class case
– Demonstrate the gradient calculation for Adaline and Logistic Regression. For Logistic Regression,
show the multi-class case
– Show the average performance over the five folds for Adaline and Logistic Regression
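To illustrate the kind of weight updates and gradient calculations the video must demonstrate, here is a
minimal sketch in Python with NumPy: batch gradient descent for a binary Adaline with mean-squared-error
loss, and for multi-class logistic regression via softmax with cross-entropy loss. The learning rates and toy
data are assumptions for illustration only; your own implementation and design choices may differ.

import numpy as np

def adaline_step(X, y, w, eta=0.01):
    """One batch gradient-descent step for Adaline on {-1, +1} targets.
    The activation is the identity, so the MSE gradient is X^T(Xw - y)/n."""
    grad = X.T @ (X @ w - y) / len(y)
    return w - eta * grad

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)    # stabilize the exponentials
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def logreg_step(X, Y, W, eta=0.1):
    """One batch gradient-descent step for multi-class logistic regression.
    Y is one-hot (n x k); the cross-entropy gradient is X^T(softmax(XW) - Y)/n."""
    grad = X.T @ (softmax(X @ W) - Y) / len(Y)
    return W - eta * grad

# Toy run on made-up data, with a bias column prepended to X.
# For a multi-class Adaline, train one weight vector like w per class
# (one-vs-all) or stack them into a multi-output "multi-net".
rng = np.random.default_rng(1)
X = np.hstack([np.ones((6, 1)), rng.normal(size=(6, 2))])
y = np.array([1, 1, 1, -1, -1, -1])        # Adaline targets in {-1, +1}
Y = np.eye(2)[(y > 0).astype(int)]         # one-hot labels for logistic regression

w = np.zeros(3)
W = np.zeros((3, 2))
for _ in range(100):
    w = adaline_step(X, y, w)
    W = logreg_step(X, Y, W)
print("Adaline predictions:", np.sign(X @ w))
print("Logistic Regression predictions:", softmax(X @ W).argmax(axis=1))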
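A minimal sketch of 5-fold cross-validation with classification error as the loss is given below. The
train/predict interface and the trivial majority-class stand-in model are hypothetical placeholders for your
own Adaline and Logistic Regression implementations.

import numpy as np

def five_fold_indices(n, rng):
    """Shuffle the example indices and split them into 5 roughly equal folds."""
    return np.array_split(rng.permutation(n), 5)

def cross_validate(X, y, train, predict, rng):
    """Train on 4 folds, test on the held-out fold, and average the
    classification error over the 5 folds."""
    errors = []
    for test_idx in five_fold_indices(len(y), rng):
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        model = train(X[train_mask], y[train_mask])
        y_hat = predict(model, X[test_idx])
        errors.append(np.mean(y_hat != y[test_idx]))
    return float(np.mean(errors))

# Usage with a trivial majority-class "model" standing in for a real learner:
rng = np.random.default_rng(2)
X = rng.normal(size=(20, 3))
y = rng.integers(0, 2, size=20)
train = lambda X, y: np.bincount(y).argmax()
predict = lambda model, X: np.full(len(X), model)
print("mean 5-fold classification error:", cross_validate(X, y, train, predict, rng))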
Your grade will be broken down as follows:
• Code structure – 10%
• Code documentation/commenting – 10%
• Proper functioning of your code, as illustrated by a 5-minute video – 30%
• Summary paper – 50%