Assignment 5: Machine learning

This assignment will give you a chance to implement machine learning techniques on a realistic classification problem.
Step 1: Nearest neighbor. Implement a nearest neighbor classifier to predict the label of each test image. Write a program that is run from the command line like this:
python orient.py train_file.txt test_file.txt nearest
For each image to be classified, the program should find the “nearest” image in the training file, i.e. the one at the smallest distance (least vector difference) in Euclidean space. It should then display the classification accuracy (the percentage of correctly classified images) as well as a confusion matrix, and should write a file called nearest_output.txt that gives the estimated label for each image in the test file. The confusion matrix is a table with four rows and four columns. The entry at cell i,j should show the number of test exemplars whose correct label is i but that were classified as j (e.g. a “10” in the second column of the third row would mean that 10 test images were actually oriented at 180 degrees but were incorrectly classified by your classifier as being at 90 degrees). A perfect classifier will have a diagonal confusion matrix, but we’re unlikely to achieve that here. The nearest_output.txt file should contain one line per test image, with the photo id, a space, and then the estimated label, e.g.:
test/124567.jpg 180
test/8234732.jpg 0
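The Step 1 pipeline can be sketched roughly as follows. This is a minimal illustration, not a reference solution: it assumes the training data has already been parsed into (label, feature_vector) pairs, and the names LABELS, nearest_label, confusion_matrix, accuracy, and format_output are hypothetical helpers rather than anything required by the assignment. The row and column order 0, 90, 180, 270 is also an assumption, chosen to agree with the example in the text (third row is 180, second column is 90).

```python
LABELS = [0, 90, 180, 270]  # assumed row/column order of the confusion matrix


def squared_distance(a, b):
    """Squared Euclidean distance; it has the same minimizer as the true distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_label(train, vector):
    """Return the label of the training vector closest to `vector`.

    `train` is a list of (label, feature_vector) pairs (assumed layout).
    """
    best_label, _ = min(train, key=lambda pair: squared_distance(pair[1], vector))
    return best_label


def confusion_matrix(true_labels, predicted_labels):
    """Cell [i][j] counts examples whose correct label is i but were classified as j."""
    index = {label: k for k, label in enumerate(LABELS)}
    matrix = [[0] * len(LABELS) for _ in LABELS]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Percentage of examples that fall on the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    return 100.0 * correct / total


def format_output(photo_ids, predicted_labels):
    """One line per test image: photo id, a space, then the estimated label."""
    return "".join(f"{pid} {lab}\n" for pid, lab in zip(photo_ids, predicted_labels))
```

A brute-force scan like this costs one distance computation per training image for every test image, so running time grows with the product of the two set sizes; that is worth keeping in mind when timing experiments.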
Step 2: Adaboost. Implement a technique using decision stumps and Adaboost. Use very simple decision stumps that simply compare one entry in the image matrix to another, e.g. compare the red pixel value at position 1,1 to the green pixel value at position 3,8. You can try all possible combinations (roughly 192^2) or randomly generate some pairs to try. Your program should be run like this:
python orient.py train_file.txt test_file.txt adaboost stump_count
and should train on the training file with the specified number of stumps, test on the testing file, display the classification accuracy and confusion matrix, and output adaboost_output.txt, with the same format as in Step 1.
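One possible shape for Step 2 is sketched below. AdaBoost in its basic form is a two-class method, and the text does not say how to extend it to four orientations; the one-vs-rest wrapper here is therefore an assumption, not the required design. All function names are hypothetical, and each stump also carries a polarity flag (an added detail) so that "entry i > entry j" can vote for either class.

```python
import math


def stump_predict(pair, x):
    """Decision stump: +1 if entry i of x exceeds entry j, else -1."""
    i, j = pair
    return 1 if x[i] > x[j] else -1


def train_adaboost(X, y, stump_count, candidate_pairs):
    """Binary AdaBoost; y holds +1/-1. Returns a list of (alpha, pair, polarity)."""
    n = len(X)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(stump_count):
        # Pick the candidate stump (and polarity) with the lowest weighted error.
        best = None
        for pair in candidate_pairs:
            err = sum(w for w, x, t in zip(weights, X, y) if stump_predict(pair, x) != t)
            # Weights sum to 1, so the flipped stump has error 1 - err.
            for polarity, e in ((1, err), (-1, 1.0 - err)):
                if best is None or e < best[0]:
                    best = (e, pair, polarity)
        err, pair, polarity = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, pair, polarity))
        # Misclassified examples gain weight, correct ones lose it; then normalize.
        weights = [w * math.exp(-alpha * t * polarity * stump_predict(pair, x))
                   for w, x, t in zip(weights, X, y)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def adaboost_score(ensemble, x):
    """Weighted vote of all stumps; the sign gives the binary prediction."""
    return sum(alpha * polarity * stump_predict(pair, x)
               for alpha, pair, polarity in ensemble)


def train_one_vs_rest(X, labels, classes, stump_count, candidate_pairs):
    """Assumed multi-class extension: one binary ensemble per class."""
    return {c: train_adaboost(X, [1 if l == c else -1 for l in labels],
                              stump_count, candidate_pairs)
            for c in classes}


def predict_one_vs_rest(models, x):
    """Pick the class whose ensemble gives the highest score."""
    return max(models, key=lambda c: adaboost_score(models[c], x))
```

Because every round scans all candidate pairs over all training examples, passing a randomly sampled subset of pairs as candidate_pairs is the simplest way to trade accuracy for training time.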
Step 3: Neural network classification. Implement a fully-connected feed-forward network to classify image orientation, and implement the backpropagation algorithm to train the network using gradient descent. Your network should have one hidden layer (i.e. three layers total: the input layer, the hidden layer, and the output layer). Your program should be run like this:
python orient.py train_file.txt test_file.txt nnet hidden_count
and should train on the training file, using the specified number of hidden nodes, test on the testing file, display the classification accuracy and confusion matrix, and output nnet_output.txt, with the same format as in Step 1.
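A compact sketch of a one-hidden-layer network trained by backpropagation is given below. Several choices here are assumptions made for illustration only: sigmoid hidden units, a softmax output with cross-entropy loss, stochastic (per-example) updates, uniform random initialization, and class targets given as indices 0..3 rather than degrees. The class name OneHiddenLayerNet is hypothetical. Inputs are best scaled to a small range such as [0, 1] before training.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softmax(zs):
    m = max(zs)  # subtract the max to avoid overflow
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]


class OneHiddenLayerNet:
    def __init__(self, n_in, hidden_count, n_out, seed=0):
        rng = random.Random(seed)
        # Each row holds the weights of one unit; the last entry is its bias.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                   for _ in range(hidden_count)]
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_count + 1)]
                   for _ in range(n_out)]

    def forward(self, x):
        h = [sigmoid(sum(w * v for w, v in zip(row, x)) + row[-1]) for row in self.w1]
        out = softmax([sum(w * v for w, v in zip(row, h)) + row[-1] for row in self.w2])
        return h, out

    def train_step(self, x, target, lr):
        """One stochastic gradient descent step on a single example."""
        h, out = self.forward(x)
        # Softmax with cross-entropy: output delta is prediction minus one-hot target.
        delta_out = [o - (1.0 if k == target else 0.0) for k, o in enumerate(out)]
        # Backpropagate through the sigmoid, whose derivative is h * (1 - h).
        delta_hid = [h[j] * (1 - h[j]) *
                     sum(delta_out[k] * self.w2[k][j] for k in range(len(out)))
                     for j in range(len(h))]
        for k, row in enumerate(self.w2):
            for j in range(len(h)):
                row[j] -= lr * delta_out[k] * h[j]
            row[-1] -= lr * delta_out[k]
        for j, row in enumerate(self.w1):
            for i in range(len(x)):
                row[i] -= lr * delta_hid[j] * x[i]
            row[-1] -= lr * delta_hid[j]

    def predict(self, x):
        """Index of the output unit with the highest probability."""
        _, out = self.forward(x)
        return max(range(len(out)), key=out.__getitem__)
```

A pure-Python loop like this is easy to follow but slow on thousands of images; a vectorized version of the same updates would be the natural next step if training time becomes a problem.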
Step 4: Analysis and improvement. Each of the above machine learning techniques has a number of parameters and design decisions. For example, neural networks have network structure parameters (number of layers, number of nodes per hidden layer), as well as which activation function to use, whether to use traditional or stochastic gradient descent, which learning rate to use, etc. It would be impossible to try all possible combinations of all of these parameters, so identify a few parameters and conduct experiments to study their effect on the final classification accuracy. In your report, present neatly-organized tables or graphs showing classification accuracies and running times as a function of the parameters you choose. Which classifiers and which parameters would you recommend to a potential client? How does performance vary depending on the training dataset size, i.e. if you use just a fraction of the training data? Show a few sample images that were classified correctly and incorrectly. Do you see any patterns to the errors?
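The training-set-size experiment can be organized with a small harness like the one below. It is only a sketch: accuracy_vs_fraction and format_table are hypothetical names, fit and predict stand in for whichever classifier is being studied, and the data is assumed to be a list of (label, feature_vector) pairs. The reported time covers both training and testing on each subset.

```python
import random
import time


def accuracy_vs_fraction(train, test, fit, predict, fractions, seed=0):
    """Return rows of (fraction, n_train, accuracy_percent, seconds).

    `fit(subset)` returns a model; `predict(model, x)` returns a label.
    """
    rng = random.Random(seed)
    shuffled = train[:]
    rng.shuffle(shuffled)  # shuffle once so smaller subsets nest inside larger ones
    rows = []
    for frac in fractions:
        subset = shuffled[:max(1, int(frac * len(shuffled)))]
        start = time.perf_counter()
        model = fit(subset)
        correct = sum(predict(model, x) == label for label, x in test)
        elapsed = time.perf_counter() - start
        rows.append((frac, len(subset), 100.0 * correct / len(test), elapsed))
    return rows


def format_table(rows):
    """Render the rows as a fixed-width plain-text table for the report."""
    lines = [f"{'fraction':>8} {'n_train':>8} {'accuracy%':>10} {'seconds':>8}"]
    for frac, n, acc, secs in rows:
        lines.append(f"{frac:>8.2f} {n:>8d} {acc:>10.1f} {secs:>8.3f}")
    return "\n".join(lines)
```

The same loop works for any other parameter (stump count, hidden node count, learning rate) by replacing the fraction with the parameter value and closing over it inside fit.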
Finally, modify your code so that when run like this:
python orient.py train_file.txt test_file.txt best model_file
it uses whichever algorithm and parameter settings you recommend to give the best accuracy. The model_file argument is optional, and you can choose whether to implement it; if you’d like, your training routines can write their learned model to disk, so that when we run the “best” algorithm, training does not actually occur but instead a pre-trained model file is loaded. Please use this option if your models take a long time to train (more than a few minutes). As in the past, a small percentage of your assignment grade will be based on how accurate your “best” algorithm is with respect to the rest of the class. We will use a separate test dataset, so make sure to avoid overfitting!
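One way to handle the optional model file is sketched here, assuming the learned model is an ordinary Python object that the standard pickle module can serialize. The helpers parse_args and load_or_train are hypothetical names, and train_fn stands in for whichever training routine is chosen. Pickle files should only be loaded when you created them yourself, since unpickling untrusted data can execute arbitrary code.

```python
import os
import pickle


def parse_args(argv):
    """Split the command line into (train_file, test_file, mode, extra).

    `extra` is the optional fourth argument (stump count, hidden count,
    or model file depending on the mode) and is None when omitted.
    """
    train_file, test_file, mode = argv[1:4]
    extra = argv[4] if len(argv) > 4 else None
    return train_file, test_file, mode, extra


def load_or_train(model_file, train_fn):
    """Load a pickled model if `model_file` exists; otherwise train one.

    When a path is given but no file exists yet, the freshly trained
    model is saved there so later runs can skip training.
    """
    if model_file and os.path.exists(model_file):
        with open(model_file, "rb") as f:
            return pickle.load(f)
    model = train_fn()
    if model_file:
        with open(model_file, "wb") as f:
            pickle.dump(model, f)
    return model
```

With this layout, the first run with a given path trains and saves the model, and every later run with the same path loads it directly.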
