$29
ENGR 421 DASC 521
Homework 07: Modeling Late Payments for Credit Card Bills
In this homework, you will develop a machine learning solution in R, Matlab, or Python for three
real-life classification problems from finance industry. Your machine learning algorithm needs to
predict whether a customer will delay his/her credit card bill payment more than 1 day (named as
target1), more than 31 days (named as target2), or more than 61 days (named as target3) using
the information given about each customer. Here are the steps you need to follow:
1. For each binary classification problem, you are given three input data files.
a. For the first problem, the files are named as hw07_target1_training_data.csv,
hw07_target1_training_label.csv, and hw07_target1_test_data.csv. The training
and test sets contain 11,000 and 5,813 data instances, respectively, where each
data instance has 162 features.
b. For the second problem, the files are named as hw07_target2_training_data.csv,
hw07_target2_training_label.csv, and hw07_target2_test_data.csv. The training
and test sets contain 9,000 and 4,752 data instances, respectively, where each data
instance has 211 features.
c. For the third problem, the files are named as hw07_target3_training_data.csv,
hw07_target3_training_label.csv, and hw07_target3_test_data.csv. The training
and test sets contain 5,000 and 2,951 data instances, respectively, where each data
instance has 202 features.
You are also given a very simple solution strategy using a boosting classifier in the file
named hw07_quick_and_dirty_solution.R.
2. Develop your own machine learning solution for these three problems. You are free to
use any publicly available packages in R, Matlab, or Python. The predictive quality of
your solutions will be evaluated in terms of AUROC (area under the receiver operating
characteristics curve) values on the test sets.
3. Use the trained algorithms from the previous step to perform predictions for the test data
sets, which contain 5,813, 4,752, and 2,951 customers for three problems. You are not
given the correct labels for test instances. You need to predict the scores or posterior
probabilities for positive class in each problem and to write these estimates into three
files. For example, the strategy implemented in hw07_quick_and_dirty_solution.R file
generates the estimates for the test sets and writes these values into three different files
named as hw07_target1_test_predictions.csv, hw07_target2_test_predictions.csv and
hw07_target3_test_predictions.csv.
What to submit: You need to submit your source code in a single file (.R file if you are using R,
.m file if you are using Matlab, or .py file if you are using Python), the estimated scores or
posterior probabilities for positive class on the test sets (hw07_target1_test_predictions.csv,
hw07_target2_test_predictions.csv, and hw07_target3_test_predictions.csv), and a detailed
report explaining your approach (.doc, .docx, or .pdf file). You will put these five files in a single
zip file named as STUDENTID.zip, where STUDENTID should be replaced with your 7-digit
student number.
How to submit: Submit the zip file you created to Blackboard. Please follow the exact style
mentioned and do not send a zip file named as STUDENTID.zip.