CSCE 623: Machine Learning
HW1
Your homework will be composed of an integrated written portion and Python programming component. You will
produce a single jupyter notebook file (*.ipynb). You will be using the Auto.csv dataset provided. In your answers to
written questions, even if the question asks for a single number or other form of short answer (such as yes/no or which is
better: A or B) you must provide supporting information for your answer to obtain full credit. Use Python to perform
calculations or mathematical transformations, or provide python-generated graphs and figures or other evidence that
explain how you determined the answer. Use both code cells and markdown cells in your jupyter notebook. A shell is
provided to get you started.
Simple Linear Regression
1. Load the “Auto.csv” dataset. Note that missing values (e.g. “?”) must be handled; one suggestion is to remove
the affected observations. Store the data in a pandas dataframe called “data”.
2. Explore the dataset. Useful pandas functions include .info and .hist as well as scatter_matrix in
pandas.plotting (pandas.tools.plotting in older pandas versions).
a. Display statistics of the dataset. How many numerical features/attributes are there? How many observations/
datapoints?
b. Display a histogram of each of the individual feature values. Describe these distributions in terms of
descriptions from statistics (e.g. uniform, Gaussian, exponential, skewed, multi-modal)
c. Choose a subset of at least 5 attributes you expect to have relationships and display a scatterplot for every
possible pair of these attributes. Which pairs show linear relationships? Nonlinear? Which pairs have strong
relationships and which appear to have weak relationships? Describe the phenomena that you see in your plots.
3. Make a scatterplot (Horsepower vs. MPG). Set the axes so that the origin (0,0) is included, as well as all of the
datapoints. Label the axes appropriately (“Horsepower”, “MPG”). On this Horsepower vs. MPG plot, assume that β0 is
fixed at 40. Estimate what the slope β1 of the best fit line is for the dataset (eyeball an educated guess) given that β0 is
fixed at 40. Report your eyeball estimate for β1 using a markdown cell in jupyter.
4. Using code, make a vector of possible β1 values that surround what you think the slope of the best fit line is (hint: use
the linspace function in numpy). Display the vector of these numerical β1 values.
5. Make a python function “rss1d(beta0,beta1,x,y)” for computing cost: this function should compute the residual
sum of squares (RSS) for the dataset for a given β0 and β1. Then use this function to compute the RSS for the fixed
β0 under each of the β1 values from step 4 and store the cost for each value of β1. You may find a loop
handy here.
6. Using your results from step 5, make a new plot of β1 value vs RSS cost. Your axes should be labeled as β1 on the
x-axis and RSS on the y-axis. If possible, see if you can make the subscripted beta appear as math-style text in the
x-axis label.
7. Answer these questions in your report: Describe the shape of the plot in step 6. Explain how someone could use the
plot to find the best value of β1. Select the value of β1 you think will have the best fit (you may want to improve your
estimate by exploring near it: add additional values for β1 and repeat steps 3-6).
8. Determine the linear regression line formed when β0 is 40 and β1 is the value you selected in step 7. Make a new
plot which displays a red linear regression line overlaid on a Horsepower vs. MPG scatterplot of the original dataset
points.
9. Review eqn 3.4 on page 62. In code, develop the closed-form function computeBetas(xVec, yVec) which
accepts a vector of x values and a vector of y values and returns betas, which is a structure containing the values for
the 2 coefficients β0 and β1
10. Compute β0 and β1 for the Auto dataset using the closed-form function you created in step 9.
11. How does the closed-form computed value of β1 compare with your estimate of β1 from step 7? Discuss in your
report.
12. Make a new plot which displays a green linear regression line formed by the closed-form expression (from steps 9
and 10) overlaid on a Horsepower vs. MPG scatterplot of the original dataset points.
13. Now use sklearn’s linear_model module (e.g. its LinearRegression class) to fit a linear model from horsepower to
mpg. What are the model’s coefficients, MSE, and explained variance score?
14. Make a new plot which displays a black linear regression line formed by the sklearn linear model (from step 13)
overlaid on a Horsepower vs. MPG scatterplot of the original dataset points.
15. Explore the residual errors from using the linear model to make predictions:
a. Compute the residual errors in using the model to predict mpg from horsepower. Plot these residual errors as
a function of horsepower using a scatterplot. Add a red horizontal line at y=0 to indicate the zero-error
position.
b. Describe the plot - particularly the trends. Do the errors appear well-distributed, or are there trends? If there
are trends: describe the trends, explain what these trends indicate about the ability to predict mpg from
horsepower using a linear model, and give at least one course of action you could take to make a better model.
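As a rough illustration of steps 4, 5, 9 and 10, here is a minimal sketch of the two functions and the β1 sweep. It assumes that eqn 3.4 refers to the standard least-squares formulas for simple linear regression, and it runs on synthetic, noise-free stand-in data (the names x_demo and y_demo are made up for this sketch), not on the Auto dataset. Returning the betas as a tuple is only one possible choice of structure.

```python
import numpy as np

def rss1d(beta0, beta1, x, y):
    """Residual sum of squares for the line y_hat = beta0 + beta1 * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    residuals = y - (beta0 + beta1 * x)
    return float(np.sum(residuals ** 2))

def computeBetas(xVec, yVec):
    """Closed-form least-squares estimates, returned as (beta0, beta1)."""
    x = np.asarray(xVec, dtype=float)
    y = np.asarray(yVec, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    beta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    beta0 = y_mean - beta1 * x_mean
    return float(beta0), float(beta1)

# ASSUMPTION: synthetic stand-in data lying exactly on y = 40 - 0.15 * x.
# Replace these with the horsepower and mpg columns of your dataframe.
x_demo = np.linspace(50, 200, 31)
y_demo = 40.0 - 0.15 * x_demo

# Step 4: candidate slopes around an eyeballed guess.
beta1_candidates = np.linspace(-0.3, 0.0, 31)

# Step 5: RSS for each candidate slope with beta0 fixed at 40.
costs = np.array([rss1d(40.0, b1, x_demo, y_demo) for b1 in beta1_candidates])
best_beta1 = beta1_candidates[np.argmin(costs)]

# Steps 9 and 10: closed-form estimates on the same data.
beta0_hat, beta1_hat = computeBetas(x_demo, y_demo)
print(best_beta1, beta0_hat, beta1_hat)
```

The residuals for step 15 can be formed the same way as inside rss1d, as y minus (beta0_hat + beta1_hat * x), and then plotted against x.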
Optional (not required, but good practice in developing your coding skills): build a structure (e.g. a matrix) containing
possible β0 and β1 pairs. Compute the RSS on the horsepower vs. MPG data for the beta pair at each cell of the matrix. Now
build a contour and/or 3D plot of these RSS values as shown in the book Figure 3.2 on page 63 (the x and y axes are β1
and β0 and the z axis is RSS). Write code to determine the beta pair with the minimum RSS. Report the minimum cost
value. On your contour/3D plot, add a point at the location of the β0, β1 coordinates which minimize the RSS.
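One possible way to approach the optional grid search is sketched below. It again uses synthetic, noise-free stand-in data (x_demo and y_demo are made-up names, not the Auto dataset), and the grid ranges are arbitrary assumptions that you would adapt to your own estimates.

```python
import numpy as np

# ASSUMPTION: synthetic stand-in data lying exactly on y = 40 - 0.15 * x.
x_demo = np.linspace(50, 200, 31)
y_demo = 40.0 - 0.15 * x_demo

# Candidate values for each coefficient (ranges chosen only for this sketch).
beta0_vals = np.linspace(30, 50, 41)
beta1_vals = np.linspace(-0.3, 0.0, 61)

# B0[i, j] = beta0_vals[i] and B1[i, j] = beta1_vals[j].
B0, B1 = np.meshgrid(beta0_vals, beta1_vals, indexing="ij")

# Broadcast over the data: residuals has shape (len(beta0), len(beta1), n).
residuals = y_demo - (B0[..., None] + B1[..., None] * x_demo)
rss_grid = np.sum(residuals ** 2, axis=-1)

# Locate the cell with the smallest RSS.
i, j = np.unravel_index(np.argmin(rss_grid), rss_grid.shape)
best_beta0 = beta0_vals[i]
best_beta1 = beta1_vals[j]
min_rss = rss_grid[i, j]
print(best_beta0, best_beta1, min_rss)
```

With matplotlib imported as plt, a call such as plt.contour(B1, B0, rss_grid) followed by plt.plot(best_beta1, best_beta0, "ro") would draw the contours and mark the minimum.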
Helpful Tips
You might find these python packages/imports useful:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import datasets, linear_model
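For step 1, one way to handle the “?” entries is to have pandas read them as missing values and then drop the affected rows. The sketch below is only an illustration: it reads a tiny made-up CSV from memory (the rows and the three column names are assumptions for the demo), so swap in the path to Auto.csv for real use.

```python
import io
import pandas as pd

# ASSUMPTION: toy rows invented for this sketch, not taken from Auto.csv.
toy_csv = io.StringIO(
    "mpg,horsepower,name\n"
    "18.0,130,car a\n"
    "25.0,?,car b\n"
    "30.5,70,car c\n"
)

# Treat "?" as a missing value so numeric columns parse as numbers.
data = pd.read_csv(toy_csv, na_values="?")

# Remove every observation that has at least one missing value.
data = data.dropna().reset_index(drop=True)

data.info()
```

Reading “?” as missing at load time keeps horsepower numeric; without it, pandas would typically parse that column as text.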