Lab 6: Linear Regression
Follow ALL instructions; otherwise, you may lose points. In this lab, you will be finding the best fit line using two
methods. You will need to use numpy, pandas, and matplotlib for this lab.
Background (least squares regression):
Least squares regression is a popular method to find the line of best fit. Although I wanted to go
over how to do it in class, we don't have time, so I'll do my best to explain it through the
words and examples on this paper.
The goal is to calculate the slope (m) and y-intercept (b) in the equation of the line:
   y = mx + b
The steps to compute the line of best fit for N ordered pairs:
1. For each point (x, y), calculate x² and xy
2. Find Σx, Σy, Σx², Σxy
3. Calculate the slope (N is the number of ordered pairs):
   m = (N·Σxy − Σx·Σy) / (N·Σx² − (Σx)²)
   m = ((20)(16718.5006) − (505.748847)(507.922204)) / ((20)(16655.073) − (505.748847)²) = 1.00219
4. Calculate the y-intercept:
   b = (Σy − m·Σx) / N
   b = (507.922204 − (1.00219)(505.748847)) / 20 = 0.05327206
5. Make our equation y = mx + b:
   y = 1.00219x + 0.05327206
The graph is shown below (I used Excel, not Python). The line of best fit is graphed along with
the points that we used to find it.
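If it helps, here is a rough sketch of steps 1-4 above in plain Python. The function name and
arguments are made up for illustration; this is not the required least_sq(), and it assumes the
x- and y-values have already been read out of the csv.

    def algebraic_fit(x_vals, y_vals):
        # Illustrative sketch of the algebraic formulas above, not the lab's least_sq().
        n = len(x_vals)
        sum_x = sum(x_vals)
        sum_y = sum(y_vals)
        sum_x2 = sum(x * x for x in x_vals)
        sum_xy = sum(x * y for x, y in zip(x_vals, y_vals))
        # Slope: m = (N*Σxy − Σx*Σy) / (N*Σx² − (Σx)²)
        m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
        # Intercept: b = (Σy − m*Σx) / N
        b = (sum_y - m * sum_x) / n
        return m, b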
So, now that you’ve seen the algebraic method, let’s see the linear algebra method!
The setup is based on this matrix equation:
   y = X·[m; b]
where [m; b] is the 2×1 column vector containing the slope and the y-intercept,
y is an n×1 matrix of y-coordinates, and
X is an n×2 matrix whose first column is the x-coordinates. The second column is all 1's, for matrix
multiplication purposes.
To find the slope (m) and the y-intercept (b), use:
   [m; b] = (XᵀX)⁻¹ Xᵀy
Let’s use the same points as last time to find the best fit line with this method.
Note: X (not x) is a matrix. The first column has all of the x's (like in the previous example),
and the second column is full of 1's; this is for the y-intercept.
The y is the same as in the last example.
The calculations are as follows:
�"� = =16655 505.749
505.75 20 >
(�"�)#$ = =
0.00025867 −0.006541
−0.006541 0.21540568>
�"� = ;
16718.5006
507.922204<
;
�
� < = (�"�)#$�"� = ;
1.0022
0.0533<
And we get the same results!
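For reference, here is a rough numpy sketch of that matrix formula. Again, this is illustrative
only: the function name is made up, it is not the required mat_least_sq(), and it assumes x and y
are already 1-D numpy arrays rather than a csv file.

    import numpy as np

    def matrix_fit(x, y):
        # Illustrative sketch of [m; b] = (X^T X)^-1 X^T y, not the lab's mat_least_sq().
        X = np.column_stack((x, np.ones(len(x))))    # n x 2: x's, then a column of 1's
        coeffs = np.linalg.inv(X.T @ X) @ (X.T @ y)  # 2x1 result: slope, then intercept
        m, b = coeffs
        return m, b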
Task:
1. Take a close look at the lin_reg.py file. There are four empty functions:
least_sq(file_name), mat_least_sq(file_name),
predict(file_name, x), and plot_reg(file_name, using_matrix). Read
through all of their descriptions carefully. Remember, you will lose points if you do
not follow the instructions. We are using a grading script.
Summary of function tasks
least_sq(file_name):
Given the csv file_name, find the slope and y-intercept of the data using algebraic least
squares (the first linear regression presented). You need to return the slope and y-intercept
IN THAT ORDER. Round the slope and y-intercept to four decimal places.
mat_least_sq(file_name):
Given the csv file_name, find the slope and y-intercept of the data using the linear
algebra (matrix) least squares method (the second linear regression presented). You
need to return the slope and y-intercept IN THAT ORDER. Round the slope and
y-intercept to four decimal places.
predict(file_name, x):
Given the csv file_name and an input value x, predict what the output would be using
the equation that is derived from mat_least_sq(). This means that you should be
calling mat_least_sq() in this function. Round the predicted output to four decimal
places before returning the value.
plot_reg(file_name, using_matrix):
Given the csv file_name and an indicator of which linear regression method to use
using_matrix, output a graph of the data points and the line of best fit.
• If using_matrix=False, then you should be plotting your results from
least_sq. You should be using red for everything in the graph with X markers for
the data points.
• If using_matrix=True, then you should be plotting your results from
mat_least_sq. You can use any color but the default blue and red. You can use any
data point marker except for the default dot and X.
plot_reg() should not return anything. Your graphs should also contain the
following:
• Labeled x axis
• Labeled y axis
• Graph Title
• Legend (see example for details)
Some important notes:
• For consistency's sake, do not round until the very end; that is, do not round
anything until you return your answers.
• Hint: to plot the best fit line, find the smallest and largest x-coordinate. Plug these
x-coordinates into the linear equation and plot them (see the sketch after these notes).
• If you want to create extra functions/methods to assist you, feel free to do so.
However, we will only be testing the four functions that are originally in the file.
• If you use any library’s linear regression or least squares method function, you will
get an automatic zero. You must implement this on your own!
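Here is a rough matplotlib sketch of the hint about plotting the best fit line. It is
illustrative only: the function name, labels, and title are placeholders, and you still need
to follow the exact color/marker rules described above for each value of using_matrix.

    import matplotlib.pyplot as plt

    def sketch_plot(x, y, m, b):
        # Evaluate the line only at the smallest and largest x so a single
        # straight segment spans all of the data points.
        x_ends = [min(x), max(x)]
        y_ends = [m * xi + b for xi in x_ends]
        plt.plot(x, y, "rx", label="data points")        # red X markers for the data
        plt.plot(x_ends, y_ends, "r-", label="best fit line")
        plt.xlabel("x")
        plt.ylabel("y")
        plt.title("Least squares best fit")
        plt.legend()
        plt.show()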
2. Your job is to implement all four of these functions so that they pass all test cases. We
provide one csv file for you to test on (data.csv), but we will be using other data
sets and csv files to check if your work is correct.
3. By running the test case provided (data.csv), you should get the following
results:
Note: your “matrix using least squares” graph may have different colors and
markers from mine.
In NO CASE should your graphs have the dot marker or the blue color shown
above!
4. If you feel confident in your program so far, run it after changing the test
case's csv_file from "data.csv" to "data2.csv".
5. Take screenshots of the two graphs you obtain (one from using algebraic least
squares and the other from matrix least squares). Put these two screenshots in a pdf
or word file. You will be submitting this with your py and txt files.
6. After completing these functions, comment out the test cases (or delete them) or
else the grading script will pick them up and mark your program as incorrect. Ensure
that you have commented out or deleted ALL print statements. You risk losing
points if your file prints anything.
7. Convert your lin_reg.py file to a .txt file. Submit your lin_reg.py file and
your .txt file AND YOUR PDF on BeachBoard. Do NOT submit them in a compressed
folder. IN TOTAL, YOU SHOULD BE SUBMITTING THREE FILES!
Some helpful functions

Function name                      What it does
round(x, y)                        Rounds the value x to y decimal places.
                                   Example: round(1.23456, 3) => 1.235
matrix_name.T                      Transposes the matrix matrix_name.
np.ones(num)                       Creates a vector of num ones.
                                   Example: np.ones(3) => [1. 1. 1.]
np.column_stack((col1, col2))      Stacks two 1-D numpy arrays as columns to make a
                                   2-D numpy array.
                                   Example: if x = [1, 2, 3] and b = [1, 1, 1], then
                                   np.column_stack((x, b)) =>
                                   [[1 1]
                                    [2 1]
                                    [3 1]]
np.linalg.inv(mat_name)            Finds the inverse of the matrix mat_name.
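To see how these fit together, here is a tiny scratch-pad example (the numbers are made up, and
this is not meant to go in your lin_reg.py):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0])        # made-up x-coordinates
    ones = np.ones(3)                    # => array([1., 1., 1.])
    X = np.column_stack((x, ones))       # 3x2 matrix: x's in column 1, 1's in column 2
    Xt = X.T                             # transpose: 2x3
    inv = np.linalg.inv(Xt @ X)          # inverse of the 2x2 product
    value = round(1.23456, 3)            # => 1.235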
Grading rubric:
To achieve any points, your submission must have the following. Anything missing from
this list will result in an automatic zero. NO EXCEPTIONS!
• Submit everything: py file, txt file, and pdf file
• Program has no errors (infinite loops, syntax errors, logical errors, etc.) that
terminate the program
Please note that if you change the function headers or if you do not return the proper
outputs according to the function requirements, you risk losing all points for those test
cases.
Points    Requirement
5         Submission is correct. All three files are part of the submission (py file,
          txt file, and pdf file) - all or nothing
4         Graphs from the pdf file (testing data2.csv) are correct - 2 points each
16        Implemented least_sq correctly (four other cases not including data.csv
          and data2.csv)
16        Implemented mat_least_sq correctly (four other cases not including
          data.csv and data2.csv)
8         Implemented predict correctly (four other cases not including data.csv
          and data2.csv)
8         Implemented plot_reg correctly. Remember that least_sq and
          mat_least_sq should be called here. (four other cases not including
          data.csv and data2.csv)
8         Graphs have proper x-axis labels, y-axis labels, titles, and legends
          (1 point each)
5         Passes the original test case (test cases in the python file have been
          commented out too) - all or nothing
TOTAL: 70