Lab 6 (pt. 2): Multivariate Linear Regression
This is an INDIVIDUAL assignment. The due date is as indicated on BeachBoard. Follow ALL
instructions; otherwise, you may lose points. In this lab, you will make predictions based on
more elaborate data. You will also analyze the accuracy of your model. You will need to use numpy
and pandas for this lab. Please note that since the labs will be graded separately, there will be
separate resubmissions for lab 6 (pt. 1) and lab 6 (pt. 2).
Review:
In the previous lab, we discussed how you can find the slope and y-intercept of univariate
data by utilizing the formula below…
$$\begin{bmatrix} m \\ b \end{bmatrix} = (A^T A)^{-1} A^T y$$
However, what would you do if the data presented to you contains more columns?
Multivariate Linear Regression:
In the previous lab, the data contained only one independent variable (x). However, it is
very realistic for data to rely on multiple independent variables. For this example, we will
use the famous real estate data set. I have attached an edited version on BeachBoard for
your convenience. Please use the data that I provided for you on BeachBoard. Otherwise,
your answers and your formatting will not match. A portion of the csv file (real estate
train.csv) is shown below
X1: Transaction date
X2: House age
X3: Distance to the nearest MRT station
X4: Number of convenience stores
X5: Latitude
X6: Longitude
Y: house price of unit area
The goal is to find a multivariate linear equation that can predict the house price of the unit
area using all of the mentioned features.

$$y = m_1 x_1 + m_2 x_2 + m_3 x_3 + m_4 x_4 + m_5 x_5 + m_6 x_6 + b$$
We simply apply the linear algebra method above, but on a larger scale. Based on the
snippet from the table, we can create our matrix equation.
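$$\begin{bmatrix} 37.9 \\ 42.2 \\ 47.3 \\ \vdots \end{bmatrix} =
\begin{bmatrix}
2012.917 & 32 & 84.87882 & 10 & 24.98298 & 121.54024 & 1 \\
2012.9178 & 19.5 & 306.5947 & 9 & 24.98034 & 121.53951 & 1 \\
2013.583 & 13.3 & 561.9845 & 5 & 24.98746 & 121.54391 & 1 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots
\end{bmatrix}
\begin{bmatrix} m_1 \\ m_2 \\ m_3 \\ m_4 \\ m_5 \\ m_6 \\ b \end{bmatrix}$$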
Looks familiar? This is still the same thing as our $y = A \begin{bmatrix} m \\ b \end{bmatrix}$ setup from before. However,
we have multiple m values instead of just one. So instead of $y = A \begin{bmatrix} m \\ b \end{bmatrix}$, our matrix equation
looks more like…

$$y = A \begin{bmatrix} m_1 \\ m_2 \\ m_3 \\ m_4 \\ m_5 \\ m_6 \\ b \end{bmatrix}$$
Thus, the approach is still the same.
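$$\begin{bmatrix} m_1 \\ m_2 \\ m_3 \\ m_4 \\ m_5 \\ m_6 \\ b \end{bmatrix} = (A^T A)^{-1} A^T y$$

where

$$A = \begin{bmatrix}
2012.917 & 32 & 84.87882 & 10 & 24.98298 & 121.54024 & 1 \\
2012.9178 & 19.5 & 306.5947 & 9 & 24.98034 & 121.53951 & 1 \\
2013.583 & 13.3 & 561.9845 & 5 & 24.98746 & 121.54391 & 1 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots
\end{bmatrix}, \quad
y = \begin{bmatrix} 37.9 \\ 42.2 \\ 47.3 \\ \vdots \end{bmatrix}$$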
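As a rough illustration of the formula above, here is a minimal sketch on a small made-up data set (the values in `X` and `true_weights` are hypothetical and do not come from the lab's csv files). It assumes that inverting a matrix with `np.linalg.inv` is permitted; it is not the required `multivar_linreg()` function.

```python
import numpy as np

# Hypothetical toy data: 5 samples, 2 independent variables.
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0],
              [4.0, 3.0],
              [5.0, 5.0]])

# Hypothetical weights [m1, m2, b] used only to generate y for this demo.
true_weights = np.array([2.0, -1.0, 0.5])
y = X @ true_weights[:2] + true_weights[2]

# Add a column of ones so the last weight plays the role of b.
A = np.hstack([X, np.ones((X.shape[0], 1))])

# weights = (A^T A)^(-1) A^T y
weights = np.linalg.inv(A.T @ A) @ A.T @ y
print(weights)
```

Because `y` was generated exactly from `true_weights`, the recovered `weights` should match them up to floating-point error.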
The result is
$$\begin{bmatrix} m_1 \\ m_2 \\ m_3 \\ m_4 \\ m_5 \\ m_6 \\ b \end{bmatrix} =
\begin{bmatrix} 4.95347 \\ -0.2696252 \\ -0.0044963 \\ 1.11479937 \\ 230.797555 \\ -13.593203 \\ -14039.67836 \end{bmatrix} \approx
\begin{bmatrix} 4.9535 \\ -0.2696 \\ -0.0045 \\ 1.1148 \\ 230.7976 \\ -13.5932 \\ -14039.6784 \end{bmatrix}
\quad \text{(rounded to 4 decimal places)}$$
Now suppose we want to predict the house price for a unit with the following data
(from the first data row in real estate test.csv):
X1: Transaction date is 2013.25
X2: House age is 26.8
X3: Distance to the nearest MRT station is 482.7581
X4: Number of convenience stores is 5
X5: Latitude is 24.97433
X6: Longitude is 121.53863
We can easily predict the house price for the unit (y) by plugging what we know into the
formula

$$y = A \begin{bmatrix} m_1 \\ m_2 \\ m_3 \\ m_4 \\ m_5 \\ m_6 \\ b \end{bmatrix}$$

$$y = \begin{bmatrix} 2013.25 & 26.8 & 482.7581 & 5 & 24.97433 & 121.53863 & 1 \end{bmatrix}
\begin{bmatrix} 4.9535 \\ -0.2696 \\ -0.0045 \\ 1.1148 \\ 230.7976 \\ -13.5932 \\ -14039.6784 \end{bmatrix}
= 41.0483\ldots \approx 41.0483 \quad \text{(rounded to four decimal places)}$$
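The multiplication above can be checked with a short sketch. It uses the rounded weights and the first test row exactly as quoted in this handout; the variable names are chosen here for illustration only.

```python
import numpy as np

# Rounded weights [m1, ..., m6, b] as quoted in this handout.
weights = np.array([4.9535, -0.2696, -0.0045, 1.1148,
                    230.7976, -13.5932, -14039.6784])

# First test row (X1..X6) followed by a 1 for the intercept term.
row = np.array([2013.25, 26.8, 482.7581, 5, 24.97433, 121.53863, 1.0])

# A row vector times the weight vector is a single dot product.
prediction = row @ weights
print(round(float(prediction), 4))
```

The printed value should agree with the 41.0483 shown above.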
If you look at the actual price in the csv file (real estate test.csv), you'll find
that the actual value is 35.5. We can calculate the absolute error by using the
following formula:

$$\text{abs err} = |\text{predicted} - \text{actual}| = |41.0483 - 35.5| = 5.5483$$
We can also calculate the relative error by using the following formula:

$$\text{rel err} = \frac{|\text{predicted} - \text{actual}|}{\text{actual}} = \frac{|41.0483 - 35.5|}{35.5} = 0.1563 \quad \text{(rounded to 4 decimal places)}$$

This means that the predicted answer is off by 15.63%.
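The same arithmetic for this single test case, as a small sketch using the two numbers quoted above:

```python
predicted = 41.0483  # prediction from the worked example
actual = 35.5        # actual price from real estate test.csv, as quoted above

abs_err = abs(predicted - actual)   # |predicted - actual|
rel_err = abs_err / actual          # absolute error divided by the actual value

print(round(abs_err, 4), round(rel_err, 4))
```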
We can better test the accuracy of the linear regression model by finding the mean absolute
error (MAE) and the mean relative error (MRE).

$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |\text{predicted}_i - \text{actual}_i|$$

$$\text{MRE} = \frac{1}{n} \sum_{i=1}^{n} \frac{|\text{predicted}_i - \text{actual}_i|}{\text{actual}_i}$$
Where n is the total number of cases and i is the iteration/case number.
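The two averages can be sketched with numpy as follows. The arrays below are hypothetical values picked for illustration (only the first pair comes from the worked example); this is not the required `MAE()` or `MRE()` function, which take `inputs` and `file_name`.

```python
import numpy as np

# Hypothetical predictions and actual values for three test cases.
predicted = np.array([41.0483, 50.0, 18.0])
actual = np.array([35.5, 40.0, 20.0])

abs_errors = np.abs(predicted - actual)   # |predicted_i - actual_i| for each case

mae = np.mean(abs_errors)                 # average absolute error
mre = np.mean(abs_errors / actual)        # average relative error

print(mae, mre)
```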
You can see a summary of the test results on the next page.
After calculating all of the absolute errors and the relative errors, we can find the MAE and
the MRE. Overall, based on the test data, our model has a 15.18% error on average.
Task:
The purpose of the second part of the lab is to create a framework that will take a csv file
with any number of columns and will create a linear regression model. You will also
analyze the accuracy of this linear regression model.
1. Take a close look at the multi_lin_reg.py file. There are four empty functions:
multivar_linreg(file_name), predict(inputs, file_name),
MAE(inputs, file_name), and MRE(inputs, file_name). Read through all of
their descriptions carefully. Remember, you will lose points if you do not follow the
instructions. We are using a grading script.
Summary of function tasks
multivar_linreg(file_name):
Given the csv file_name, find all of the weights and return these values in a numpy
array. This 1xn numpy array should contain [m1, m2, m3, … , b] in this order.
Round all values to four decimal places.
predict(inputs, file_name):
Given inputs, which is a numpy array of all of the weights [m1, m2, m3, … , b],
make predictions from the data given in file_name. The predictions will be stored in a
1xm numpy array [y1, y2, y3, …]. Each row of data from the csv should have a
prediction. Round all values to four decimal places.
MAE(inputs, file_name):
Find the mean absolute error of the linear regression model given by inputs, which is a
numpy array of all of the weights [m1, m2, m3, … , b]. The mean absolute error will
be calculated by testing the linear regression model with the data from file_name. Round
all values to four decimal places.
MRE(inputs, file_name):
Find the mean relative error of the linear regression model given by inputs, which is a
numpy array of all of the weights [m1, m2, m3, … , b]. The mean relative error will
be calculated by testing the linear regression model with the data from file_name. Round
all values to four decimal places.
Some important notes:
• Though this example uses six columns (six independent variables), other test cases
may use more or fewer columns. However, there will be at least one independent
variable.
• For consistency’s sake, do not round until the very end. That is, you should not
round anything until you return your answers.
• If you want to create extra functions/methods to assist you, feel free to do so.
However, we will only be testing the four functions that are originally in the file.
• If you use any library’s linear regression or least squares method function, you will
get an automatic zero. You must implement this on your own!
2. Your job is to implement all four of these functions so that they pass all test cases. We
provide two csv files: Real estate train.csv is for
multivar_linreg() and Real estate test.csv is for the other functions.
However, we will be using other data sets and csv files to check if your work is
correct.
3. By running the provided test cases, you should get the following results:
4. After completing these functions, comment out the test cases (or delete them), or
else the grading script will pick them up and mark your program as incorrect. Ensure
that you have commented out or deleted ALL print statements. You risk losing
points if your file prints anything.
5. Convert your multi_lin_reg.py file to a .txt file. Submit your
multi_lin_reg.py file and your .txt file on BeachBoard. Do NOT submit them in
a compressed folder.
Some helpful functions (refer to part 1 for other helpful functions):
Function name           What it does
np.round(array, num)    Rounds all elements in array to num decimal places.
                        Example: np.round([0.1234, 0.6546], 3) => [0.123, 0.655]
df_name.shape           Gets the dimensions of the data frame.
                        Example: df.shape => (num_rows, num_columns)
np.append(x, y)         Appends y to the end of x. See documentation here
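The helpers above might be combined as in the sketch below. The data frame is a hypothetical in-memory example rather than one of the lab's csv files, and treating the last column as Y is an assumption based on the column order listed earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical data frame: two independent variables and a target column.
df = pd.DataFrame({"X1": [1.0, 2.0, 3.0],
                   "X2": [4.0, 5.0, 6.0],
                   "Y": [7.0, 8.0, 9.0]})

num_rows, num_columns = df.shape           # dimensions of the data frame

# Assumption: every column except the last is an independent variable.
features = df.iloc[:, :-1].to_numpy()

# Append a column of ones (for b) to the right of the feature columns.
A = np.append(features, np.ones((num_rows, 1)), axis=1)

# Append a single 1 to one row of features.
row = np.append(features[0], 1)

print(A.shape, row)
```

Without the `axis` argument, `np.append` flattens its inputs, which is why the matrix case passes `axis=1`.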
Grading rubric:
To achieve any points, your submission must have the following. Anything missing from
this list will result in an automatic zero. NO EXCEPTIONS!
• Submit everything: py file and txt file
• Program has no errors (infinite loops, syntax errors, logical errors, etc.) that
terminate the program
Please note that if you change the function headers or if you do not return the proper
outputs according to the function requirements, you risk losing all points for those test
cases.
Points  Requirement
5       Submission is correct. Both files are part of the submission (py file and txt
        file). All or nothing.
15      Implemented multivar_linreg() correctly (three other cases, not including
        Real estate)
12      Implemented predict() correctly (three other cases, not including Real estate)
6       Implemented MAE() correctly (three other cases, not including Real estate)
6       Implemented MRE() correctly (three other cases, not including Real estate)
5       Passes original test cases (test cases in the python file have been
        commented out too). All or nothing.
TOTAL: 49