$30
Lab 7: Cosine and Euclidean Distance
Follow ALL
instructions otherwise you will lose points. In this lab, you will be classifying data using cosine
distance, Euclidean distance, and Manhattan distance.
Required Knowledge:
You have already learned about the cosine formula:
���� = � ∙ �
|�||�|
You can find the angle created by two vectors by using this formula.
You have also learned how to find the distance between two vectors:
� = C(� − �) ∙ (� − �)
Background:
The idea is to classify data based on these distance metrics. For example, if you have two
vectors that have a relatively small angle between them, they could be similar.
Figure 1: using cosine similarity
Another example: two vectors can be measured using traditional distance to see how far
apart the two vectors are. If they are far apart (shown in green), then they could be
unrelated. If the vectors are close, then they are more likely to be similar.
Figure 2: using Euclidean distance metric
Another distance metric is the Manhattan distance. This is a little different from Euclidean
distance because it is not a straight-line distance. Manhattan distance is calculated by
finding the difference in each dimension and adding it all together. The larger the
Manhattan distance is, the less likely it is that the two vectors are similar. Similarly, the
smaller the Manhattan distance is, the more likely it is that the two vectors are similar.
Figure 3: using Manhattan distance metric
Walkthrough:
You are given a csv file of the famous iris dataset. I have also attached a copy of the dataset
on BeachBoard. Please note that there is no header in the csv file. Based on this data, you
are going to use the three metrics (cosine similarity, Euclidean similarity, and Manhattan
similarity) to predict which iris type a provided test vector should be classified as.
To do this, you need an anchor or a reference vector that will represent each class. You will
need to find the average vector of all vectors in the dataset for each iris class. These
averages will represent your “center” for all vectors. The center that has the smallest
distance metric from the test vector will be the best match when classifying the iris.
You will need to find the average of all iris-setosa rows, all iris-versicolor rows, and all irisvirginica rows. This will need to be returned as a dataframe. See image below.
Now that you have this, you can try to classify your data using various data metrics. Let’s
look at the first test vector:
test_vec1 = [5.1,3.4,1.5,0.2]
The goal is to correctly classify what kind of iris this is. Is this an iris-versicolor, an irissetosa, or an iris-virginica.
The first similarity metric is cosine similarity:
���� = � ∙ �
|�||�|
We can find the angle formed between test_vec1 and each center-vector. The centervector that forms the smallest angle is the most likely iris class.
Iris-versicolor Iris-setosa Iris-virginica
b =
[5.936,2.770,4.260,
1.326]
���� = ����_���1 ∙ �
|����_���1||�|
� = 22.145°
b =
[5.006,3.418,1.464,
0.244]
���� = ����_���1 ∙ �
|����_���1||�|
� = 0.768°
b =
[6.588,2.974,5.552,
2.026]
���� = ����_���1 ∙ �
|����_���1||�|
� = 27.169°
Since the iris-setosa center creates the smallest angle, we can assume that test_vec1 is
an iris-setosa
The second similarity metric is Euclidean distance similarity:
� = C(� − �) ∙ (� − �)
We can find the Euclidean distance between test_vec1 and each center-vector. The
center-vector that is the closest to test_vec1 most likely iris class.
Iris-versicolor Iris-setosa Iris-virginica
b =
[5.936,2.770,4.260,
1.326]
� = #(����_���1 − �) ∙ (����_���1 − �)
� = 3.159
b =
[5.006,3.418,1.464,
0.244]
� = #(����_���1 − �) ∙ (����_���1 − �)
� = 0.111
b =
[6.588,2.974,5.552,
2.026]
� = #(����_���1 − �) ∙ (����_���1 − �)
� = 4.706
Since the iris-setosa center is the closest to test_vec1 using Euclidean distance, we can
assume that test_vec1 is an iris-setosa.
The third similarity metric is Manhattan distance similarity:
� = j|�[�] − �[�]|
!
"
We can find the Euclidean distance between test_vec1 and each center-vector. The
center-vector that is the closest to test_vec1 most likely iris class.
Iris-versicolor Iris-setosa Iris-virginica
b =
[5.936,2.770,4.260,
1.326]
� = j|����_���1[�] − �[�]|
#
"
� = 5.352
b =
[5.006,3.418,1.464,
0.244]
� = j|����_���1[�] − �[�]|
#
"
� = 0.192
b =
[6.588,2.974,5.552,
2.026]
� = j|����_���1[�] − �[�]|
#
"
� = 7.792
Since the iris-setosa center is the closest to test_vec1 using Manhattan distance, we can
assume that test_vec1 is an iris-setosa.
Task:
1. Take a close look at the metric_similarity.py file. There are four functions
that you need to fill in: find_class_averages() and
most_similar_cosine() and most_similar_euclid() and
most_similar_manhattan(). Read through all of their descriptions carefully.
find_class_averages():Returns a pandas dataframe with no headers
where each row represents the average values of the class. The last column should
indicate the class. Round each value to three decimal places.
most_similar_cosine(): Find the class that best matches the test input
using cosine similarity . Whichever vector in the class list has the smallest angle
from the test vector is the class that you want to return.
most_similar_euclid(): Find the class that best matches the test input
using Euclid’s distance. Whichever vector in the class list has the smallest distance
from the test vector is the class that you want to return.
most_similar_manhattan(): Find the class that best matches the test input
using Manhattan distance. Whichever vector in the class list has the smallest
distance from the test vector is the class that you want to return.
Remember, you will lose points if you do not follow the instructions. We are using a
grading script. Some important notes:
• Though you are using iris.csv for your tests, you CANNOT assume that
we will be using the same file for other test cases. We can use a file with a
different number of columns and/or rows. The classes may also be different.
However, you can assume that each row of the csv file is a data vector for a
class. You may also assume that the last column by the class (as a str). The
remaining columns will be the data (numeric values). Note that the csv file
does not have headers
• Do NOT use library functions that will compute the distance metrics for you!
Do NOT import other libraries such as scipy or sklearn. You will get an
automatic zero! If you are unsure about a function, please ask me.
2. Your job is to implement most_similar_cosine() and
most_similar_euclid() and most_similar_manhattan()so that it passes
any test case. There are nine sample test cases provided for you, but these are not
the only cases that we will test. We will be testing other test cases in the same way
the test cases are presented.
3. After completing these functions, comment out the test cases (or delete them) or
else the grading script will pick it up and mark your program as incorrect.
4. Convert your metric_similarity.py file to a .txt file. Submit your
metric_similarity.py file and your .txt file on BeachBoard. Do NOT submit
it in compressed folder.
Some helpful functions (feel free to google these functions to get more details)
Function name What it does
df_name.loc[condition] Returns all rows that fulfills condition. This
results in a series.
Sub_df = df.loc[df[4] == ‘hello’] =>
Sub_df is a sub-dataframe that contains all rows
from df where column 4 contains ‘hello’
df_name.mean() Returns the average of all rows in the dataframe
df_name.round(num) Rounds all values in df_name to num decimal
places
df_name.append(series_name,
ignore_index=True)
Appends series_name onto df_name
df_name.insert(index,
col_name, values)
Inserts a new column consisting of values with a
column name of col_name at index
np.linalg.norm(arr) Calculates the norm/length of arr
math.acos(val) Calculates arccos (���) or ���$%(���). Note that this
returns a value in radians and not degrees!
math.sqrt(val) Calculates √���
np.abs(arr) Takes absolute value of array
np.abs([-1, 2, -3]) -> [1, 2, 3]
Grading rubric:
To achieve any points, your submission must have the following. Anything missing from
this list will result in an automatic zero. NO EXCEPTIONS!
• Submit everything: py file, txt file
• Program has no errors (infinite loops, syntax errors, logical errors, etc.) that
terminates the program
Please note that if you change the function headers or if you do not return the proper
outputs according to the function requirements, you risk losing all points for those test
cases.
Points Requirement
10 Implemented find_class_averages() correctly
10 Implemented most_similar_cosine() correctly
10 Implemented most_similar_euclid() correctly
10 Implemented most_similar_manhattan() correctly
TOTAL: 40
** Note that there are no points for passing the original test cases. Because this is an extra credit
assignment, your points will come purely from hidden test cases.