$29.99
BBM 103: Introduction to Programming Laboratory I
PROGRAMMING ASSIGNMENT 5
PERSONALITY CLASSIFICATION WITH K-NEAREST NEIGHBOR ALGORITHM
1. INTRODUCTION
The aim of this experiment is to make you get familiar machine learning problem and Python data
science libraries such as Pandas, NumPy and Matplotlib. In this experiment, you are expected to
implement K-Nearest Neighbor Algorithm to solve a real word classification problem with a data.
2. BACKGROUND
2.1. K-Nearest Neighbor Algorithm
The k-nearest neighbor algorithm, also known as KNN or k-NN, is a non-parametric, supervised
learning classifier, which uses proximity to make classifications or predictions about the grouping
of an individual data point. While it can be used for either regression or classification problems,
it is typically used as a classification algorithm, working off the assumption that similar points can
be found near one another.
For classification problems, a class label is assigned on the basis of a majority vote—i.e. the label
that is most frequently represented around a given data point is used. While this is technically
considered “plurality voting”, the term, “majority vote” is more commonly used in literature. The
distinction between these terminologies is that “majority voting” technically requires a majority
of greater than 50%, which primarily works when there are only two categories. When you have
multiple classes—e.g. four categories, you don’t necessarily need 50% of the vote to make a
conclusion about a class; you could assign a class label with a vote of greater than 25%.
Figure 1. KNN Classification Diagram
Fall 2022
BBM 103: Introduction to Programming Laboratory I
PROGRAMMING ASSIGNMENT 5 2
2.2. Compute K-Nearest Neighbor: Distance Metric
The goal of the k-nearest neighbor algorithm is to identify the nearest neighbors of a given query point, so
that we can assign a class label to that point. In order to do this, KNN has a few requirements:
2.2.1. Determine your distance metrics
In order to determine which data points are closest to a given query point, the distance between
the query point and the other data points will need to be calculated. These distance metrics help
to form decision boundaries, which partitions query points into different regions.
While there are several distance measures that we can choose from, you should use Euclidean
distance in this assignment.
Euclidean distance (p=2): This is the most commonly used distance measure, and it is limited to
real-valued vectors. Using the below formula, it measures a straight line between the query point
and the other point being measured.
𝑑(𝑥, 𝑦) = √∑(𝑦𝑖 − 𝑥𝑖)
2
𝑛
𝑖=1
2.2.2. Compute KNN: defining k
The k value in the k-NN algorithm defines how many neighbors will be checked to determine the
classification of a specific query point. For example, if k=1, the instance will be assigned to the same class
as its single nearest neighbor. Defining k can be a balancing act as different values can lead to overfitting
or underfitting. Lower values of k can have high variance, but low bias, and larger values of k may lead to
high bias and lower variance. The choice of k will largely depend on the input data as data with more
outliers or noise will likely perform better with higher values of k. Overall, it is recommended to have an
odd number for k to avoid ties in classification, and cross-validation tactics can help you choose the optimal
k for your dataset.
Fall 2022
BBM 103: Introduction to Programming Laboratory I
PROGRAMMING ASSIGNMENT 5 3
3. ASSIGNMENT
3.1. Packages
• pandas is the fundamental package for data manipulation and analysis. You will use it to analyze
and manipulate tabular data in DataFrame(s).
• numpy is the fundamental package for scientific computing with Python. You will use it to perform
a wide variety of mathematical operations on data/dataset.
• math is a library to make basic mathematical and statistical operations on DataFrame(s).
• random is a library used just for creating random integer in a given range.
In this assignment, you will implement a K-Nearest Neighbor Algorithm to classify different personality
types (16 personalities) of people. Furthermore, you will extend your implementation for the same
problem by implementing weighted K-Nearest Neighbor Algorithm.
3.2. Personality Classification Data
Fall 2022
BBM 103: Introduction to Programming Laboratory I
PROGRAMMING ASSIGNMENT 5 4
Figure 2. Personalities
The personality classification data will be downloaded from the given link.
This data contains the questions from the 16 Personality (given above) tests and the answers from
people. The answers are numerically encoded as below.
ANSWERS NUMERIC ENCODE
Fully Agree 3
Partially Agree 2
Slightly Agree 1
Neutral 0
Slightly Disagree -1
Partially Disagree -2
Fully Disagree -3
Fall 2022
BBM 103: Introduction to Programming Laboratory I
PROGRAMMING ASSIGNMENT 5 5
The personality classification data consists of 60.000 samples (rows) with discrete 16
(“Personality” attribute) ground- truth class types.
Each sample in the personality classification data contains 62 variables/attributes/features. 61 of
them are independent variables and 1 dependent variable. These independent variables except
“Response Id” represent the quantitative response with respect to their counterpart questions.
You have to remove/drop “Response Id” from the data. The dependent variable represents
“Personality” types.
3.3. Classification Performance Evaluation Metric
In order to measure the performance success of your K-Nearest Neighbor classification model, you will
compute “Accuracy”, “Precision” and “Recall” of your implemented K-Nearest Neighbor classification
model.
Accuracy: Accuracy is an performance evaluation metric for evaluating classification models. Informally, it
is the fraction of predictions the model got right.
Accuracy can be calculated in terms of positives and negatives as follows:
𝑨𝒄𝒄𝒖𝒓𝒂𝒄𝒚 =
TP + TN
TP + TN + FP + FN
where TP = True Positives, TN = True Negatives, FP = False Positives, and FN = False Negatives.
Example: Assume that we have a model that classified 100 tumors as malignant (the positive class) or
benign (the negative class). Let’s try calculating the accuracy of this model:
𝑨𝒄𝒄𝒖𝒓𝒂𝒄𝒚 =
TP + TN
TP + TN + FP + FN =
1 + 90
1 + 90 + 1 + 8
=
91
100 = 0.91
This model has an accuracy of 0.91 (91%). It means that this model has 91% correct predictions out of 100
total samples.
Fall 2022
BBM 103: Introduction to Programming Laboratory I
PROGRAMMING ASSIGNMENT 5 6
Precision: Precision is a performance evaluation metric which is measured in order to answer the following
question:
“What proportion of positive identifications was actually correct?”
Precision is defined as follows:
𝑷𝒓𝒆𝒄𝒊𝒔𝒊𝒐𝒏 =
TP
TP + FP
Example: Let’s try calculating the precision of classification model which analyzed/classified 100 tumors
𝑷𝒓𝒆𝒄𝒊𝒔𝒊𝒐𝒏 =
TP
TP + FP =
1
1 + 1
=
1
2
= 0.5
This model has a precision of 0.5. In other words, when it predicts a tumor is malignant, it is correct 50%
of the time.
Recall: Recall is a performance evaluation metric which is measured in order to answer the following
question:
What proportion of actual positives was identified correctly?
Recall is defined as follows:
𝑹𝒆𝒄𝒂𝒍𝒍 =
TP
TP + FN
Example: Let’s try calculating the recall of classification model which analyzed/classified 100 tumors
𝑹𝒆𝒄𝒂𝒍𝒍 =
TP
TP + FN =
1
1 + 8
=
1
9
= 0.11
This model has a recall of 0.11. In other words, it correctly identifies 11% of all malignant tumors.
3.4. Feature Normalization
In order to re-scale each independent variables (60 feature/attribute columns on the data) of your samples
between (0-1) range, you will use min-max normalization on them.
Min-max normalization (in other words Feature Scaling) is used to perform a linear transformation on the
original data. This technique gets all the scaled data in the range (0, 1).
The formula to achieve this is the following:
Fall 2022
BBM 103: Introduction to Programming Laboratory I
PROGRAMMING ASSIGNMENT 5 7
𝑛𝑖 =
𝑓𝑖 − min (𝑓)
max(𝑓) − min (𝑓)
3.5. Error Analysis for Classification
In this part of the assignment, you are expected to
• find a few misclassified samples and explain why you think these misclassified samples were hard
to classify.
• make general performance analysis for the concepts Effect of Neighbor Number (k), Effect of
Normalization, Effect of Algorithm, Effect of K-fold, Accuracy, Precision and Recall.
3.6. Steps to Follow for Classification
1. Download personality classification data
2. Read personality classification data. For this, you should use a Pandas DataFrame which is a 2-
dimensional data structure like a 2-dimensional array, or a table with rows and columns.
3. Since the “Response Id” column in the DataFrame has no effect on classification, you should
drop it.
4. The most important column in the DataFrame is “Personality” column which you will try to
predict. This column includes string values. String values are not acceptable for Machine Learning.
Therefore, you will encode them with integers given below.
PERSONALITY ENCODE
ESTJ 0
ENTJ 1
ESFJ 2
ENFJ 3
ISTJ 4
ISFJ 5
INTJ 6
INFJ 7
ESTP 8
ESFP 9
ENTP 10
ENFP 11
ISTP 12
ISFP 13
INTP 14
INFP 15
Fall 2022
BBM 103: Introduction to Programming Laboratory I
PROGRAMMING ASSIGNMENT 5 8
5. Transform your DataFrame to the NumPy array collection.
6. Split your data into two sets which are predictor (independent variables) and target (dependent
variable)
7. For the test samples,
• predict their classes using k-NN. (with feature normalization and without feature normalization)
• predict their classes using weighted k-NN. (with feature normalization and without feature
normalization) (BONUS)
8. Compute and report your accuracy, precision and recall of your different k-NN and weighted kNN models with different k parameters (You must experiment with this k parameters: (1,3,5,7,9))
on 5-fold cross validation. (https://machinelearningmastery.com/k-fold-cross-validation/)
9. Report your findings in” Error Analysis for Classification” section.
4. GRADING POLICY
TASK POINT
Data Preparation 5
Data Scaling 10
KNN Implementation 20
Performance Evaluation Metrics Implementation 15
Execution of KNN and Performance Evaluation Metrics with standard KNN algorithm 20
Execution of KNN and Performance Evaluation Metrics with weighted KNN algorithm
(BONUS)
10
Cross Validation 5
Coding Standard/Clean Code 10
Error Analysis for Classification 15
TOTAL (with bonus) 110
5. IMPORTANT NOTES
• The assignment must be original, individual work. Downloaded, Modified, Duplicate or very similar
source codes are all going to be considered as cheating.
• We will be using a software similarity program to identify the cases of possible plagiarism. Our
department takes the act of plagiarism very seriously. You cannot share algorithms or source code.
All work must be individual! Those caught plagiarizing (both originators and copiers) will be
sanctioned.
• You can ask your questions via Piazza and you are supposed to be aware of everything discussed
on Piazza. General discussion of the problem is allowed, but DO NOT SHARE answers, algorithms,
source codes, and reports.
Fall 2022
BBM 103: Introduction to Programming Laboratory I
PROGRAMMING ASSIGNMENT 5 9
• In the “Coding Standard/Clean Code” task of the grading policy section, it will be taken into
account whether unnecessary code fragments are used whether the abstraction level is regular,
and whether your code is understandable.
• In the “Coding Standard/Clean Code” task of the grading policy section, it will be taken into
account whether variable names or function names are used in the correct format.
• Do not miss the submission deadline.
• Don’t forget to write comments on your codes when necessary.
• Save all your work until the assignment is graded.
• You must submit your work with the file as stated below:
assignment5.ipynb
• URL for deadline: https://classroom.github.com/a/h8LH2W17
• URL for the first late day: https://classroom.github.com/a/iGog0OrA
• URL for the second late day: https://classroom.github.com/a/jIdj_2Nr
• URL for the third late day: https://classroom.github.com/a/3uZ86hxg