Assignment 5: Anomaly Detection
Objective
In this assignment, you will apply unsupervised learning to address a practical problem. For simplicity, we use intrusion detection as an example, but the same principles apply in many other fields, such as fraud detection and medical care. After completing this assignment, you should be able to answer the following questions:

How to standardize numerical features?
How to transform categorical features into numerical features?
How to derive anomalies from clustering results?
How to tune parameters for unsupervised learning?
The data can be downloaded from A5-data.zip.

Overview
In practice, when you are facing an ML problem, the process of solving it basically consists of two phases: model development and model serving.

In model development, your job is to figure out which algorithms, features, and parameters should be chosen based on historical data. This is often an iterative, offline process.

Once you have developed a satisfactory model, you will need to use it to serve new requests and make predictions. This is often an online process, so you have to think about how to make predictions as fast as possible and how to efficiently update the model when new data arrive.

In this assignment, you are going to focus on the model development stage. Model serving is also a very important topic. I hope you can learn it by yourself or through your final project. Below are a few good references:

Deploy a Model in Amazon SageMaker
TensorFlow Serving
Deploy models with Azure Machine Learning
Serving of ML models in Kubeflow
MLflow Model Serving on Databricks
Model Development
Suppose you want to develop a model that can detect anomalous connections to your company's server. The server log contains all the information about historical connections; your nice colleague has already transformed the raw log into a collection of feature vectors, where each feature vector characterizes a connection in 40 dimensions, e.g., the number of failed login attempts and the length (number of seconds) of the connection. Here is one example feature vector:

[udp,SF,0,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,240,0.94,0.01,0.00,0.00,0.00,0.00,0.00,0.00]
Your task is to take these feature vectors as input and develop an unsupervised-learning model to detect anomalous connections. In the lecture, we went through this process. In this assignment, you will implement three functions: cat2Num, scaleNum, and detect, in Tasks A-C, respectively.

# anomaly_detection.py
import pandas as pd


class AnomalyDetection:

    def scaleNum(self, df, indices):
        """
            Write your code!
        """


    def cat2Num(self, df, indices):
        """
            Write your code!
        """


    def detect(self, df, k, t):
        """
            Write your code!
        """



if __name__ == "__main__":
    df = pd.read_csv('logs-features-sample.csv').set_index('id')
    ad = AnomalyDetection()

    df1 = ad.cat2Num(df, [0,1])
    print(df1)

    df2 = ad.scaleNum(df1, [6])
    print(df2)

    df3 = ad.detect(df2, 8, 0.97)
    print(df3)
A toy dataset for testing
To test your program, you can read a toy dataset:

data = [(0, ["http", "udt", 4]),
        (1, ["http", "udf", 5]),
        (2, ["http", "tcp", 5]),
        (3, ["ftp", "icmp", 1]),
        (4, ["http", "tcp", 4])]

df = pd.DataFrame(data=data, columns=["id", "features"])
After running the script, your program should output the following:

df1 = ad.cat2Num(df, [0,1])

+---+---------------------+
|id |features             |
+---+---------------------+
|0  |[1, 0, 1, 0, 0, 0, 4]|
|1  |[1, 0, 0, 1, 0, 0, 5]|
|2  |[1, 0, 0, 0, 1, 0, 5]|
|3  |[0, 1, 0, 0, 0, 1, 1]|
|4  |[1, 0, 0, 0, 1, 0, 4]|
+---+---------------------+
df2 = ad.scaleNum(df1, [6])

+---+---------------------------------------+
|id |features                               |
+---+---------------------------------------+
|0  |[1, 0, 1, 0, 0, 0, 0.12171612389003701]|
|1  |[1, 0, 0, 1, 0, 0, 0.7302967433402214] |
|2  |[1, 0, 0, 0, 1, 0, 0.7302967433402214] |
|3  |[0, 1, 0, 0, 0, 1, -1.704025734460517] |
|4  |[1, 0, 0, 0, 1, 0, 0.12171612389003701]|
+---+---------------------------------------+
df3 = ad.detect(df2, 2, 0.9)

+---+---------------------------------------+-----+
|id |features                               |score|
+---+---------------------------------------+-----+
|3  |[0, 1, 0, 0, 0, 1, -1.704025734460517] |1.0  |
+---+---------------------------------------+-----+
Task A. Categorical Features --> Numerical Features
As you can see from the feature vector above, the first two dimensions of each feature vector are categorical features. For example, the first dimension can take one of two values: “http” or “ftp”. You can represent these categorical features using one-hot encoding, e.g., [1,0] for “http” and [0,1] for “ftp”.

In Task A, your job is to implement the cat2Num function.

def cat2Num(self, df, indices):
        """ 
            Input: $df represents a DataFrame with two columns: "id" and "features"
                   $indices represents which dimensions in $features are categorical features, 
                    e.g., indices = [0, 1] denotes that the first two dimensions are categorical features.

            Output: Return a new DataFrame that updates the "features" column with one-hot encoding. 

        """
Note: You have to implement one-hot encoding by yourself rather than use an existing implementation in a library.
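
If you are stuck, here is a minimal sketch of one possible cat2Num, not the reference solution. It assumes the "features" column holds Python lists and that the one-hot slots for each categorical dimension follow the order in which the categories first appear, which is what reproduces the toy output above:

def cat2Num(self, df, indices):
    # Sketch only: assumes "features" holds lists and one-hot slots are
    # ordered by first appearance of each category.
    df = df.copy()
    rows = df["features"].tolist()
    # Distinct values per categorical dimension, in first-appearance order.
    categories = {i: list(dict.fromkeys(row[i] for row in rows))
                  for i in indices}
    encoded_rows = []
    for row in rows:
        encoded = []
        for i, value in enumerate(row):
            if i in categories:
                # One-hot: 1 in this value's slot, 0 everywhere else.
                encoded += [1 if value == c else 0 for c in categories[i]]
            else:
                encoded.append(value)
        encoded_rows.append(encoded)
    df["features"] = encoded_rows
    return df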

Task B. Standardize Numerical Features
K-Means uses the distances between points to do clustering, and we don't want those distances to be dominated by the features that happen to be on a larger scale. To avoid this, we standardize the features that are on a larger scale.
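
Concretely, standardization replaces each value x of a feature with (x − mean) / std, where the mean and standard deviation are computed per feature over all rows. Judging from the expected toy output above, std here is the sample standard deviation (divided by n − 1, not n).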

In Task B, your job is to implement the scaleNum function, which standardizes specified features by removing the mean and scaling to unit variance.

def scaleNum(self, df, indices):
        """ 
            Input: $df represents a DataFrame with two columns: "id" and "features"
                   $indices represents which dimensions in $features that need to be standardized

            Output: Return a new DataFrame that updates the "features" column with the specified features standardized.

        """
Note: You have to implement StandardScaler by yourself rather than use an existing implementation in a library.
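
Again, a minimal sketch of one possible scaleNum, not the reference solution; it uses the sample standard deviation to match the toy output above:

def scaleNum(self, df, indices):
    # Sketch only: standardize each specified dimension to zero mean and
    # unit variance, using the sample standard deviation (divide by n - 1).
    df = df.copy()
    rows = [list(row) for row in df["features"]]
    n = len(rows)
    for i in indices:
        col = [row[i] for row in rows]
        mean = sum(col) / n
        std = (sum((v - mean) ** 2 for v in col) / (n - 1)) ** 0.5
        for row in rows:
            row[i] = (row[i] - mean) / std
    df["features"] = rows
    return df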

Task C. Detect Anomalies
As you may remember, the intuition behind our anomaly detection approach is that clusters with a small number of data points correspond to attacks or anomalies. We use this intuition to derive a confidence score from the clustering model's output. The confidence score reflects how strongly the clustering model believes a data point is an attack. Let x be a data point describing a network connection. We score x as an anomaly using:

score(x) = (N_max − N_x) / (N_max − N_min)

where N_max and N_min are the sizes of the largest and smallest clusters, respectively, and N_x is the size of the cluster that x is assigned to. If you check the equation carefully, you will notice that score(x) = 1 when x is assigned to the smallest cluster and score(x) = 0 when x is assigned to the largest cluster.
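
For example, running the toy dataset above with k = 2 yields clusters of sizes 4 and 1; the point with id 3 sits alone in the smallest cluster, so score(x) = (4 − 1) / (4 − 1) = 1.0, which is why it is the only row that survives the threshold t = 0.9 in the expected df3 output.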

In Task C, your job is to implement the detect function.

def detect(self, df, k, t):
    """ 
        Input: $df represents a DataFrame with two columns: "id" and "features"
               $k is the number of clusters for K-Means
               $t is the score threshold

        Output: Return a new DataFrame that adds the "score" column into the input $df and then
                removes the rows whose scores are smaller than $t.  
    """
Task D. Hyperparameter Tuning
Hyperparameter tuning plays an essential role in improving model quality. In this assignment, your task is to figure out how to tune hyperparameters for unsupervised learning. There are two parameters that need to be tuned:

k: the number of clusters
t: the score threshold

They are set to k = 8 and t = 0.97 in the program above. In fact, if you changed them to different values, the result could be quite different. Thus, it is important to know how to tune k and t in practice.

In Task D, imagine you are a data science manager. Please write an email (< 500 words) telling a junior data scientist (named Nick) how to use Bayesian Optimization to tune the parameters. Your email needs to explain to Nick i) how Bayesian Optimization works, in an intuitive way, and ii) why Bayesian Optimization can perform better than grid search and random search.
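
To make the ideas in your email concrete, here is a hedged sketch of what tuning k and t with Bayesian Optimization could look like. It assumes the third-party scikit-optimize package (any BO library works) and reuses ad and df2 from the script above; labeled_validation_score is a hypothetical helper, since even unsupervised tuning needs some objective, e.g., detection quality on a small labeled validation set:

from skopt import gp_minimize
from skopt.space import Integer, Real

def objective(params):
    k, t = params
    detected = ad.detect(df2, k, t)
    # Hypothetical helper: score detections against a small labeled
    # validation set (e.g., F1); gp_minimize minimizes, hence the minus.
    return -labeled_validation_score(detected)

result = gp_minimize(
    objective,
    dimensions=[Integer(2, 20, name="k"),    # assumed search range for k
                Real(0.5, 0.99, name="t")],  # assumed search range for t
    n_calls=30,      # each call updates the surrogate, proposes new (k, t)
    random_state=0,
)
print(result.x)      # best (k, t) found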

Submission
Implement the cat2Num, scaleNum, and detect functions in anomaly_detection.py. Submit your code file (anomaly_detection.py) and your email content (email.pdf) to CourSys.
