Use of Confusion Matrix in Cyber Security
--
When we get the data, after data cleaning, pre-processing and wrangling, the first step we do is to feed it to an outstanding model and of course, get output in probabilities. But hold on! How in the hell can we measure the effectiveness of our model. Better the effectiveness, better the performance and that’s exactly what we want. And it is where the Confusion matrix comes into the limelight. Confusion Matrix is a performance measurement for machine learning classification.
What is Confusion Matrix?
The million dollar question — what, after all, is a confusion matrix?
A Confusion matrix is an N x N matrix used for evaluating the performance of a classification model, where N is the number of target classes. The matrix compares the actual target values with those predicted by the machine learning model. This gives us a holistic view of how well our classification model is performing and what kinds of errors it is making.
For a binary classification problem, we would have a 2 x 2 matrix as shown below with 4 values:
Let’s decipher the matrix:
- The target variable has two values: Positive or Negative
- The columns represent the actual values of the target variable
- The rows represent the predicted values of the target variable
But wait — what’s TP, FP, FN and TN here? That’s the crucial part of a confusion matrix. Let’s understand each term below.
Understanding True Positive, True Negative, False Positive and False Negative in a Confusion Matrix
True Positive (TP)
- The predicted value matches the actual value
- The actual value was positive and the model predicted a positive value
True Negative (TN)
- The predicted value matches the actual value
- The actual value was negative and the model predicted a negative value
False Positive (FP) — Type 1 error
- The predicted value was falsely predicted
- The actual value was negative but the model predicted a positive value
- Also known as the Type 1 error
False Negative (FN) — Type 2 error
- The predicted value was falsely predicted
- The actual value was positive but the model predicted a negative value
- Also known as the Type 2 error
Let me give you an example to better understand this. Suppose we had a classification dataset with 1000 data points. We fit a classifier on it and get the below confusion matrix:
The different values of the Confusion matrix would be as follows:
- True Positive (TP) = 560; meaning 560 positive class data points were correctly classified by the model
- True Negative (TN) = 330; meaning 330 negative class data points were correctly classified by the model
- False Positive (FP) = 60; meaning 60 negative class data points were incorrectly classified as belonging to the positive class by the model
- False Negative (FN) = 50; meaning 50 positive class data points were incorrectly classified as belonging to the negative class by the model
This turned out to be a pretty decent classifier for our dataset considering the relatively larger number of true positive and true negative values.
What is Cyber Security?
Cybersecurity is the practice of protecting systems, networks, and programs from digital attacks, here digital attacks can be stealing information or spying on another system, and much more. There are cybersecurity experts present whose work is to protect users or prevent the digital attack, here digital attack is also known as cybercrime.
Cyber Crimes are increasing day by day, here is some example of cybercrime that took place in 2021 and this will give you the idea that how important is cybersecurity :
- Australian broadcaster Channel Nine was hit by a cyberattack on 28th March 2021, which rendered the channel unable to air its Sunday news bulletin and several other shows.
- In March 2021, the London-based Harris Federation suffered a ransomware attack and was forced to “temporarily” disable the devices and email systems of all the 50 secondary and primary academies it manages. This resulted in over 37,000 students being unable to access their coursework and correspondence.
- A cybercriminal attempted to poison the water supply in Florida and managed by increasing the amount of sodium hydroxide to a potentially dangerous level.
- Acer suffered a ransomware attack and was asked to pay a ransom of $50 million, which made the record of the largest known ransom to date
By going through the cyber crimes that happened in 2021, we can understand that how important is cybersecurity.
Now we will look at a case study to summarize what we have learnt
Cyber Attack Detection Using Support Vector Machine(SVM)
Basically the cyber attack detection is a classification problem, in which we classify the normal pattern from the abnormal pattern (attack) of the system. Subset selection decision fusion method plays a key role in cyber attack detection. It has been shown that redundant and/or irrelevant features may severely affect the accuracy of learning algorithms. The SDF is very powerful and popular data mining algorithm for decision-making and classification problems. It has been using in many real life applications like medical diagnosis, radar signal classification, weather prediction, credit approval, and fraud detection etc.
KDD CUP ‘’99 Data Set Description
To check performance of the proposed algorithm for distributed cyber attack detection and classification, we can evaluate it practically using KDD’99 intrusion detection datasets. In KDD99 dataset these four attack classes (DoS, U2R,R2L, and probe) are divided into 22 different attack classes that tabulated in Table I. The 1999 KDD datasets are divided into two parts: the training dataset and the testing dataset. The testing dataset contains not only known attacks from the training data but also unknown attacks. Since 1999, KDD’99 has been the most wildly used data set for the evaluation of anomaly detection methods. This data set is prepared by Stolfo et al. and is built based on the data captured in DARPA’98 IDS evaluation program . DARPA’98 is about 4 gigabytes of compressed raw (binary) tcpdump data of 7 weeks of network traffic, which can be processed into about 5 million connection records, each with about 100 bytes. For each TCP/IP connection, 41 various quantitative (continuous data type) and qualitative (discrete data type) features were extracted among the 41 features, 34 features (numeric) and 7 features (symbolic). To analysis the different results, there are standard metrics that have been developed for evaluating network intrusion detections. Detection Rate (DR) and false alarm rate are the two most famous metrics that have already been used. DR is computed as the ratio between the number of correctly detected attacks and the total number of attacks, while false alarm (false positive) rate is computed as the ratio between the number of normal connections that is incorrectly misclassified as attacks and the total number of normal connections.
In the KDD Cup 99, the criteria used for evaluation of the participant entries is the Cost Per Test (CPT) computed using the confusion matrix and a given cost matrix. A Confusion Matrix (CM) is a square matrix in which each column corresponds to the predicted class, while rows correspond to the actual classes. An entry at row i and column j, CM (i, j), represents the number of misclassified instances that originally belong to class i, although incorrectly identified as a member of class j. The entries of the primary diagonal, CM (i, i), stand for the number of properly detected instances. Cost matrix is similarly defined, as well, and entry C (i, j) represents the cost penalty for misclassifying an instance belonging to class i into class j.
- True Positive (TP): The amount of attack detected when it is actually attack.
- True Negative (TN): The amount of normal detected when it is actually normal.
- False Positive (FP): The amount of attack detected when it is actually normal (False alarm).
- False Negative (FN): The amount of normal detected when it is actually attack.
Confusion matrix contains information actual and predicted classifications done by a classifier. The performance of cyber attack detection system is commonly evaluated using the data in a matrix.
Thank You!!