Hello! In this post, we will use machine learning to predict whether an event is an attack or not.
Model Creation
We will be using a dataset readily available on Kaggle (P.S. the dataset's About section documents its contents well).
The dataset contains the following columns:
session_id,network_packet_size,protocol_type,login_attempts,session_duration,encryption_used,ip_reputation_score,failed_logins,browser_type,unusual_time_access,attack_detected
Since the dataset includes a labelled target column, attack_detected, we proceed with supervised learning.
We import the following libraries:
- Pandas: To read CSV and organize data.
- LabelEncoder: To convert textual data to numbers for the model to understand, e.g., Chrome => 0 for the column browser_type.
- train_test_split: To split data into training and testing parts.
- xgboost: A gradient-boosted decision tree library used to find patterns in the data and make predictions.
- classification_report and confusion_matrix: To evaluate the predictions (precision, recall, and the counts of correct and incorrect classifications).
The data can now be loaded and shown,
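As a rough sketch of this step, the snippet below reads a CSV with the same header as the dataset and prints its shape and first rows. The two data rows are made-up placeholders, not records from the Kaggle file, and the in-memory buffer stands in for the downloaded CSV.

```python
import io

import pandas as pd

# Placeholder rows for illustration only; they are NOT taken from the Kaggle file.
csv_text = """session_id,network_packet_size,protocol_type,login_attempts,session_duration,encryption_used,ip_reputation_score,failed_logins,browser_type,unusual_time_access,attack_detected
SID_A,500,TCP,3,120.5,AES,0.20,0,Chrome,0,0
SID_B,900,UDP,7,30.0,DES,0.85,4,Unknown,1,1
"""

# With the real dataset, pass the path of the downloaded CSV instead of a buffer.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first rows, useful for checking the column types
```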
We drop the session_id column since it carries no significance for learning, then convert the textual columns (protocol_type, encryption_used, and browser_type) into integers so the algorithm can process them.
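The preprocessing described above might look roughly like the sketch below. The tiny DataFrame is hypothetical and only carries a subset of the columns. Keeping one encoder per column in an `encoders` dictionary is an assumption made here, so that the same mapping can be reused later on new events.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical miniature frame; the real data comes from the Kaggle CSV.
df = pd.DataFrame({
    "session_id": ["SID_A", "SID_B", "SID_C"],
    "protocol_type": ["TCP", "UDP", "TCP"],
    "encryption_used": ["AES", "DES", "AES"],
    "browser_type": ["Chrome", "Unknown", "Firefox"],
    "failed_logins": [0, 4, 1],
    "attack_detected": [0, 1, 0],
})

# session_id is only an identifier, so it is removed before training.
df = df.drop(columns=["session_id"])

# Fit one LabelEncoder per textual column and keep it for later reuse.
encoders = {}
for col in ["protocol_type", "encryption_used", "browser_type"]:
    enc = LabelEncoder()
    # astype(str) turns any missing value into the string "nan" so it gets its own code.
    df[col] = enc.fit_transform(df[col].astype(str))
    encoders[col] = enc

print(df)
# LabelEncoder assigns codes in sorted order of the distinct values.
print(list(encoders["browser_type"].classes_))
```

LabelEncoder numbers the categories alphabetically, which is why Chrome ends up as 0 in this sample.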
We split the data into training and testing sets using train_test_split, then create an XGBoost classifier. Finally, we train the model on the features (X_train) and their correct answers (y_train).
To measure accuracy, we ask the trained model to predict labels (y_pred) for the test data (X_test) and compare them with the true labels (y_test) from the dataset.
The diagram below helps us understand the accuracy in more detail.
- The model was good at categorizing benign events: only two of them were misclassified.
- On the other hand, the model correctly categorized 637 attack events as attacks but failed to detect another 216 attacks.
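The counts in the two bullets above translate into precision and recall for the attack class. The sketch below does that arithmetic. Treating the two misclassified benign events as false positives is an interpretation of the first bullet, not something the post states outright.

```python
# Counts read from the results above.
true_positives = 637   # attacks correctly flagged as attacks
false_negatives = 216  # attacks the model failed to detect
false_positives = 2    # benign events flagged as attacks (assumed reading)

# Recall: share of real attacks that the model caught.
recall = true_positives / (true_positives + false_negatives)
# Precision: share of attack predictions that were real attacks.
precision = true_positives / (true_positives + false_positives)
# F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"recall:    {recall:.3f}")
print(f"precision: {precision:.3f}")
print(f"f1:        {f1:.3f}")
```

Under that reading, recall is about 0.747 and precision about 0.997: the model rarely raises a false alarm, but it misses roughly one attack in four. Overall accuracy cannot be derived here because the number of correctly classified benign events is not given in the text.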
Model Testing
Let’s test the model on new events,
Sample 1
For an input sample event as below,
The model makes the following prediction,
Since the input feature values were comparatively low, in line with benign events in the dataset, the model predicted No Attack.
Sample 2
We create another sample event with similarities to an event of an attack,
The model makes the following prediction,
Input features such as browser_type being Unknown and a comparatively high ip_reputation_score led to the prediction Attack.
Conclusion
The model's overall accuracy is good. To improve it further, we can grow the dataset with correctly categorized events and retrain the model on the new data, repeating the process as more events are collected.
Source code: Predicting Cyber Attacks
Thank you. We'll meet next time to make security better. Until then, Sayonara!