Hello! In this post, we will use machine learning to predict whether an event is an attack or not.

Model Creation

We will use a dataset that is readily available on Kaggle (P.S.: the About section documents the dataset’s contents well).

The dataset contains the following columns:

session_id,network_packet_size,protocol_type,login_attempts,session_duration,encryption_used,ip_reputation_score,failed_logins,browser_type,unusual_time_access,attack_detected

Since the dataset contains labelled target data, i.e., the attack_detected column, we proceed with supervised learning.

We import the following libraries:

Figure 1: Imports
  1. Pandas: To read the CSV file and organize the data.
  2. LabelEncoder: To convert textual data into numbers that the model can work with, e.g., Chrome => 0 for the browser_type column.
  3. train_test_split: To split the data into training and testing parts.
  4. xgboost: A gradient-boosted decision tree library that learns patterns from the data and makes predictions.
  5. classification_report and confusion_matrix: To evaluate how accurate the predictions are.

The data can now be loaded and displayed:

Figure 2: Read data
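A minimal sketch of this loading step is shown here. To keep it runnable without the Kaggle file, it reads two made-up rows from an in-memory string; the row values are invented for illustration, and with the real dataset you would pass the path of the downloaded CSV to pd.read_csv instead.

```python
import io

import pandas as pd

# Two made-up rows in the dataset's column layout (values are illustrative only).
SAMPLE_CSV = """session_id,network_packet_size,protocol_type,login_attempts,session_duration,encryption_used,ip_reputation_score,failed_logins,browser_type,unusual_time_access,attack_detected
SID_1,500,TCP,2,120.5,AES,0.10,0,Chrome,0,0
SID_2,900,UDP,8,30.0,DES,0.90,4,Unknown,1,1
"""

# With the real file this would be: df = pd.read_csv("<path to the downloaded CSV>")
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(df.head())   # first rows of the table
print(df.shape)    # (number of rows, number of columns)
```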

We drop the session_id column since it has no significance for learning. We then convert the textual columns (protocol_type, encryption_used, and browser_type) into integers so that the algorithm can process them.

Figure 3: Filter and categorize
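The filtering and encoding step could be sketched as follows, using a small made-up frame in place of the real dataset. Keeping one LabelEncoder per column in a dictionary (the name encoders is hypothetical) and casting with astype(str) so that missing values do not raise are both assumptions of this sketch, not necessarily what the original code does.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows; only the columns relevant to this step are included.
df = pd.DataFrame({
    "session_id": ["SID_1", "SID_2", "SID_3"],
    "protocol_type": ["TCP", "UDP", "TCP"],
    "encryption_used": ["AES", "DES", "AES"],
    "browser_type": ["Chrome", "Unknown", "Chrome"],
    "attack_detected": [0, 1, 0],
})

# The identifier column carries no signal for learning, so drop it.
df = df.drop(columns=["session_id"])

# One encoder per textual column, kept so new events can be encoded the same way.
encoders = {}
for column in ["protocol_type", "encryption_used", "browser_type"]:
    encoders[column] = LabelEncoder()
    # astype(str) is a defensive assumption: missing values become the string "nan".
    df[column] = encoders[column].fit_transform(df[column].astype(str))

print(df)
print(list(encoders["browser_type"].classes_))  # classes in the order of their integer codes
```

LabelEncoder assigns codes in sorted order of the class names, so in this toy frame Chrome maps to 0 and Unknown to 1.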

We split the data into training and testing sets using train_test_split, then create an XGBoost classifier. Finally, we start the learning process with the features (X_train) and the correct answers (y_train).

Figure 4: Train and create model

To calculate accuracy, we use the test data (X_test) to make predictions (y_pred) with the trained model and then compare them with the known labels (y_test) from the dataset.

Figure 5: Calculate accuracy
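The comparison of y_pred against y_test could be sketched as below. The two label lists are hand-written toy values, not results from the actual model, and the class names passed as target_names are an assumption based on the labels used later in this post.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Toy labels for illustration: 0 = No Attack, 1 = Attack.
y_test = [0, 0, 0, 0, 1, 1, 1, 1]   # known answers
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]   # what a model might have predicted

# Per-class precision, recall, and F1, plus overall accuracy.
print(classification_report(y_test, y_pred, target_names=["No Attack", "Attack"]))

# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)
```

In the printed matrix, the top-left and bottom-right cells count the correct predictions, while the other two cells count false alarms and missed attacks.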

The confusion matrix below helps us understand the accuracy better.

Figure 6: Matrix
  1. The model was good at categorizing benign events, misclassifying only two of them.
  2. On the other hand, the model correctly categorized 637 events as attacks (they were actual attacks) but failed to detect another 216 attack events, so it caught roughly 75% of all attacks.
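The counts above can be turned into precision and recall for the attack class with a few lines of arithmetic. This sketch assumes the two misclassified benign events are false positives (benign events flagged as attacks); the variable names are illustrative.

```python
# Counts read from the confusion matrix discussed above.
true_positives = 637    # attacks correctly flagged as attacks
false_negatives = 216   # attacks the model missed
false_positives = 2     # benign events flagged as attacks (assumed reading)

# Precision: of everything flagged as an attack, how much really was one?
precision = true_positives / (true_positives + false_positives)

# Recall: of all real attacks, how many did the model catch?
recall = true_positives / (true_positives + false_negatives)

# F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

High precision with noticeably lower recall means the model rarely raises false alarms but lets about a quarter of the attacks through.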

Model Testing

Let’s test the model on new events:

Sample 1

Consider the following sample input event:

Figure 7: Sample 1

The model makes the following prediction:

Figure 8: Sample 1 Prediction

Since the input feature values were comparatively low (benign, with reference to the dataset), the event was predicted as No Attack.

Sample 2

We create another sample event that resembles an attack event:

Figure 9: Sample 2

The model makes the following prediction:

Figure 10: Sample 2 Prediction

Input features such as browser_type being Unknown and a comparatively high ip_reputation_score led to the Attack prediction.

Conclusion

The model’s overall accuracy is good, although it still misses a share of the attacks. To improve it, we can iteratively grow the dataset with correctly categorized events and retrain the model on the new data.

Source code: Predicting Cyber Attacks

Thank you. We will meet next time to make security better. Until then, Sayonara!