UNSW-NB15 Computer Security Dataset: Analysis through Visualization
  • Zeinab Zoghi,
  • Gursel Serpen
Zeinab Zoghi
The University of Toledo

Corresponding Author: [email protected]

Gursel Serpen
The University of Toledo

Abstract

Class imbalance is a major issue in data mining: an unequal class distribution can degrade classification performance. The problem becomes more serious when the data overlap, that is, when a single data point is assigned to multiple classes. This study provides a visual analysis of the UNSW-NB15 dataset to give insight into its intricacies and into the issues that can cause data-driven models to perform poorly. A variety of visualization methods, including bar charts, 2D and 3D scatter plots, an intercluster distance map, and a parallel coordinate diagram, are used to evaluate the data for class imbalance and overlap. Several scalers and data transformation methods are applied to address the overlap, and the distance between class centroids is measured with the Mahalanobis distance metric. A larger distance between the class centroids indicates that the corresponding scaler or data transformer is more effective at dealing with the data overlap. The results reveal that the robust scaler can address the data overlap in binary classification, while the min-max scaler and the power transformer perform effectively in multi-class classification.
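The centroid-distance measure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not say which covariance matrix enters the Mahalanobis distance, so the pooled within-class covariance used here is an assumption, as are the helper names (`pooled_covariance`, `centroid_mahalanobis`, `min_max_scale`), the toy data, and the use of `np.log1p` as a stand-in for a nonlinear transformer.

```python
import numpy as np


def pooled_covariance(X, y):
    """Pooled within-class covariance of X, shape (n_samples, n_features).

    Assumption: the covariance is pooled over every class present in y.
    """
    classes = np.unique(y)
    scatter = np.zeros((X.shape[1], X.shape[1]))
    for c in classes:
        Xc = X[y == c]
        dev = Xc - Xc.mean(axis=0)  # deviations from the class centroid
        scatter += dev.T @ dev
    return scatter / (len(X) - len(classes))


def centroid_mahalanobis(X, y, class_a, class_b):
    """Mahalanobis distance between the centroids of two classes."""
    diff = X[y == class_a].mean(axis=0) - X[y == class_b].mean(axis=0)
    cov = pooled_covariance(X, y)
    # The pseudo-inverse tolerates a singular covariance (e.g. constant features).
    return float(np.sqrt(diff @ np.linalg.pinv(cov) @ diff))


def min_max_scale(X):
    """Rescale each feature to [0, 1]; constant features are left unscaled."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)


# Hypothetical toy data: two classes with four points each.
X = np.array(
    [[0, 0], [2, 0], [0, 2], [2, 2], [3, 4], [5, 4], [3, 6], [5, 6]],
    dtype=float,
)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

d_raw = centroid_mahalanobis(X, y, 0, 1)
d_scaled = centroid_mahalanobis(min_max_scale(X), y, 0, 1)
d_log = centroid_mahalanobis(np.log1p(X), y, 0, 1)  # nonlinear stand-in

print(f"raw: {d_raw:.4f}  min-max: {d_scaled:.4f}  log1p: {d_log:.4f}")
```

With the covariance re-estimated after each transformation, as in this sketch, a per-feature linear rescaling such as min-max leaves the distance unchanged, while a nonlinear transform can change it. A comparison between scalers therefore depends on how the covariance is chosen, which the abstract leaves open.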