Code metrics collected for each code module were classified in accordance with the applied approach: “high risk” modules were labeled “1” and those that did not belong to this group were labeled “0”. The ratio of modules labeled “1” to those labeled “0” was 1:15.52, which indicates a class imbalance problem.
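To make the degree of imbalance concrete, the reported ratio can be converted into the share of “high risk” modules in the dataset. The following is a minimal sketch; the function name `minority_share` is introduced here for illustration only and is not part of the described tooling.

```python
def minority_share(majority_per_minority: float) -> float:
    """Fraction of minority-class items for a 1:majority_per_minority ratio."""
    return 1.0 / (1.0 + majority_per_minority)


# Ratio reported in the text: 1 "high risk" module per 15.52 other modules.
share = minority_share(15.52)
print(f"High-risk share: {share:.2%}")
```

Under this ratio, roughly 6% of the modules carry the label “1”.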
The categorized data were divided into two equal sets by stratified sampling. One set was held out for validation of the created prediction models; the other was used to prepare samples for the three selected classifiers: Naive Bayes, Decision Tree and Probabilistic Neural Network.
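The idea of a stratified split into two equal sets can be sketched as follows. This is an illustrative, standard-library-only approximation and not the procedure actually used in the study; the function name `stratified_halves`, the seed, and the toy labels are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_halves(labels, seed=0):
    """Split item indices into two halves, preserving the class ratio in each."""
    # Group item indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    first, second = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        half = len(indices) // 2
        # Each class contributes half of its items to each set;
        # for an odd class size the extra item goes to the second set.
        first.extend(indices[:half])
        second.extend(indices[half:])
    return sorted(first), sorted(second)


# Hypothetical toy labels: 4 "high risk" modules (1) and 60 others (0).
labels = [1] * 4 + [0] * 60
validation_idx, training_idx = stratified_halves(labels)
```

Because the split is performed per class, both sets keep the same proportion of “1” and “0” labels as the full dataset, which matters when the minority class is as small as it is here.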
For each classifier, four different experimental setups were possible, thanks to the module-based architecture of the DePress tool:
- Without feature selection, with class imbalanced dataset,
- Without feature selection, with class balanced dataset,
- With feature selection, with class imbalanced dataset,
- With feature selection, with class balanced dataset.
When needed, feature selection was carried out using the KNIME functionality presented in Section \ref{subsec:predictorvariablesselection}, and class balance was achieved using the mechanism presented in Section \ref{subsec:objectivevariable}.
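The resulting experimental grid (three classifiers, each with and without feature selection and with and without class balancing) can be enumerated as in the sketch below. The dictionary keys are hypothetical names chosen for this example; they do not correspond to DePress or KNIME configuration options.

```python
from itertools import product

classifiers = ["Naive Bayes", "Decision Tree", "Probabilistic Neural Network"]

# Every combination of classifier, feature selection on/off and
# class balancing on/off: 3 * 2 * 2 = 12 experimental setups.
setups = [
    {"classifier": clf, "feature_selection": fs, "class_balancing": bal}
    for clf, fs, bal in product(classifiers, (False, True), (False, True))
]

for setup in setups:
    print(setup)
```

For each classifier, the four setups are generated in the same order as they are listed above, starting with no feature selection on the class imbalanced dataset.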