The datasets for both purposes, training and validation, were constructed by splitting the 4.0.0 data with DePress’ built-in Stratified Sampling function, so that the ratio between differently classified code parts was preserved in both resulting datasets.
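As an illustration of what a stratified split preserves, the following is a minimal plain-Python sketch, not the DePress/KNIME function itself. The function name `stratified_split`, the `defective` field, the toy data, and the 70/30 split fraction are all assumptions made for the example; the split proportion actually used is not restated here.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_of, train_fraction=0.7, seed=0):
    """Split rows into two sets while keeping the class ratio in each.

    train_fraction is a hypothetical value chosen for illustration only.
    """
    rng = random.Random(seed)

    # Group the rows by class label so each class is split separately.
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    train, validation = [], []
    for members in by_class.values():
        members = list(members)
        rng.shuffle(members)
        # Take the same fraction of every class for the training set.
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        validation.extend(members[cut:])
    return train, validation


# Toy data (hypothetical): 20 fault-prone and 80 fault-free modules.
rows = [{"module": f"m{i}", "defective": 1 if i < 20 else 0} for i in range(100)]
train, validation = stratified_split(rows, lambda r: r["defective"])
```

In this toy case both resulting sets keep the original 1:4 ratio between fault-prone and fault-free modules.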

Objective Variable And Class Imbalance Counteraction

\label{subsec:objectivevariable}
For the purpose of this research, we chose a binary (“0” or “1”) objective variable to distinguish fault-prone modules, in which at least one defect occurred (“1”), from fault-free modules, in which no defects were observed (“0”).
To counteract any possible class imbalance, we decided to randomly remove some of the majority class instances. In keeping with our initial approach, we used the basic mechanisms built into DePress/KNIME and constructed the workflow shown in Figure \ref{fig:class_imb_wrkf}. First, the dataset is split into two parts with the Row Splitter KNIME node, which assigns each row to one of two sets depending on the objective variable value (“1” or “0”). Then, the majority class is reduced by random sampling with the Row Sampling node until it contains exactly as many instances as the minority class. Finally, the two equal-sized record sets are merged by the Concatenate node into a single dataset with an equal ratio between the classes.
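The three workflow steps (split by class, randomly sample the majority class, concatenate) can be sketched in plain Python as follows. This is only an analogue of the KNIME workflow under stated assumptions, not the nodes themselves: the function name `undersample_majority`, the `defective` field, the fixed seed, and the toy data are hypothetical.

```python
import random


def undersample_majority(rows, label_of, seed=0):
    """Balance a binary dataset by randomly dropping majority class rows."""
    rng = random.Random(seed)

    # Step 1 (analogue of Row Splitter): separate rows by objective variable.
    faulty = [r for r in rows if label_of(r) == 1]
    clean = [r for r in rows if label_of(r) == 0]

    # Identify which of the two sets is the smaller (minority) one.
    minority, majority = sorted((faulty, clean), key=len)

    # Step 2 (analogue of Row Sampling): draw, without replacement,
    # as many majority rows as there are minority rows.
    reduced = rng.sample(majority, len(minority))

    # Step 3 (analogue of Concatenate): merge the two equal-sized sets.
    return minority + reduced


# Toy data (hypothetical): 15 fault-prone and 85 fault-free modules.
modules = [{"module": f"m{i}", "defective": 1 if i < 15 else 0} for i in range(100)]
balanced = undersample_majority(modules, lambda r: r["defective"])
```

Random undersampling discards information from the majority class, so the size of the balanced dataset is bounded by twice the number of minority class instances.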