Model Construction
Training and Dataset Evaluation
Two factors related to the dataset used greatly affect the quality of defect prediction:
Size of the dataset is important because, in machine learning, the effectiveness of the training process generally grows with the amount of available data: the larger the dataset, the more effective the learning mechanism tends to be.
Number of defects is directly related to the possible occurrence of a class imbalance problem, which arises when the total number of data instances of one class (in this case, defective modules) is far smaller than the total number of instances of the other class (non-defective modules). If relatively few defects were registered for a relatively large application in a particular release, a strong class imbalance should be expected; a minimal sketch of quantifying this imbalance follows below.
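As a minimal illustration of the imbalance check, the sketch below computes an imbalance ratio from per-release counts of defective and non-defective modules. The counts and class name used here are hypothetical placeholders, not values measured for the Texas project.

\begin{verbatim}
// Minimal sketch: quantifying class imbalance for a single release.
// The module counts are hypothetical placeholders, not Texas project data.
public class ImbalanceCheck {
    public static void main(String[] args) {
        int defective = 30;       // assumed count of defective modules (minority class)
        int nonDefective = 970;   // assumed count of non-defective modules (majority class)

        // Imbalance ratio: majority class size divided by minority class size.
        double ratio = (double) Math.max(defective, nonDefective)
                     / Math.min(defective, nonDefective);

        System.out.printf("defective=%d, non-defective=%d, imbalance ratio=%.1f:1%n",
                defective, nonDefective, ratio);
    }
}
\end{verbatim}

A ratio far above 1:1 (here roughly 32:1) indicates the imbalance described above.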
Considering the above factors, we analyzed each available release for its size and the number of defects registered. As the size determinant, we selected the number of separate code modules (Java classes, not to be confused with data classes in the machine-learning sense). To obtain the number of modules per release, code metrics had to be collected, which must be done while the projects are built (in this case with the Maven tool). To minimize the impact on the Texas project team's work, we copied the source code and built each release, project by project, locally within the Eclipse Integrated Development Environment (IDE). To collect the metrics during the build, the Eclipse Metrics 2 tool was used. Eclipse Metrics 2 is a free software tool released under the CPL license that works as an Eclipse IDE plugin \cite{Metrics2Site}; it was created as a continuation of Eclipse Metrics, the original metrics-collection plugin for Eclipse \cite{MetricsSite}. Eclipse Metrics 2 permits the collection of different kinds of code metrics (Table \ref{tab:eclipse_metrics}) and exports them to an XML file. The metrics data can then be read by a dedicated DePress node (also called Eclipse Metrics) and converted into DePress's internal data format.
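For reference, the hedged sketch below shows one way the number of modules per release could be recovered from such an XML export outside of DePress. The file name and the element and attribute names used here (Value, per, name) are assumptions about the export layout, not the documented Eclipse Metrics 2 schema, and would need to be adjusted to the actual file.

\begin{verbatim}
// Hedged sketch: counting per-class (per-module) entries in a metrics XML export.
// Element and attribute names below are assumptions, not the documented schema.
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import java.util.HashSet;
import java.util.Set;

public class ModuleCounter {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse("metrics-export.xml");   // hypothetical path to the exported file

        // Assumption: per-type measurements appear as <Value per="type" name="..."/>.
        // Collect the distinct type names to approximate the number of Java classes.
        Set<String> modules = new HashSet<>();
        NodeList values = doc.getElementsByTagName("Value");
        for (int i = 0; i < values.getLength(); i++) {
            Element value = (Element) values.item(i);
            if ("type".equals(value.getAttribute("per"))) {
                modules.add(value.getAttribute("name"));
            }
        }
        System.out.println("Modules (Java classes) measured: " + modules.size());
    }
}
\end{verbatim}

In the study itself this step is performed by the dedicated DePress node; the sketch only illustrates what that conversion has to extract from the export.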