Prediction Recall
Using the approach described in the previous section, we performed defect prediction and measured its recall (Table \ref{tab:pred_results}) for all four experimental set-ups, classifiers and samples. The best prediction results (highest recall values) were obtained for the imbalanced class sample, provided that the feature selection step was applied. At this point, we were able to answer RQ1: What is the highest level of prediction recall achievable by DePress tool in basic configuration, using an industrial project’s data? Among all results, the highest recall for the highest F-measure values in our experiment (\(Rec\)) was achieved by the prediction based on the Naive Bayes algorithm:
\begin{equation}
\label{highRec}Rec=0.783
\end{equation}
Prediction-based costs simulation
For the purpose of cost simulation in this scenario, where defect prediction is introduced to the project using the DePress framework, we assumed that:
The total number of discoverable defects in release 4.0.0 (\ref{htotal}) is constant;
The average fixing cost per defect (Table \ref{tab:costs}) also applies to the considered scenario;
Information on the location of “high risk” software modules, with recall \(Rec\), will be available in the first phase of the project;
The ratio (\ref{hrest}) is preserved.
Given the highest recall value achieved for release 4.0.0 by the prediction models (\ref{highRec}) and the total number of defects discovered in that release (\ref{htotal}), the proposed strategy (\ref{strategy}) implies that the number of software issues that can be resolved by allocating the best quality assurance practices to the first (development) phase of the project is:
\begin{equation}
\label{h1prim}H_{1}^{\prime}=0.8\times 0.783\times 837\approx 524
\end{equation}
The number of defects expected to be found in the later phases of the project (\ref{hrest}) is then:
\begin{equation}
\label{hrestprim}H_{2+3}^{\prime}=837-524=313
\end{equation}
Since we assumed that the ratio (\ref{hrest}) is preserved, the numbers of defects expected to be found in the project’s second and third phases are, respectively:
\begin{equation}
\label{h2prim}H_{2}^{\prime}=313\times 0.6\approx 188
\end{equation}
\begin{equation}
\label{h3prim}H_{3}^{\prime}=313\times 0.4\approx 125
\end{equation}
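The arithmetic in equations (\ref{h1prim}) to (\ref{h3prim}) can be reproduced with a short script. The following is a minimal sketch, not part of the DePress tooling: the function name `simulate_defect_split` and all parameter names are hypothetical labels chosen for illustration, the numeric inputs are copied from the equations above, and rounding to the nearest integer is an assumption that happens to match the reported values.

```python
def simulate_defect_split(total_defects, recall, strategy_factor,
                          phase2_ratio, phase3_ratio):
    """Split the total defect count across project phases.

    All names here are illustrative; the values passed in below come
    directly from the equations in the text.
    """
    # Defects expected to be handled in the first (development) phase.
    h1 = round(strategy_factor * recall * total_defects)
    # Defects left for the later phases.
    h_rest = total_defects - h1
    # Split the remainder between the second and third phases.
    h2 = round(h_rest * phase2_ratio)
    h3 = round(h_rest * phase3_ratio)
    return {"H1": h1, "H2+3": h_rest, "H2": h2, "H3": h3}


result = simulate_defect_split(
    total_defects=837,    # total defects in release 4.0.0
    recall=0.783,         # highest recall (Naive Bayes)
    strategy_factor=0.8,  # factor 0.8 from the H_1' equation
    phase2_ratio=0.6,     # share of remaining defects in phase 2
    phase3_ratio=0.4,     # share of remaining defects in phase 3
)
print(result)  # {'H1': 524, 'H2+3': 313, 'H2': 188, 'H3': 125}
```

Because the rounded values for the second and third phases sum to 313, the split stays consistent with (\ref{hrestprim}); for other inputs, independent rounding could leave a difference of one defect.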
Using the above values, we simulated quality assurance costs under the assumption that the machine learning mechanism is able to identify the “high risk” 20\% of software modules with the measured recall (\ref{highRec}), and that the best quality assurance efforts are allocated to the development phase to avoid the calculated number of defects (\ref{h1prim}). The results of that simulation are presented in Table \ref{tab:scenar2}.