The KNIME framework provides built-in feature selection functionality based on the backward elimination algorithm \cite{KNIMEDocumentation}. It consists of a loop delimited by two nodes (Figure \ref{fig:feature_sel}): Backward Feature Elimination Start and Backward Feature Elimination End. Inside the loop created by these nodes, the appropriate machine learning nodes are placed, along with any additional supporting nodes if needed. In the example shown, Naive Bayes classifier nodes are used: Learner and Predictor, respectively. The Partitioning node divides the input data into training and validation sets.
The backward elimination approach used in the KNIME framework is carried out in \(\frac{n\times(n+1)}{2}-1\) iterations, where \(n\) is the total number of features (columns) in the input dataset (input table); in this case \(n=10\), the total number of types of metrics collected:
  1. In the first iteration, the loop is executed with all features (columns): the dataset is split into two sets, a model is created by the Learner node using the first set and then validated by the Predictor node using the second set;
  2. In the next nine (\(n-1\)) iterations, each input column is omitted once; model creation and validation are performed in every iteration, and the prediction results are collected;
  3. The Backward Feature Elimination End node discards the column that influenced the result the least;
  4. The process repeats until only one feature (column) is left.
Finally, the Backward Feature Elimination Filter node filters the actual dataset using the best feature combination found by the above process.
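The greedy search described above can be sketched outside KNIME in a few lines. The Python sketch below assumes a hypothetical \texttt{score} function standing in for classifier accuracy on a validation set (e.g. of a Naive Bayes model trained on the given feature subset); the toy additive weights are purely illustrative. Inside the \texttt{while} loop the score is evaluated \(n + (n-1) + \dots + 2 = \frac{n\times(n+1)}{2}-1\) times, matching the iteration count given above.

```python
def backward_elimination(features, score):
    """Greedy backward elimination, mirroring the KNIME loop: in each
    round every remaining feature is omitted once, and the feature whose
    removal hurts the score the least is discarded."""
    remaining = list(features)
    best_subset, best_score = list(remaining), score(remaining)  # full model
    while len(remaining) > 1:
        # one leave-one-out evaluation per remaining feature
        trials = [(score([f for f in remaining if f != omit]), omit)
                  for omit in remaining]
        trial_score, omitted = max(trials)  # least-influential feature
        remaining.remove(omitted)
        if trial_score >= best_score:
            best_score, best_subset = trial_score, list(remaining)
    return best_subset, best_score

# Toy score: an additive stand-in for accuracy, where feature "c"
# actively hurts and "d" contributes nothing (hypothetical weights).
weights = {"a": 0.5, "b": 0.3, "c": -0.2, "d": 0.0}
best, acc = backward_elimination(weights, lambda s: sum(weights[f] for f in s))
```

On the toy weights the search discards "c" and "d" and keeps \{"a", "b"\}. Note that \texttt{max} over the tuples breaks score ties by feature name, which is acceptable for a sketch but would need a deliberate tie-breaking rule in practice.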

\subsection{Prediction Models}

Because DePress, the tool used in our research, is based on the KNIME data mining framework, various types of fault-prone module predictors can be used for the purpose of this research. Since the prediction results are categorical (faulty or not-faulty), we decided to test classifiers often used in software defect prediction \cite{Hall2012, Moser2008, Khosh1995, Selby1988}, which are available in the basic package of KNIME:
More information about the built-in KNIME classifiers can be found in the KNIME documentation \cite{KNIMEDocumentation}.
The prediction results, i.e. modules marked as defect-prone or non-defect-prone, can be compared against the actual distribution of defect-prone modules and were used to build a confusion matrix (Table \ref{tab:conf_matrix}), a commonly used tool for performance comparison across categorical studies \cite{Hall2012}.
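As an illustration, the four cells of such a confusion matrix can be tallied directly from paired lists of actual and predicted labels. The Python sketch below is not part of the DePress/KNIME workflow; the label strings and sample data are hypothetical.

```python
def confusion_matrix(actual, predicted, positive="faulty"):
    """Counts of the four confusion-matrix cells for a binary outcome."""
    pairs = list(zip(actual, predicted))
    return {
        "TP": sum(a == positive and p == positive for a, p in pairs),
        "FN": sum(a == positive and p != positive for a, p in pairs),
        "FP": sum(a != positive and p == positive for a, p in pairs),
        "TN": sum(a != positive and p != positive for a, p in pairs),
    }

# Hypothetical module labels: actual defect data vs. classifier output.
actual    = ["faulty", "faulty", "not-faulty", "not-faulty", "faulty"]
predicted = ["faulty", "not-faulty", "not-faulty", "faulty", "faulty"]
cm = confusion_matrix(actual, predicted)
```

Standard performance measures follow directly from these counts, e.g. accuracy as \((TP+TN)/N\), where \(N\) is the total number of modules.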