On the Relative Value of Imbalanced Learning for Code Smell Detection
  • Xiao Yu,
  • Fuyang Li,
  • Kuan Zou,
  • Jacky Keung,
  • Shuo Feng,
  • Yan Xiao
Xiao Yu
Wuhan University of Technology School of Computer Science and Technology

Corresponding Author: [email protected]

Fuyang Li
Wuhan University of Technology School of Computer Science and Technology
Kuan Zou
Wuhan University of Technology School of Computer Science and Technology
Jacky Keung
City University of Hong Kong Department of Computer Science
Shuo Feng
Zhengzhou University
Yan Xiao
National University of Singapore School of Computing

Abstract

Machine learning-based code smell detection has been demonstrated to be a valuable approach for improving software quality and helping developers identify problematic patterns in code. However, previous research has shown that the code smell data sets commonly used to train these models are heavily imbalanced. While some recent studies have explored the use of imbalanced learning techniques for code smell detection, they evaluated only a limited number of techniques, so their conclusions about the most effective methods may be biased and inconclusive. To thoroughly evaluate the effect of imbalanced learning techniques on machine learning-based code smell detection, we examine 31 imbalanced learning techniques with seven classifiers to build code smell detection models on four code smell data sets. We employ four evaluation metrics to assess detection performance, together with the Wilcoxon signed-rank test and Cliff’s δ. The results show that (1) not all imbalanced learning techniques significantly improve detection performance, but deep forest significantly outperforms the other techniques on all code smell data sets; (2) SMOTE (Synthetic Minority Over-sampling TEchnique) is not the most effective technique for resampling code smell data sets; and (3) the best-performing imbalanced learning techniques and the top-3 data resampling techniques incur little time cost for code smell detection. Based on these results, we provide some practical guidelines. First, researchers and practitioners should select appropriate imbalanced learning techniques (e.g., deep forest) to ameliorate the class imbalance problem; applying imbalanced learning techniques blindly, by contrast, could be harmful. Second, data resampling techniques that perform better than SMOTE should be selected to preprocess the code smell data sets.
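For readers unfamiliar with Cliff’s δ, the effect size named in the abstract, the following is a minimal sketch of its standard definition: the proportion of pairs in which a value from the first sample exceeds a value from the second, minus the proportion in which it is smaller. The function name `cliffs_delta` and the sample scores are illustrative assumptions; they are not taken from the paper, and the authors' own implementation may differ.

```python
from itertools import product


def cliffs_delta(xs, ys):
    """Cliff's delta between two samples.

    Computed over all (x, y) pairs as
    (#pairs with x > y - #pairs with x < y) / (len(xs) * len(ys)).
    The result lies in [-1, 1]; 0 means the samples overlap completely.
    """
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for x, y in product(xs, ys) if x > y)
    less = sum(1 for x, y in product(xs, ys) if x < y)
    return (greater - less) / (len(xs) * len(ys))


if __name__ == "__main__":
    # Hypothetical scores, made up for illustration only (not results
    # from the paper): one list per technique, one value per run.
    with_technique = [0.71, 0.74, 0.69, 0.78]
    baseline = [0.65, 0.70, 0.68, 0.72]
    print(cliffs_delta(with_technique, baseline))
```

A value near 1 or -1 indicates that one sample almost always dominates the other, while a value near 0 indicates a negligible difference, which is why the measure is often reported alongside a significance test such as the Wilcoxon signed-rank test.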
10 Jan 2023 Submitted to Software: Practice and Experience
10 Jan 2023 Submission Checks Completed
10 Jan 2023 Assigned to Editor
30 Jan 2023 Review(s) Completed, Editorial Evaluation Pending
01 Feb 2023 Reviewer(s) Assigned
01 May 2023 Editorial Decision: Revise Major
23 May 2023 1st Revision Received
29 May 2023 Submission Checks Completed
29 May 2023 Assigned to Editor
29 May 2023 Review(s) Completed, Editorial Evaluation Pending
29 May 2023 Reviewer(s) Assigned
13 Jun 2023 Editorial Decision: Accept