Is your code harmful too? Understanding harmful code through transfer learning
Abstract
Code smells are indicators of poor design and implementation decisions that can harm the quality of software. Detecting these smells is therefore crucial to preventing such issues. Some studies aim to understand the impact of code smells on software quality, while others propose rule-based or machine learning-based approaches to identify them. Previous research has used machine learning techniques to label and analyze code snippets that significantly impair software quality. These snippets are classified as Clean, Smelly, Buggy, or Harmful Code. Harmful Code refers to Smelly code segments that have one or more reported bugs, whether fixed or not. Consequently, the presence of Harmful Code increases the risk of introducing new defects and/or design issues during the remediation process. Although prior work has produced useful results for harmful code detection, none has used transfer learning to train a model on harmful snippets in one programming language and then identify similar harmfulness in another programming language. We investigate this question using 5 smell types, 258,035 versions of 23 open-source projects, 8,181 bugs, and 11,506 code smells. The findings reveal promising transferability of knowledge between Java and C# across various code smell types, while C++ proved more challenging. Our study also found that a sample size of 32 yielded favorable outcomes for most types of harmful code, underscoring the efficiency of transfer learning even with limited data.
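To make the four-way labeling concrete, the following is a minimal sketch of how a snippet might be assigned one of the categories. Only the Harmful rule (a smelly snippet with at least one reported bug, fixed or not) is stated in the abstract; the rules for Clean, Smelly, and Buggy, along with the names `Label`, `label_snippet`, `smell_count`, and `bug_count`, are assumptions made for illustration and are not taken from the study's tooling.

```python
from enum import Enum


class Label(Enum):
    CLEAN = "Clean"
    SMELLY = "Smelly"
    BUGGY = "Buggy"
    HARMFUL = "Harmful"


def label_snippet(smell_count: int, bug_count: int) -> Label:
    """Assign a category from the number of detected smells and reported bugs.

    bug_count includes every reported bug, whether fixed or not.
    Only the HARMFUL rule comes from the abstract; the other three
    branches are assumed for this illustration.
    """
    if smell_count < 0 or bug_count < 0:
        raise ValueError("counts must be non-negative")
    has_smell = smell_count > 0
    has_bug = bug_count > 0
    if has_smell and has_bug:
        return Label.HARMFUL  # smelly code with one or more reported bugs
    if has_smell:
        return Label.SMELLY   # assumed: smells but no reported bugs
    if has_bug:
        return Label.BUGGY    # assumed: reported bugs but no smells
    return Label.CLEAN        # assumed: neither smells nor bugs


if __name__ == "__main__":
    # Hypothetical counts for four snippets.
    for smells, bugs in [(0, 0), (2, 0), (0, 1), (1, 3)]:
        print(smells, bugs, label_snippet(smells, bugs).value)
```

Under these assumptions the categories are mutually exclusive, so each snippet version receives exactly one label.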