Understanding and Detecting Harmful Code

Rodrigo Lima; Jairo Souza; Baldoino Fonseca; Leopoldo Teixeira; Rohit Gheyi; Márcio Ribeiro; Alessandro Garcia; Rafael de Mello

Rodrigo Lima UFPE
Jairo Souza UFPE
Baldoino Fonseca UFAL
Leopoldo Teixeira UFPE
Rohit Gheyi UFCG
Márcio Ribeiro UFAL
Alessandro Garcia PUC-Rio
Rafael de Mello CEFET-RJ

Resumo

Code smells typically indicate poor design implementation and choices that may degrade software quality. Hence, they need to be carefully detected to avoid such poor design. In this context, some studies try to understand the impact of code smells on the software quality, while others propose rules or machine learning-based techniques to detect code smells. However, none of those studies or techniques focus on analyzing code snippets that are really harmful to software quality. This paper presents a study to understand and classify code harmfulness. We analyze harmfulness in terms of CLEAN, SMELLY, BUGGY, and HARMFUL code. By HARMFUL CODE, we define a SMELLY code element having one or more bugs reported. These bugs may have been fixed or not. Thus, the incidence of HARMFUL CODE may represent a increased risk of introducing new defects and/or design problems during its fixing. We perform our study with 22 smell types, 803 versions of 13 open-source projects, 40,340 bugs and 132,219 code smells. The results show that even though we have a high number of code smells, only 0.07% of those smells are harmful. The Abstract Function Call From Constructor is the smell type more related to HARMFUL CODE. To cross-validate our results, we also perform a survey with 60 developers. Most of them (98%) consider code smells harmful to the software, and 85% of those developers believe that code smells detection tools are important. But, those developers are not concerned about selecting tools that are able to detect HARMFUL CODE. We also evaluate machine learning techniques to classify code harmfulness: they reach the effectiveness of at least 97% to classify HARMFUL CODE. While the Random Forest is effective in classifying both SMELLY and HARMFUL CODE, the Gaussian Naive Bayes is the less effective technique. Our results also suggest that both software and developers' metrics are important to classify HARMFUL CODE.

Palavras-chave: Code Smells, Software Quality, Machine Learning