Understanding Thresholds of Software Features for Defect Prediction

Geanderson Santos; Adriano Veloso; Eduardo Figueiredo

Geanderson Santos UFMG
Adriano Veloso UFMG
Eduardo Figueiredo UFMG

Abstract

Software defect prediction is a research area at the intersection of software engineering and machine learning. The literature has proposed numerous machine learning models that predict software defects from software data, such as commits and code metrics. However, these models are more valuable when their predictions can be understood. Otherwise, software developers cannot reason about why a model made a given prediction, which raises questions about the model’s applicability in software projects. Because explainable machine learning for defect prediction remains a recent research topic, it leaves room for exploration. In this paper, we present a preliminary analysis of an extensive dataset for predicting software defects. The dataset includes 47,618 classes from 53 open-source projects and covers 66 software features related to numerous aspects of the code. We contribute an explanation of how each selected software feature favors the prediction of software defects in Java projects. Our initial results suggest that developers should keep the values of some specific software features small to avoid software defects. We hope our approach can guide further discussion about explainable machine learning for defect prediction and its impact on software development.

Keywords: explainable machine learning, software features for defect prediction, defect prediction