Brazilian Data Scientists: Revealing their Challenges and Practices on Machine Learning Model Development

João Lucas Correia; Juliana Alves Pereira; Rafael Mello; Alessandro Garcia; Baldoino Fonseca; Márcio Ribeiro; Rohit Gheyi; Marcos Kalinowski; Renato Cerqueira; Willy Tiengo

João Lucas Correia UFAL
Juliana Alves Pereira PUC-Rio
Rafael Mello CEFET-RJ
Alessandro Garcia PUC-Rio
Baldoino Fonseca UFAL
Márcio Ribeiro UFAL
Rohit Gheyi UFCG
Marcos Kalinowski PUC-Rio
Renato Cerqueira IBM Research Brazil
Willy Tiengo UFAL

Abstract

Data scientists often develop machine learning models to solve a variety of problems in industry and academia. To build these models, these professionals usually perform activities that are also part of the traditional software development lifecycle, such as eliciting and implementing requirements. One might argue that data scientists could rely on traditional software engineering practices to build machine learning models. However, machine learning development has distinctive characteristics that may raise challenges and create the need for new practices. The literature lacks a characterization of this knowledge from the perspective of data scientists. In this paper, we characterize challenges and practices in the engineering of machine learning models that deserve attention from the research community. To this end, we performed a qualitative study with eight data scientists from five different companies, with different levels of experience in developing machine learning models. Our findings suggest that: (i) data processing and feature engineering are the most challenging stages in the development of machine learning models; (ii) synergy between data scientists and domain experts is essential in most stages; and (iii) the development of machine learning models lacks the support of a well-engineered process.

Keywords: Software Engineering, Machine Learning, Practitioner, Empirical Study