Survey-Based Insights into the Replication Crisis and the 3R in Software Engineering
Abstract
Context: Efforts to improve reproducibility and research credibility have gained relevance in multiple fields, including Software Engineering (SE), where the 3R practices (Repeatability, Reproducibility, and Replicability) are essential to ensuring the reliability of empirical studies. Despite growing interest in Open Science, concerns about a Replication Crisis persist. Objectives: To assess SE researchers’ perceptions of the Replication Crisis and of 3R practices, to identify good practices, barriers, and facilitators of reproducible research, and to evaluate the community’s acceptance of the ACM’s standardized 3R definitions. Method: We conducted a survey adapted from Baker [5], targeting authors of SE studies related to replication. From a list of 1,061 researchers, we received 101 responses. The questionnaire combined Likert-scale and open-ended questions, and responses were analyzed using descriptive statistics and Reflexive Thematic Analysis. Results: Most respondents acknowledged the importance of 3R practices. Although, on average, 84.5% agreed with the ACM definitions, participants raised concerns about their clarity and applicability, especially to qualitative research. A total of 74.3% recognized the existence of a Replication Crisis in SE. The key challenges reported include the lack of protocols, selective reporting, data unavailability, and pressure to publish. Positive actions included the use of containers, version control, artifact sharing, and Open Science practices. However, participants noted that cultural and institutional incentives for reproducibility remain limited. Conclusion: Although SE researchers support the principles of the 3R practices and recognize the ongoing challenges, uncertainty persists about the scope of the crisis and how to address it. This study highlights the need for more precise terminology, better reporting standards, and greater institutional support to promote reproducibility, transparency, and research integrity in SE.
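To make the Method concrete, the short Python sketch below (not the authors’ actual analysis script) illustrates one way the per-item Likert agreement and the cross-item average reported above could be computed with descriptive statistics; the 5-point scale, the item names, and the response values are illustrative assumptions.

from statistics import mean

# Hypothetical 5-point Likert responses (1 = strongly disagree ... 5 = strongly agree)
# for the three ACM 3R definition items; the data here are made up for illustration.
responses = {
    "repeatability_definition": [5, 4, 4, 2, 5, 3, 4],
    "reproducibility_definition": [4, 4, 5, 5, 3, 4, 2],
    "replicability_definition": [5, 3, 4, 4, 4, 5, 1],
}

def pct_agree(scores, threshold=4):
    # Share of respondents who chose "agree" (4) or "strongly agree" (5).
    return 100 * sum(s >= threshold for s in scores) / len(scores)

per_item = {item: pct_agree(scores) for item, scores in responses.items()}
for item, pct in per_item.items():
    print(f"{item}: {pct:.1f}% agreement")
print(f"Average agreement across the 3R definitions: {mean(per_item.values()):.1f}%")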
Keywords:
Replication Crisis, Repeatability, Reproducibility, Replicability, 3R, Software Engineering, Open Science
References
ACM. 2020. Artifact Review and Badging – Current. [link]
ACM. 2020. Artifact Review and Badging – Version 1.0 (not current). [link]
Carlos E. Anchundia and Efraín R. Fonseca. 2020. Resources for Reproducibility of Experiments in Empirical Software Engineering: Topics Derived From a Secondary Study. IEEE Access 8 (2020). DOI: 10.1109/ACCESS.2020.2964587
Benjamin Antunes and David R.C. Hill. 2024. Reproducibility, Replicability and Repeatability: A survey of reproducible research with a focus on high performance computing. Computer Science Review 53 (2024), 100655. DOI: 10.1016/j.cosrev.2024.100655
Monya Baker. 2016. 1,500 scientists lift the lid on reproducibility. Nature 533 (May 2016), 452–454. DOI: 10.1038/533452a
Maria Teresa Baldassarre, Jeffrey Carver, Oscar Dieste, and Natalia Juristo. 2014. Replication types: towards a shared taxonomy. In Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering (London, England, United Kingdom) (EASE ’14). Association for Computing Machinery, New York, NY, USA, Article 18, 4 pages. DOI: 10.1145/2601248.2601299
Timo Balz and Fabio Rocca. 2020. Reproducibility and Replicability in SAR Remote Sensing. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 13 (2020), 3834–3843. DOI: 10.1109/JSTARS.2020.3005912
Lorena A. Barba. 2018. Terminologies for Reproducible Research. arXiv:1802.03311 [cs.DL] [link]
C. Glenn Begley and Lee M. Ellis. 2012. Drug development: Raise standards for preclinical cancer research. Nature 483, 7391 (2012), 531–533. DOI: 10.1038/483531a
Roberta M. M. Bezerra, Fabio Q. B. da Silva, Anderson M. Santana, Cleyton V. C. Magalhaes, and Ronnie E. S. Santos. 2015. Replication of Empirical Studies in Software Engineering: An Update of a Systematic Mapping Study. In 2015 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). 1–4. DOI: 10.1109/ESEM.2015.7321213
Alex Borges, Waldemar Ferreira, Emanoel Barreiros, Adauto Almeida, Liliane Fonseca, Eudis Teixeira, Diogo Silva, Aline Alencar, and Sergio Soares. 2015. Support mechanisms to conduct empirical studies in software engineering: a systematic mapping study. In Proceedings of the 19th International Conference on Evaluation and Assessment in Software Engineering (Nanjing, China) (EASE ’15). Association for Computing Machinery, New York, NY, USA, Article 22, 14 pages. DOI: 10.1145/2745802.2745823
Virginia Braun and Victoria Clarke. 2019. Reflecting on reflexive thematic analysis. Qualitative Research in Sport, Exercise and Health 11, 4 (2019), 589–597. DOI: 10.1080/2159676X.2019.1628806
Virginia Braun and Victoria Clarke. 2006. Using thematic analysis in psychology. Qualitative Research in Psychology 3 (01 2006), 77–101. DOI: 10.1191/1478088706qp063oa
Kelly D. Cobey, Sanam Ebrahimzadeh, Matthew J. Page, Robert T. Thibault, Phi-Yen Nguyen, Farah Abu-Dalfa, and David Moher. 2024. Biomedical researchers’ perspectives on the reproducibility of research. PLOS Biology 22, 11 (11 2024), 1–15. DOI: 10.1371/journal.pbio.3002870
Andy Cockburn, Pierre Dragicevic, Lonni Besançon, and Carl Gutwin. 2020. Threats of a replication crisis in empirical computer science. Commun. ACM 63, 8 (2020). DOI: 10.1145/3360311
K. Bretonnel Cohen, Jingbo Xia, Pierre Zweigenbaum, Tiffany Callahan, Orin Hargraves, Foster Goss, Nancy Ide, Aurélie Névéol, Cyril Grouin, and Lawrence E. Hunter. 2018. Three Dimensions of Reproducibility in Natural Language Processing. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, and Takenobu Tokunaga (Eds.). European Language Resources Association (ELRA), Miyazaki, Japan. [link]
Daniel Amador Dos Santos, Eduardo Santana de Almeida, and Iftekhar Ahmed. 2022. Investigating replication challenges through multiple replications of an experiment. Information and Software Technology 147 (2022), 106870. DOI: 10.1016/j.infsof.2022.106870
Larissa Falcao, Waldemar Ferreira, Alex Borges, Vilmar Nepomuceno, Sergio Soares, and Maria Teresa Baldassarre. 2015. An Analysis of Software Engineering Experiments Using Human Subjects. In 2015 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). 1–4. DOI: 10.1109/ESEM.2015.7321185
Steven N. Goodman, Daniele Fanelli, and John P. A. Ioannidis. 2016. What does research reproducibility mean? Science Translational Medicine 8, 341 (2016), 341ps12. DOI: 10.1126/scitranslmed.aaf5027
Ben Hermann, Stefan Winter, and Janet Siegmund. 2020. Community expectations for research artifacts and evaluation processes. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2020). Association for Computing Machinery, New York, NY, USA. DOI: 10.1145/3368089.3409767
Matthew Hutson. 2018. Artificial intelligence faces reproducibility crisis. Science 359, 6377 (2018), 725–726. DOI: 10.1126/science.359.6377.725
John Ioannidis. 2005. Why Most Published Research Findings Are False. PLoS Medicine 2, 8 (2005), e124. DOI: 10.1371/journal.pmed.0020124
Cleyton V.C. de Magalhães and Fabio Q.B. da Silva. 2013. Towards a Taxonomy of Replications in Empirical Software Engineering Research: A Research Proposal. In 2013 3rd International Workshop on Replication in Empirical Software Engineering Research. 50–55. DOI: 10.1109/RESER.2013.10
Jennifer Murphy, Cristian Mesquida, and Joe Warne. 2023. A Survey on the Attitudes Towards and Perception of Reproducibility and Replicability in Sports and Exercise Science. Communications in Kinesiology 1, 5 (May 2023). DOI: 10.51224/cik.2023.53
Brian A. Nosek and Timothy M. Errington. 2020. What is replication? PLOS Biology 18, 3 (03 2020), 1–8. DOI: 10.1371/journal.pbio.3000691
Open Science Collaboration. 2015. Estimating the reproducibility of psychological science. Science 349, 6251 (2015). DOI: 10.1126/science.aac4716
Roger D. Peng. 2011. Reproducible Research in Computational Science. Science 334, 6060 (2011), 1226–1227. DOI: 10.1126/science.1213847
Hans E. Plesser. 2018. Reproducibility vs. Replicability: A Brief History of a Confused Terminology. Frontiers in Neuroinformatics 11 (2018). DOI: 10.3389/fninf.2017.00076
Klaus Schmid, Sascha El-Sharkawy, and Christian Kröher. 2019. Improving Software Engineering Research Through Experimentation Workbenches. DOI: 10.1007/978-3-030-30985-5_6
M. Shepperd. 2016. Replicated results are more trustworthy. In Perspectives on Data Science for Software Engineering, Tim Menzies, Laurie Williams, and Thomas Zimmermann (Eds.). Morgan Kaufmann, Boston. DOI: 10.1016/B978-0-12-804206-9.00052-0
Fabio Silva, Marcos Suassuna, César França, Alicia Grubb, Tatiana Gouveia, Cleviton Monteiro, and Igor Santos. 2012. Replication of empirical studies in software engineering research: A systematic mapping study. Empirical Software Engineering 19 (09 2012). DOI: 10.1007/s10664-012-9227-7
D.I.K. Sjøberg, J.E. Hannay, O. Hansen, V.B. Kampenes, A. Karahasanović, N.-K. Liborg, and A.C. Rekdal. 2005. A survey of controlled experiments in software engineering. IEEE Transactions on Software Engineering 31, 9 (2005), 733–753. DOI: 10.1109/TSE.2005.97
Eudis Teixeira, Liliane Fonseca, and Sergio Soares. 2018. Threats to validity in controlled experiments in software engineering: what the experts say and why this is relevant. In Proceedings of the XXXII Brazilian Symposium on Software Engineering (Sao Carlos, Brazil) (SBES ’18). Association for Computing Machinery, New York, NY, USA, 52–61. DOI: 10.1145/3266237.3266264
Chat Wacharamanotham, Lukas Eisenring, Steve Haroz, and Florian Echtler. 2020. Transparency of CHI Research Artifacts: Results of a Self-Reported Survey. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (CHI ’20). Association for Computing Machinery, New York, NY, USA. DOI: 10.1145/3313831.3376448
Published
2025-09-22
How to Cite
AZEVEDO, Ivanildo; VASCONCELOS, Ana Paula; TEIXEIRA, Eudis; SOARES, Sergio. Survey-Based Insights into the Replication Crisis and the 3R in Software Engineering. In: BRAZILIAN SYMPOSIUM ON SOFTWARE ENGINEERING (SBES), 39., 2025, Recife/PE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 405-415. ISSN 2833-0633. DOI: https://doi.org/10.5753/sbes.2025.9967.
