Reimagining Studies’ Replication: A Validity-Driven Analysis of Threats in Empirical Software Engineering

Abstract


Context: Replication studies play an important role in strengthening the empirical foundations of Software Engineering (SE). However, the existing literature reveals that the reporting of Threats to Validity (TTVs) remains inconsistent or superficial, potentially undermining the reliability of replication results.

Objective: The goal of this study is to analyze how replication studies in SE address the TTVs reported in the original studies they replicate.

Method: We conducted a Systematic Literature Review (SLR) that yielded 83 replication studies published between 2022 and 2024. We analyzed the presence and specificity of TTVs across four validity dimensions (construct, internal, external, and conclusion), considering different research methods and types of replication.

Results: Our analysis shows that replication studies in Empirical Software Engineering (ESE) tend to report threats to validity more frequently and in greater detail than the original studies, particularly with regard to external and internal validity. Nevertheless, threats to conclusion and construct validity remain underreported. We observed that controlled experiments generally address the different types of TTVs more comprehensively, whereas surveys and case studies provide more limited coverage. With respect to types of replication, close and differentiated replications predominate, while conceptual and internal replications remain underexplored in the field.

Conclusion: Although attention to the identification of TTVs in replication studies is growing, reporting remains uneven across validity dimensions and study types. More structured and diverse replication strategies are needed, along with better guidelines to support comprehensive TTV reporting and to enhance the rigor and methodological value of replication efforts in ESE.
Keywords: Replication, Software Engineering, Open Science, TTVs, SLR

Published
2025-09-22

AZEVEDO, Ivanildo; VASCONCELOS, Ana Paula; TEIXEIRA, Eudis; SOARES, Sergio. Reimagining Studies’ Replication: A Validity-Driven Analysis of Threats in Empirical Software Engineering. In: BRAZILIAN SYMPOSIUM ON SOFTWARE ENGINEERING (SBES), 39., 2025, Recife/PE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 734-740. ISSN 2833-0633. DOI: https://doi.org/10.5753/sbes.2025.11270.