Exploring Tools for Flaky Test Detection, Correction, and Mitigation: A Systematic Mapping Study

  • Pedro Anderson Costa Martins UFC
  • Victor Anthony Alves UFC
  • Iraneide Lima UFC
  • Carla Bezerra UFC
  • Ivan Machado UFBA

Abstract


Flaky tests, characterized by their non-deterministic behavior, present significant challenges in software testing: they yield inconsistent results even when executed on unchanged code. In industrial projects that widely adopt continuous integration, the impact of flaky tests becomes critical. With thousands of tests, a single flaky test can disrupt the entire build and release process, delaying software deliveries. In our study, we conducted a systematic mapping to investigate tools related to flaky tests. From a pool of 37 research papers, we identified 30 tools specifically designed for detecting, mitigating, and repairing flakiness in automated tests. Our analysis provides an overview of these tools, highlighting their objectives, techniques, and approaches. Additionally, we examine the high-level characteristics of these tools, including the causes of flakiness they address. Notably, approximately 46% of the tools focus on tackling test-order dependency issues, while a substantial majority (70%) are evaluated in the context of the Java programming language. These findings offer valuable insights for two key groups of stakeholders: (i) the software testing community, whose researchers and practitioners can leverage this knowledge to deepen their understanding of flaky tests and explore effective mitigation strategies; and (ii) tool developers, for whom the compilation of available tools offers a centralized resource for selecting appropriate solutions based on specific needs. By addressing flakiness, we aim to improve the reliability of automated testing, streamline development processes, and foster confidence in software quality.
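To make the notion of a test-order dependency concrete, the following is a minimal, hypothetical Python sketch (not drawn from any of the surveyed tools or papers) of a test that passes or fails depending solely on the order in which tests run, because two tests share mutable module-level state:

```python
# Hypothetical illustration of an order-dependent flaky test.
# test_b passes only when test_a has already run, because both
# read and write the same shared mutable state.
cache = []

def test_a():
    cache.append("seeded")      # "polluter": mutates shared state

def test_b():
    # "victim": implicitly assumes test_a ran first
    return len(cache) > 0

# Run in the "intended" order: test_b passes.
test_a()
print("a then b:", "PASS" if test_b() else "FAIL")

# Simulate a fresh run where test_b executes first: it now fails,
# even though no code under test changed.
cache.clear()
print("b alone:", "PASS" if test_b() else "FAIL")
```

Order-dependency detectors such as iDFlakies (Lam et al. 2019) surface this class of flakiness by rerunning suites under permuted test orders; the sketch above shows why a single reordering is enough to flip the outcome.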
Keywords: Flaky tests, tools, systematic mapping

References

Azeem Ahmad, Francisco Gomes de Oliveira Neto, Zhixiang Shi, Kristian Sandahl, and Ola Leifler. 2021. A Multi-factor Approach for Flaky Test Detection and Automated Root Cause Analysis. In 2021 28th Asia-Pacific Software Engineering Conference (APSEC). IEEE, 338–348.

Abdulrahman Alshammari, Christopher Morris, Michael Hilton, and Jonathan Bell. 2021. FlakeFlagger: Predicting Flakiness Without Rerunning Tests. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). 1572–1584. DOI: 10.1109/ICSE43902.2021.00140

Jonathan Bell and Gail Kaiser. 2014. Unit test virtualization with VMVM. In Proceedings of the 36th International Conference on Software Engineering. 550–561.

Jonathan Bell, Gail Kaiser, Eric Melski, and Mohan Dattatreya. 2015. Efficient dependency detection for safe Java test acceleration. In Proceedings of the 2015 10th joint meeting on foundations of software engineering. 770–781.

Jonathan Bell, Owolabi Legunsen, Michael Hilton, Lamyaa Eloussi, Tifany Yung, and Darko Marinov. 2018. DeFlaker: Automatically Detecting Flaky Tests. In 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE). 433–444. DOI: 10.1145/3180155.3180164

Matteo Biagiola, Andrea Stocco, Ali Mesbah, Filippo Ricca, and Paolo Tonella. 2019. Web test dependency detection. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 154–164.

Marcello Cordeiro, Denini Silva, Leopoldo Teixeira, Breno Miranda, and Marcelo d’Amorim. 2021. Shaker: a tool for detecting more flaky tests faster. In 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 1281–1285.

Maxime Cordy, Renaud Rwemalika, Adriano Franci, Mike Papadakis, and Mark Harman. 2022. FlakiMe: laboratory-controlled test flakiness impact assessment. In Proceedings of the 44th International Conference on Software Engineering. 982–994.

Zhen Dong, Abhishek Tiwari, Xiao Liang Yu, and Abhik Roychoudhury. 2021. Flaky test detection in Android via event order exploration. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 367–378.

Saikat Dutta, August Shi, Rutvik Choudhary, Zhekun Zhang, Aryaman Jain, and Sasa Misailovic. 2020. Detecting flaky tests in probabilistic and machine learning applications. In Proceedings of the 29th ACM SIGSOFT international symposium on software testing and analysis. 211–224.

Moritz Eck, Fabio Palomba, Marco Castelluccio, and Alberto Bacchelli. 2019. Understanding Flaky Tests: The Developer’s Perspective. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Tallinn, Estonia) (ESEC/FSE 2019). Association for Computing Machinery, New York, NY, USA, 830–840. DOI: 10.1145/3338906.3338945

Sakina Fatima, Taher A Ghaleb, and Lionel Briand. 2022. Flakify: A black-box, language model-based predictor for flaky tests. IEEE Transactions on Software Engineering (2022).

Mattia Fazzini, Alessandra Gorla, and Alessandro Orso. 2020. A framework for automated test mocking of mobile apps. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering. 1204–1208.

Alessio Gambi, Jonathan Bell, and Andreas Zeller. 2018. Practical test dependency detection. In 2018 IEEE 11th International Conference on Software Testing, Verification and Validation (ICST). IEEE, 1–11.

Martin Gruber and Gordon Fraser. 2022. A Survey on How Test Flakiness Affects Developers and What Support They Need To Address It. In 2022 IEEE Conference on Software Testing, Verification and Validation (ICST). 82–92. DOI: 10.1109/ICST53961.2022.00020

Martin Gruber, Stephan Lukasczyk, Florian Kroiß, and Gordon Fraser. 2021. An Empirical Study of Flaky Tests in Python. In 2021 14th IEEE Conference on Software Testing, Verification and Validation (ICST). 148–158. DOI: 10.1109/ICST49551.2021.00026

Alex Gyori, Ben Lambeth, August Shi, Owolabi Legunsen, and Darko Marinov. 2016. NonDex: A tool for detecting and debugging wrong assumptions on Java API specifications. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. 993–997.

Alex Gyori, August Shi, Farah Hariri, and Darko Marinov. 2015. Reliable testing: Detecting state-polluting tests to prevent test dependency. In Proceedings of the 2015 international symposium on software testing and analysis. 223–233.

Sarra Habchi, Guillaume Haben, Mike Papadakis, Maxime Cordy, and Yves Le Traon. 2022. A Qualitative Study on the Sources, Impacts, and Mitigation Strategies of Flaky Tests. In 2022 IEEE Conference on Software Testing, Verification and Validation (ICST). 244–255. DOI: 10.1109/ICST53961.2022.00034

Chen Huo and James Clause. 2014. Improving Oracle Quality by Detecting Brittle Assertions and Unused Inputs in Tests. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (Hong Kong, China) (FSE 2014). Association for Computing Machinery, New York, NY, USA, 621–631. DOI: 10.1145/2635868.2635917

Samireh Jalali and Claes Wohlin. 2012. Systematic literature studies: database searches vs. backward snowballing. In Proceedings of the ACM-IEEE international symposium on Empirical software engineering and measurement. 29–38.

Staffs Keele et al. 2007. Guidelines for performing systematic literature reviews in software engineering.

Wing Lam, Patrice Godefroid, Suman Nath, Anirudh Santhiar, and Suresh Thummalapenta. 2019. Root Causing Flaky Tests in a Large-Scale Industrial Setting. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis (Beijing, China) (ISSTA 2019). Association for Computing Machinery, New York, NY, USA, 101–111. DOI: 10.1145/3293882.3330570

Wing Lam, Kıvanç Muşlu, Hitesh Sajnani, and Suresh Thummalapenta. 2020. A Study on the Lifecycle of Flaky Tests. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering (Seoul, South Korea) (ICSE ’20). Association for Computing Machinery, New York, NY, USA, 1471–1482. DOI: 10.1145/3377811.3381749

Wing Lam, Reed Oei, August Shi, Darko Marinov, and Tao Xie. 2019. iDFlakies: A framework for detecting and partially classifying flaky tests. In 2019 12th IEEE Conference on Software Testing, Validation and Verification (ICST). IEEE, 312–322.

Wing Lam, Stefan Winter, Anjiang Wei, Tao Xie, Darko Marinov, and Jonathan Bell. 2020. A Large-Scale Longitudinal Study of Flaky Tests. Proc. ACM Program. Lang. 4, OOPSLA, Article 202 (nov 2020). DOI: 10.1145/3428270

Tanakorn Leesatapornwongsa, Xiang Ren, and Suman Nath. 2022. FlakeRepro: automated and efficient reproduction of concurrency-related flaky tests. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1509–1520.

Chengpeng Li and August Shi. 2022. Evolution-aware detection of order-dependent flaky tests. In Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis. 114–125.

Chengpeng Li, Chenguang Zhu, Wenxi Wang, and August Shi. 2022. Repairing order-dependent flaky tests via test generation. In Proceedings of the 44th International Conference on Software Engineering. 1881–1892.

Qingzhou Luo, Farah Hariri, Lamyaa Eloussi, and Darko Marinov. 2014. An Empirical Analysis of Flaky Tests. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (Hong Kong, China) (FSE 2014). Association for Computing Machinery, New York, NY, USA, 643–653. DOI: 10.1145/2635868.2635920

Jesús Morán Barbón, Cristian Augusto Alonso, Antonia Bertolino, Claudio A Riva Álvarez, Pablo Javier Tuya González, et al. 2020. FlakyLoc: flakiness localization for reliable test suites in web applications. Journal of Web Engineering, 2 (2020).

G.J. Myers, C. Sandler, and T. Badgett. 2011. The Art of Software Testing. Wiley.

Owain Parry, Gregory M. Kapfhammer, Michael Hilton, and Phil McMinn. 2021. A Survey of Flaky Tests. ACM Trans. Softw. Eng. Methodol. 31, 1, Article 17 (oct 2021), 74 pages. DOI: 10.1145/3476105

Owain Parry, Gregory M. Kapfhammer, Michael Hilton, and Phil McMinn. 2022. Evaluating Features for Machine Learning Detection of Order- and Non-Order-Dependent Flaky Tests. In 2022 IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, 93–104.

Owain Parry, Gregory M. Kapfhammer, Michael Hilton, and Phil McMinn. 2022. Surveying the developer experience of flaky tests. In Proceedings of the 44th International Conference on Software Engineering: Software Engineering in Practice (Pittsburgh, Pennsylvania) (ICSE-SEIP ’22). Association for Computing Machinery, New York, NY, USA, 253–262. DOI: 10.1145/3510457.3513037

Kai Petersen, Sairam Vakkalanka, and Ludwik Kuzniarz. 2015. Guidelines for conducting systematic mapping studies in software engineering: An update. Information and software technology 64 (2015), 1–18.

Gustavo Pinto, Breno Miranda, Supun Dissanayake, Marcelo d’Amorim, Christoph Treude, and Antonia Bertolino. 2020. What is the Vocabulary of Flaky Tests?. In Proceedings of the 17th International Conference on Mining Software Repositories (Seoul, Republic of Korea) (MSR ’20). Association for Computing Machinery, New York, NY, USA, 492–502. DOI: 10.1145/3379597.3387482

Valeria Pontillo. [n.d.]. Static test flakiness prediction. In Proceedings of the ACM/IEEE 44th International Conference on Software Engineering: Companion Proceedings. 325–327.

Yihao Qin, Shangwen Wang, Kui Liu, Bo Lin, Hongjun Wu, Li Li, Xiaoguang Mao, and Tegawendé F Bissyandé. 2022. PEELER: Learning to Effectively Predict Flakiness without Running Tests. In 2022 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 257–268.

August Shi, Alex Gyori, Owolabi Legunsen, and Darko Marinov. 2016. Detecting assumptions on deterministic implementations of non-deterministic specifications. In 2016 IEEE International Conference on Software Testing, Verification and Validation (ICST). IEEE, 80–90.

August Shi, Wing Lam, Reed Oei, Tao Xie, and Darko Marinov. 2019. iFixFlakies: A framework for automatically fixing order-dependent flaky tests. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 545–555.

Denini Silva, Leopoldo Teixeira, and Marcelo d’Amorim. 2020. Shake It! Detecting Flaky Tests Caused by Concurrency with Shaker. In 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME). 301–311. DOI: 10.1109/ICSME46990.2020.00037

Amjed Tahir, Shawn Rasheed, Jens Dietrich, Negar Hashemi, and Lu Zhang. 2023. Test flakiness’ causes, detection, impact and responses: A multivocal review. Journal of Systems and Software 206 (2023), 111837. DOI: 10.1016/j.jss.2023.111837

Roberto Verdecchia, Emilio Cruciani, Breno Miranda, and Antonia Bertolino. 2021. Know you neighbor: Fast static prediction of test flakiness. IEEE Access 9 (2021), 76119–76134.

Ruixin Wang, Yang Chen, and Wing Lam. 2022. iPFlakies: a framework for detecting and fixing python order-dependent flaky tests. In Proceedings of the ACM/IEEE 44th International Conference on Software Engineering: Companion Proceedings. 120–124.

Claes Wohlin. 2014. Guidelines for snowballing in systematic literature studies and a replication in software engineering. In Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering (London, England, United Kingdom) (EASE ’14). Association for Computing Machinery, New York, NY, USA, Article 38, 10 pages. DOI: 10.1145/2601248.2601268

Peilun Zhang, Yanjie Jiang, Anjiang Wei, Victoria Stodden, Darko Marinov, and August Shi. 2021. Domain-specific fixes for flaky tests with wrong assumptions on underdetermined specifications. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 50–61.

Sai Zhang, Darioush Jalali, Jochen Wuttke, Kıvanç Muşlu, Wing Lam, Michael D Ernst, and David Notkin. 2014. Empirically revisiting the test independence assumption. In Proceedings of the 2014 International Symposium on Software Testing and Analysis. 385–396.

Behrouz Zolfaghari, Reza M Parizi, Gautam Srivastava, and Yoseph Hailemariam. 2021. Root causing, detecting, and fixing flaky tests: state of the art and future roadmap. Software: Practice and Experience 51, 5 (2021), 851–867.
Published
30/09/2024
MARTINS, Pedro Anderson Costa; ALVES, Victor Anthony; LIMA, Iraneide; BEZERRA, Carla; MACHADO, Ivan. Exploring Tools for Flaky Test Detection, Correction, and Mitigation: A Systematic Mapping Study. In: SIMPÓSIO BRASILEIRO DE TESTES DE SOFTWARE SISTEMÁTICO E AUTOMATIZADO (SAST), 9., 2024, Curitiba/PR. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 11-20.