Unit Test Generation with LLMs in Industrial Environments: Challenges, Evolution, and Practical Lessons
Abstract
Large Language Models (LLMs) show potential for automatically generating unit tests, but their industrial application presents challenges. We present a longitudinal case study on the implementation and evaluation of LLMs for test generation at LKS Next, integrating standard tools such as SonarQube. Our study yields findings on how these technologies evolve over time in production environments and distills lessons learned. The results offer evidence-based guidance for organizations considering adopting these solutions, highlighting practical integration and maintainability considerations often absent from theoretical studies.