Unit Test Generation with LLMs in Industrial Environments: Challenges, Evolution, and Practical Lessons
Abstract
Large Language Models (LLMs) show potential for automatically generating unit tests, but their industrial application presents challenges. We present a longitudinal case study on the implementation and evaluation of LLMs for test generation at LKS Next, integrating standard tools such as SonarQube. Our study yields findings on how these technologies evolve over time in production environments and distills lessons learned. The results offer evidence-based guidance for organizations considering adopting these solutions, highlighting practical integration and maintainability considerations often absent from theoretical studies.