Quality Assessment of Java Code Generated by Large Language Models

Marco Tullio Oliveira; Pedro Márcio Oliveira Silveira; Michelle Hanne S. de Andrade

doi:10.5753/eres.2025.16845

Marco Tullio Oliveira PUC-Minas
Pedro Márcio Oliveira Silveira PUC-Minas
Michelle Hanne S. de Andrade PUC-Minas

Abstract

Software quality has become increasingly important as software systems are used across a wide range of domains. Large language models (LLMs) have emerged as a promising tool for improving software quality, but how they affect that quality is still not well understood. This work addresses that gap by investigating the impact of LLMs on the quality of the software they generate. To that end, analyses were carried out to measure the quality of Java code generated by LLMs on 204 programming problems. The study aims to help narrow the existing gap in the literature by analyzing quality metrics related to the efficient application of LLMs in software development.

Keywords: Software quality, LLMs, Software development, Generated code, Quality metrics
