Effectiveness of Small and Large Language Models for PL/SQL Bad Smell Detection

  • Vinicius Ferreira de Sousa Universidade Federal de Campina Grande (UFCG)
  • Cláudio de Souza Baptista Universidade Federal de Campina Grande (UFCG)
  • André Luiz Firmino Alves Instituto Federal de Educação, Ciência e Tecnologia da Paraíba (IFPB) https://orcid.org/0000-0001-7883-5129
  • Hugo Feitosa de Figueirêdo Instituto Federal de Educação, Ciência e Tecnologia da Paraíba (IFPB)

Abstract


The quality of PL/SQL code is critical for enterprise systems built on Oracle Database, yet traditional quality-assurance methods struggle to uncover nuanced code smells and complex semantic flaws. This study evaluates the effectiveness of two large language models (GPT-4o and Gemini 2.0 Flash) and two small language models (GPT-4o mini and Phi-4) at detecting bad smells in PL/SQL code. Using a uniform prompt structure, each model was tasked with identifying bad smells in a curated dataset of PL/SQL snippets. Results show that effectiveness varied significantly across models and bad smell types. These findings offer practical guidance for selecting and applying language models in PL/SQL code analysis.

Keywords: PL/SQL, Code Smells, Code Analysis, Language Models, Artificial Intelligence
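
The dataset and the exact prompt used in the study are not reproduced on this page. As a purely illustrative sketch (with hypothetical names such as update_salary and employees), the procedure below exhibits one widely documented PL/SQL bad smell, a WHEN OTHERS handler that silently discards every exception; this is the kind of defect the evaluated models are asked to flag.

  -- Illustrative only; not an excerpt from the paper's dataset.
  CREATE OR REPLACE PROCEDURE update_salary(
    p_emp_id IN NUMBER,
    p_raise  IN NUMBER
  ) IS
  BEGIN
    UPDATE employees
       SET salary = salary + p_raise
     WHERE employee_id = p_emp_id;
    COMMIT;
  EXCEPTION
    WHEN OTHERS THEN
      NULL; -- bad smell: every error is silently swallowed, so callers never learn the update failed
  END update_salary;
  /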

Published
29/09/2025
SOUSA, Vinicius Ferreira de; BAPTISTA, Cláudio de Souza; ALVES, André Luiz Firmino; FIGUEIRÊDO, Hugo Feitosa de. Effectiveness of Small and Large Language Models for PL/SQL Bad Smell Detection. In: SIMPÓSIO BRASILEIRO DE BANCO DE DADOS (SBBD), 40., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 399-412. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2025.247256.