Unveiling Power on Combining Prompt Engineering Techniques: An Experimental Evaluation on Code Generation

  • Cristofer Girardi Federal Institute of Education, Science and Technology of Paraíba (IFPB)
  • Damires Yluska de Souza Fernandes Federal Institute of Education, Science and Technology of Paraíba (IFPB)
  • Alex Sandro da Cunha Rêgo Federal Institute of Education, Science and Technology of Paraíba (IFPB)

Abstract


Prompt engineering techniques have attracted significant research interest as a means of achieving satisfactory results without retraining language models. This work presents a set of experiments that analyze the power of combining prompts. To this end, it evaluates six prompting techniques and combines them into twelve experimental scenarios applied to Python code generation. Evaluation with BERTScore indicates that the Role technique combined with RAG achieves the highest code-generation performance, reaching 98% similarity. Skeleton-of-Thought and Self-Verification also reveal promising opportunities for the design of prompt templates. Our findings help unveil the power of combining prompt techniques for current applications such as code generation.
Keywords: Prompt Engineering, LLM, Code Generation
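
To make the combination described in the abstract concrete, the sketch below illustrates, under our own assumptions rather than the authors' exact pipeline, how a Role instruction can be merged with RAG-style retrieved context into a single prompt for Python code generation, and how the generated code could then be compared against a reference solution with BERTScore. The retriever, the prompt wording, and the corpus are hypothetical placeholders; only the bert_score call reflects the real library API.

# Minimal sketch of a Role + RAG prompt combination and BERTScore scoring.
# Assumptions: a toy word-overlap retriever stands in for the real retrieval
# backend, and the prompt text is illustrative, not the authors' template.
from bert_score import score  # pip install bert-score


def retrieve_context(task, corpus, k=1):
    """Toy retriever: rank corpus snippets by word overlap with the task."""
    task_words = set(task.lower().split())
    ranked = sorted(corpus,
                    key=lambda s: len(task_words & set(s.lower().split())),
                    reverse=True)
    return ranked[:k]


def build_role_rag_prompt(task, context):
    """Combine a Role instruction with retrieved context (RAG) in one prompt."""
    role = "You are an experienced Python developer. Write clean, correct code."
    ctx = "\n\n".join(f"Reference material:\n{c}" for c in context)
    return f"{role}\n\n{ctx}\n\nTask: {task}\nAnswer with Python code only."


def bertscore_f1(candidate, reference):
    """BERTScore F1 between generated code and a reference solution, as text."""
    _, _, f1 = score([candidate], [reference], lang="en")
    return float(f1.mean())


# Typical use: send build_role_rag_prompt(task, retrieve_context(task, corpus))
# to the model under evaluation, then score its output with bertscore_f1().

The word-overlap retriever is only a stand-in for whatever retrieval backend an experiment might use; the point is how the Role and RAG techniques compose into a single prompt whose output can be scored against a reference implementation.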

References

Bansal, P. (2024). Prompt engineering importance and applicability with generative AI. Journal of Computer and Communications, 12.

Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. D. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. (2021). Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.

Coignion, T., Quinton, C., and Rouvoy, R. (2024). A performance study of LLM-generated code on LeetCode. In Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering, pages 79–89.

Damke, G., Gregorini, D., and Copetti, L. (2024). Avaliação da performance e corretude na geração de código através de técnicas de engenharia de prompt: Um estudo comparativo. In Anais do XXI Congresso Latino-Americano de Software Livre e Tecnologias Abertas, pages 400–403, Porto Alegre, RS, Brasil. SBC.

Deng, Y., Zhang, W., Chen, Z., and Gu, Q. (2023). Rephrase and respond: Let large language models ask better questions for themselves. arXiv preprint arXiv:2311.04205.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Dodge, Y. (2008). The concise encyclopedia of statistics. Springer Science & Business Media.

Gouveia, T., Albuquerque, K. M. M., Oliveira, J. D., and Maciel, V. M. B. C. (2023). C073: ferramenta para apoio ao ensino de programação usando a metodologia de aprendizagem baseada em problemas. Revista Principia, 60(1):70–87.

Hu, T. and Zhou, X.-H. (2024). Unveiling LLM evaluation focused on metrics: Challenges and solutions. arXiv preprint arXiv:2404.09135.

Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., and Iwasawa, Y. (2022). Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213.

Korzyński, P., Mazurek, G., Krzypkowska, P., and Kurasiński, A. (2023). Artificial intelligence prompt engineering as a new digital competence: Analysis of generative AI technologies such as ChatGPT. Entrepreneurial Business and Economics Review.

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474.

Mahir, A., Shohel, M. M. C., and Sall, W. (2024). The Role of AI in Programming Education: An Exploration of the Effectiveness of Conversational Versus Structured Prompting, pages 319–352. Practitioner Research in College-Based Education.

Medeiros, A., Cavalcante, C., Nepomuceno, J., Lago, L., Ruberg, N., and Lifschitz, S. (2024). Contrato360: uma aplicação de perguntas e respostas usando modelos de linguagem, documentos e bancos de dados. In Anais do XXXIX Simpósio Brasileiro de Bancos de Dados, pages 155–166, Porto Alegre, RS, Brasil. SBC.

Neves, B., Sousa, T., Coutinho, D., Garcia, A., and Pereira, J. (2024). Explorando o potencial e a viabilidade de LLMs open-source na análise de sentimentos. In Anais Estendidos do XV Congresso Brasileiro de Software: Teoria e Prática, pages 89–98, Porto Alegre, RS, Brasil. SBC.

Ning, X., Lin, Z., Zhou, Z., Wang, Z., Yang, H., and Wang, Y. (2024). Skeleton-of-thought: Large language models can do parallel decoding. Proceedings ENLSP-III.

Reynolds, L. and McDonell, K. (2021). Prompt programming for large language models: Beyond the few-shot paradigm. In Extended abstracts of the 2021 CHI conference on human factors in computing systems, pages 1–7.

Sabit, E. (2023). Prompt engineering for ChatGPT: a quick guide to techniques, tips, and best practices. TechRxiv preprint 10.36227/techrxiv.22683919.

Sarker, L., Downing, M., Desai, A., and Bultan, T. (2024). Syntactic robustness for LLM-based code generation. arXiv preprint arXiv:2404.01535.

Schulhoff, S., Ilie, M., Balepur, N., Kahadze, K., Liu, A., Si, C., Li, Y., Gupta, A., Han, H., Schulhoff, S., Dulepet, P. S., Vidyadhara, S., Ki, D., Agrawal, S., Pham, C., Kroiz, G., Li, F., Tao, H., Srivastava, A., Costa, H. D., Gupta, S., Rogers, M. L., Goncearenco, I., Sarli, G., Galynker, I., Peskoff, D., Carpuat, M., White, J., Anadkat, S., Hoyle, A., and Resnik, P. (2025). The prompt report: A systematic survey of prompting techniques. arXiv preprint arXiv:2406.06608.

Shin, J., Tang, C., Mohati, T., Nayebi, M., Wang, S., and Hemmati, H. (2023). Prompt engineering or fine tuning: An empirical assessment of large language models in automated software engineering tasks. arXiv preprint arXiv:2310.10508.

Vatsal, S. and Dubey, H. (2024). A survey of prompt engineering methods in large language models for different NLP tasks. arXiv preprint arXiv:2407.12994.

Wang, T., Zhou, N., and Chen, Z. (2024). Enhancing computer programming education with LLMs: A study on effective prompt engineering for Python code generation. arXiv preprint arXiv:2407.05437.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837.

Weng, Y., Zhu, M., Xia, F., Li, B., He, S., Liu, S., Sun, B., Liu, K., and Zhao, J. (2022). Large language models are better reasoners with self-verification. arXiv preprint arXiv:2212.09561.

Woolson, R. F. (2005). Wilcoxon signed-rank test. Encyclopedia of biostatistics, 8.

Zheng, K., Decugis, J., Gehring, J., Cohen, T., Negrevergne, B., and Synnaeve, G. (2024). What makes large language models reason in (multi-turn) code generation? arXiv preprint arXiv:2410.08105.

Zhou, Y., Muresanu, A. I., Han, Z., Paster, K., Pitis, S., Chan, H., and Ba, J. (2022). Large language models are human-level prompt engineers. arXiv preprint arXiv:2211.01910.
Published
2025-09-29
GIRARDI, Cristofer; FERNANDES, Damires Yluska de Souza; RÊGO, Alex Sandro da Cunha. Unveiling Power on Combining Prompt Engineering Techniques: An Experimental Evaluation on Code Generation. In: BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 40., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 357-370. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2025.247251.