Performance and Correctness Evaluation in Code Generation Using Prompt Engineering Techniques: A Comparative Study

  • Gabriel Trevisan Damke, UTFPR
  • Daniel Mahl Gregorini, UTFPR
  • Luana Copetti, UTFPR

Abstract


Prompt engineering is a relatively new practice that plays a crucial role in the effectiveness of language models, including in tasks such as code generation. This research compares the performance of different prompt engineering techniques in code generation. The techniques are evaluated on two main metrics: the correctness and the performance of the generated code. A small dataset of 12 exercises of varying difficulty levels was manually created, each involving a trade-off between time and space. Performance was measured as the ability to produce a solution with optimal asymptotic complexity under a given constraint. The results were analyzed using the Pass@k metric, with k set to 1 and 3. Two distinct language models were used to obtain the results: Meta Llama 3 8B and Google Gemma 7B.
Keywords: Prompt Engineering, Artificial Intelligence, Code Generation
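The Pass@k values reported in the study can be computed with the standard unbiased estimator introduced for code-generation benchmarks by Chen et al. (2021): given n generated samples per exercise of which c pass the tests, Pass@k = 1 - C(n-c, k)/C(n, k). A minimal sketch follows; the function and variable names are illustrative, not taken from the paper.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator.

    Probability that at least one of k samples drawn without replacement
    from n generations, of which c are correct, passes the tests:
    1 - C(n-c, k) / C(n, k), computed in product form for stability.
    """
    if n - c < k:
        # Fewer incorrect samples than k: every draw of k must include a pass.
        return 1.0
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

# Example: 3 generations per exercise, 1 of which is correct.
print(pass_at_k(3, 1, 1))  # Pass@1 = 1/3
print(pass_at_k(3, 1, 3))  # Pass@3 = 1.0 (the correct sample is always drawn)
```

With n equal to k (one batch of generations per exercise), Pass@k reduces to "at least one of the k samples is correct", which matches the k = 1 and k = 3 setting described in the abstract.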

Published: 2024-11-27

How to cite:
DAMKE, Gabriel Trevisan; GREGORINI, Daniel Mahl; COPETTI, Luana. Performance and Correctness Evaluation in Code Generation Using Prompt Engineering Techniques: A Comparative Study. In: LATIN AMERICAN CONGRESS ON FREE SOFTWARE AND OPEN TECHNOLOGIES (LATINOWARE), 21., 2024, Foz do Iguaçu/PR. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 400-403. DOI: https://doi.org/10.5753/latinoware.2024.245745.