MOPrompt: Multi-objective Semantic Evolution for Prompt Optimization
Abstract
Prompt engineering is crucial for unlocking the potential of Large Language Models (LLMs). However, because manual prompt design is often complex, non-intuitive, and time-consuming, automatic prompt optimization has emerged as an active research area. A significant challenge in prompt optimization is managing the inherent trade-off between task performance, such as accuracy, and context size. Most existing automated methods focus on a single objective, typically performance, and thus fail to explore the critical spectrum between efficiency and effectiveness. This paper introduces MOPrompt, a novel Evolutionary Multi-objective Optimization (EMO) framework designed to optimize prompts for both accuracy and context size (measured in tokens) simultaneously. Our framework maps the Pareto front of prompt solutions, presenting practitioners with a set of trade-offs between context size and performance, a crucial tool for deploying LLMs in real-world applications. We evaluate MOPrompt on a sentiment analysis task in Portuguese, using Gemma-2B and Sabiazinho-3 as evaluation models. Our findings show that MOPrompt substantially outperforms the baseline framework. For Sabiazinho-3, MOPrompt identifies a prompt that achieves the same peak accuracy (0.97) as the best baseline solution, but with a 31% reduction in token length.
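The central idea above, selecting prompts that are non-dominated with respect to accuracy (maximize) and token count (minimize), can be illustrated with a minimal sketch. The function names, candidate scores, and two-objective representation below are illustrative assumptions, not the paper's actual implementation:

```python
# Sketch of Pareto-front selection over candidate prompts, each scored as an
# (accuracy, token_count) pair. Accuracy is maximized; token count is minimized.

def dominates(a, b):
    """True if candidate a is at least as good as b on both objectives
    and strictly better on at least one."""
    acc_a, tok_a = a
    acc_b, tok_b = b
    return (acc_a >= acc_b and tok_a <= tok_b) and (acc_a > acc_b or tok_a < tok_b)

def pareto_front(candidates):
    """Return the non-dominated subset of (accuracy, token_count) pairs."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates if other != c)]

# Hypothetical evaluation scores for five candidate prompts.
scores = [(0.97, 120), (0.97, 83), (0.90, 60), (0.85, 100), (0.92, 70)]
print(sorted(pareto_front(scores)))  # the accuracy/size trade-off frontier
```

A full EMO framework such as NSGA-II adds non-dominated sorting into ranked fronts and crowding-distance selection on top of this dominance check; the sketch shows only the trade-off frontier that practitioners would choose from.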
