A Goal-Oriented Chat-Like System for Evaluation of Large Language Models

  • Guilherme S. Teodoro Junior USP
  • Sarajane M. Peres USP
  • Marcelo Fantinato USP
  • Anarosa A. F. Brandão USP
  • Fabio G. Cozman USP

Abstract


Large language models have changed the way many applications are developed. Interactions with large language models have reached a new level of complexity, and these models now act as genuine problem solvers. However, despite their apparent competence, it is still necessary to accredit them with respect to the tasks they are assigned. In this paper, we discuss a systemic approach to accrediting large language models by integrating them with a goal-oriented chat-like system. An experiment involving prompt engineering for two models from the GPT family illustrates our evaluation scheme applied to a real-world chatbot use case; the scheme reveals that the resulting chatbots perform well but are not yet ready for real-world dialogues under specific requirements.
Keywords: Large Language Models, Large Language Models Evaluation, Conversational Agents

Published
17/11/2024
TEODORO JUNIOR, Guilherme S.; PERES, Sarajane M.; FANTINATO, Marcelo; BRANDÃO, Anarosa A. F.; COZMAN, Fabio G. A Goal-Oriented Chat-Like System for Evaluation of Large Language Models. In: ENCONTRO NACIONAL DE INTELIGÊNCIA ARTIFICIAL E COMPUTACIONAL (ENIAC), 21., 2024, Belém/PA. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 743-754. ISSN 2763-9061. DOI: https://doi.org/10.5753/eniac.2024.245208.
