Automated testing framework to evaluate multi-agent chat assistants

Lucas Ramalho; Jose Sousa; Maria Nascimento; Raiza Hanada; Cristian Souza; Eliane Collins

doi:10.5753/cibse.2026.42466

Lucas Ramalho INDT
Jose Sousa INDT
Maria Nascimento INDT
Raiza Hanada INDT
Cristian Souza INDT
Eliane Collins INDT

Abstract

This applied R&D project proposes an automated framework for evaluating multi-agent conversational assistants equipped with retrieval-augmented generation (RAG) capabilities. The solution addresses the high cost and time demands of manual evaluation by introducing a synthetic persona dataset and an automated pipeline that executes large-scale tests on mobile devices. Tests with English and Portuguese personas revealed recurring weaknesses in multi-agent systems, particularly in Portuguese interactions, highlighting the importance of multilingual evaluation. The project is a collaboration between INDT and Motorola Mobility and aims to provide a systematic methodology and testing infrastructure for industry conversational systems.
