Automated testing framework to evaluate multi-agent chat assistants
Abstract
This applied R&D project proposes an automated framework for evaluating multi-agent conversational assistants equipped with retrieval-augmented generation (RAG) capabilities. The solution addresses the high cost and time demands of manual evaluation by introducing a synthetic persona dataset and an automated pipeline that executes large-scale tests on mobile devices. Tests with English and Portuguese personas revealed recurring weaknesses in multi-agent systems, particularly in Portuguese interactions, highlighting the importance of multilingual evaluation. The project is a collaboration between INDT and Motorola Mobility and aims to provide a systematic methodology and testing infrastructure for industrial conversational systems.
References
Li, Y., Wen, H., Wang, W., Li, X., Yuan, Y., Liu, G., Liu, J., Xu, W., Wang, X., Sun, Y., Kong, R., Wang, Y., Geng, H., Luan, J., Jin, X., Ye, Z.-L., Xiong, G., Zhang, F., Li, X., Xu, M., Li, Z., Li, P., Liu, Y., Zhang, Y., and Liu, Y. (2024). Personal LLM agents: Insights and survey about the capability, efficiency and security. ArXiv, abs/2401.05459.
Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., and Scialom, T. (2023). Toolformer: Language models can teach themselves to use tools.
Yukhymenko, H., Staab, R., Vero, M., and Vechev, M. (2025). A synthetic dataset for personal attribute inference. In Proceedings of the 38th International Conference on Neural Information Processing Systems, NIPS ’24, Red Hook, NY, USA. Curran Associates Inc.
Published
11/05/2026
How to Cite
RAMALHO, Lucas; SOUSA, Jose; NASCIMENTO, Maria; HANADA, Raiza; SOUZA, Cristian; COLLINS, Eliane. Automated testing framework to evaluate multi-agent chat assistants. In: CONGRESSO IBERO-AMERICANO EM ENGENHARIA DE SOFTWARE (CIBSE), 29., 2026, Recife/PE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2026. p. 400-403.
