A Modular Architecture Proposal for Multi-Turn Conversational RAG Systems

  • Guilherme C. Dutra (UFG)
  • André Felipe dos S. Caraíba (UFG)
  • João Pedro A. F. Matos (UFG)
  • Nádia F. F. da Silva (UFG)
  • Deborah S. A. Fernandes (UFG)
  • Sávio S. T. de Oliveira (UFG)

Abstract

Conversational systems face growing challenges in understanding context, resolving references, and maintaining coherence across multiple user turns. SemEval-2026 Task 8 [Katsis et al. 2026] challenges participants to build conversational Retrieval-Augmented Generation (RAG) systems capable of handling multi-turn interactions with context dependencies, coreferences, and diverse question types. We propose a modular architecture combining five complementary strategies: (1) chain-of-thought (CoT) query rewriting with multi-query diversification; (2) hybrid BM25+dense retrieval; (3) reranking of retrieved documents; (4) answerability detection; and (5) specialized guardrails. Our contribution demonstrates how the systematic integration of classical IR techniques with advanced prompting addresses the requirements of SemEval-2026 Task 8.
Keywords: Conversational RAG, Multi-Turn Systems, Query Rewriting, Hybrid Retrieval, Guardrails
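Strategy (2) requires merging the ranked lists produced by the BM25 and dense retrievers. The abstract does not say which fusion scheme the authors use; a minimal sketch with reciprocal rank fusion (RRF), a common choice for hybrid retrieval, could look like this (document IDs are hypothetical):

```python
def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document IDs with reciprocal rank fusion.

    Each document scores 1 / (k + rank) per list it appears in; higher
    total score means higher fused rank. k=60 is the value commonly
    used in the RRF literature.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical top-3 results from each retriever.
bm25_hits = ["d3", "d1", "d7"]
dense_hits = ["d1", "d9", "d3"]

fused = rrf_fuse([bm25_hits, dense_hits])
# d1 ranks first: it appears near the top of both lists.
```

RRF is attractive here because it needs only ranks, not scores, so the incomparable BM25 and cosine-similarity scales never have to be normalized against each other.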

References

Adeyemi, M., Oladipo, A., Pradeep, R., and Lin, J. (2024). Zero-shot cross-lingual reranking with large language models for low-resource languages. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 650–656.

Aliannejadi, M., Abbasiantaeb, Z., Chatterjee, S., Dalton, J., and Azzopardi, L. (2024). TREC iKAT 2023: A test collection for evaluating conversational and interactive knowledge assistants. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 819–829.

Bassani, E. and Sanchez, I. (2024). GuardBench: A large-scale benchmark for guardrail models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 18393–18409.

Breuer, T. (2024). Data fusion of synthetic query variants with generative large language models. In Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, pages 274–279.

Chen, J. and Mueller, J. (2024). Quantifying uncertainty in answers from any language model and enhancing their trustworthiness. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5186–5200.

Katsis, Y., Rosenthal, S., Fadnis, K., Gunasekara, C., Lee, Y.-S., Popa, L., Shah, V., Zhu, H., Contractor, D., and Danilevsky, M. (2025). MTRAG: A multi-turn conversational benchmark for evaluating retrieval-augmented generation systems. Transactions of the Association for Computational Linguistics, 13:784–808.

Katsis, Y., Rosenthal, S., Fadnis, K., Gunasekara, C., Lee, Y.-S., Popa, L., Shah, V., Zhu, H., Contractor, D., and Danilevsky, M. (2026). SemEval-2026 Task 8: Multi-Turn Conversational RAG Evaluation. [link]. Accessed: 2025-01-15.

Kuo, T.-L., Liao, F., Hsieh, M.-W., Chang, F.-C., Hsu, P.-C., and Shiu, D.-s. (2025). RAD-Bench: Evaluating large language models’ capabilities in retrieval augmented dialogues. In Proceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3: Industry Track, pages 868–902.

Lee, Y., Kim, M., and Hwang, S.-w. (2024). Disentangling questions from query generation for task-adaptive retrieval. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 4775–4785.

Li, Z., Liu, H., Zhou, D., and Ma, T. (2024). Chain of thought empowers transformers to solve inherently serial problems. In The Twelfth International Conference on Learning Representations.

OpenAI (2024). GPT-4o mini: Advancing cost-efficient intelligence. [link]. Accessed: 08 Oct. 2025.

Qdrant Team (2024). Qdrant: High-performance, massive-scale vector database and vector search engine. [link]. Accessed: 08 Oct. 2025.

Rebedea, T., Derczynski, L., Ghosh, S., Sreedhar, M. N., Brahman, F., Jiang, L., Li, B., Tsvetkov, Y., Parisien, C., and Choi, Y. (2025). Guardrails and security for LLMs: Safe, secure and controllable steering of LLM applications. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 5: Tutorial Abstracts), pages 13–15.

Reddy, R. G., Doo, J., Xu, Y., Sultan, M. A., Swain, D., Sil, A., and Ji, H. (2024). FIRST: Faster improved listwise reranking with single token decoding. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8642–8652.

Shi, Z., Castellucci, G., Filice, S., Kuzi, S., Kravi, E., Agichtein, E., Rokhlenko, O., and Malmasi, S. (2025). Ambiguity detection and uncertainty calibration for question answering with large language models. In Proceedings of the 5th Workshop on Trustworthy NLP, pages 41–55.

Song, J., Wang, X., Zhu, J., Wu, Y., Cheng, X., Zhong, R., and Niu, C. (2024). RAG-HAT: A hallucination-aware tuning pipeline for LLM in retrieval-augmented generation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 1548–1558.

Wang, X. and Zhou, D. (2025). Chain-of-thought reasoning without prompting. In Proceedings of the 38th International Conference on Neural Information Processing Systems, NIPS ’24, Red Hook, NY, USA. Curran Associates Inc.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E. H., Le, Q. V., and Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY, USA. Curran Associates Inc.

Xia, Z., Xu, J., Zhang, Y., and Liu, H. (2025). A survey of uncertainty estimation methods on large language models. In Findings of the Association for Computational Linguistics: ACL 2025, pages 21381–21396.

Ye, Q., Ahmed, M., Pryzant, R., and Khani, F. (2024). Prompt engineering a prompt engineer. In Findings of the Association for Computational Linguistics: ACL 2024, pages 355–385.

Zhang, L., Wu, Y., Yang, Q., and Nie, J.-Y. (2024). Exploring the best practices of query expansion with large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 1872–1883.

Zhuang, S., Ma, X., Koopman, B., Lin, J., and Zuccon, G. (2024). PromptReps: Prompting large language models to generate dense and sparse representations for zero-shot document retrieval. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 4375–4391.
Published
04/12/2025
DUTRA, Guilherme C.; CARAÍBA, André Felipe dos S.; MATOS, João Pedro A. F.; SILVA, Nádia F. F. da; FERNANDES, Deborah S. A.; OLIVEIRA, Sávio S. T. de. A Modular Architecture Proposal for Multi-Turn Conversational RAG Systems. In: ESCOLA REGIONAL DE INFORMÁTICA DE GOIÁS (ERI-GO), 13., 2025, Luziânia/GO. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 370-373. DOI: https://doi.org/10.5753/erigo.2025.17155.