Enhancing Epidemiological Insights with RAG for SIREVA-SUS Reports
Abstract
This study introduces a Retrieval-Augmented Generation (RAG) framework for extracting and generating epidemiological insights from the reports of SIREVA-SUS, Brazil’s national surveillance system for bacterial pathogens. By combining dense information retrieval with generative language models, the approach supports explainable question answering over extensive, unstructured epidemiological data, helping healthcare professionals and researchers detect trends, outbreaks, and antimicrobial resistance patterns. The evaluation compares several large language models (LLMs), including the multilingual Qwen and the Portuguese fine-tuned models Sabiá, port5, and BERTimbau. The architecture also examines multiple retrieval strategies, including vector, dense, sparse, hybrid, and reranker methods. Preliminary findings suggest that the system retrieves relevant information effectively, with reasonable BERTScore values. This work reinforces the importance of language-specific evaluation and opens new directions for deploying RAG systems in public health decision support.
References
Aguilar-Vargas, F., Solorzano-Scott, T., Baldi, M., Barquero-Calvo, E., Jiménez-Rocha, A., Jiménez, C., Piche-Ovares, M., Dolz, G., León, B., Corrales-Aguilar, E., et al. (2022). Passive epidemiological surveillance in wildlife in Costa Rica identifies pathogens of zoonotic and conservation importance. PLoS One, 17(9):e0262063.
AI, M. (2024). Sabiá: Large language models for Portuguese. [link]. Accessed: 2024-06-06.
Amugongo, L. M., Mascheroni, P., Brooks, S., Doering, S., and Seidel, J. (2025). Retrieval augmented generation for large language models in healthcare: A systematic review. PLOS Digital Health, 4(6):e0000877.
Ateia, S. and Kruschwitz, U. (2025). BioRAGent: A retrieval-augmented generation system for showcasing generative query expansion and domain-specific search for scientific Q&A. In European Conference on Information Retrieval, pages 1–5. Springer.
Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., Hui, B., Ji, L., Li, M., Lin, J., Lin, R., Liu, D., Liu, G., Lu, C., Lu, K., Ma, J., Men, R., Ren, X., Ren, X., Tan, C., Tan, S., Tu, J., Wang, P., Wang, S., Wang, W., Wu, S., Xu, B., Xu, J., Yang, A., Yang, H., Yang, J., Yang, S., Yao, Y., Yu, B., Yuan, H., Yuan, Z., Zhang, J., Zhang, X., Zhang, Y., Zhang, Z., Zhou, C., Zhou, J., Zhou, X., and Zhu, T. (2023). Qwen technical report. arXiv preprint arXiv:2309.16609.
Carmo, D., Piau, M., Campiotti, I., Nogueira, R., and Lotufo, R. (2020). PTT5: Pretraining and validating the T5 model on Brazilian Portuguese data. arXiv preprint arXiv:2008.09144.
Docling Project (2024). Docling - toolkit for document parsing and structuring. [link]. Accessed: 2024-06-06.
Fan, W., Ding, Y., Ning, L., Wang, S., Li, H., Yin, D., Chua, T.-S., and Li, Q. (2024). A survey on RAG meeting LLMs: Towards retrieval-augmented large language models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 6491–6501.
Gilson, A., Ai, X., Arunachalam, T., Chen, Z., Cheong, K. X., Dave, A., Duic, C., Kibe, M., Kaminaka, A., Prasad, M., et al. (2024). Enhancing large language models with domain-specific retrieval augment generation: A case study on long-form consumer health question answering in ophthalmology. arXiv preprint arXiv:2409.13902.
Ihekweazu, C., Yinka-Ogunleye, A., Lule, S., and Ibrahim, A. (2020). Importance of epidemiological research of monkeypox: is incidence increasing? Expert Review of Anti-infective Therapy, 18(5):389–392.
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474.
Pan American Health Organization (2024). SIREVA: Regional System for Vaccines. [link]. Accessed: 2024-05-17.
Pruccoli, G., Castagno, E., Raffaldi, I., Denina, M., Barisone, E., Baroero, L., Timeus, F., Rabbone, I., Monzani, A., Terragni, G. M., et al. (2023). The importance of RSV epidemiological surveillance: a multicenter observational study of RSV infection during the COVID-19 pandemic. Viruses, 15(2):280.
Red SIREVA Network (2024). Brasil – SIREVA. [link]. Accessed: 2024-05-17.
Souza, F., Nogueira, R., and Lotufo, R. (2020). BERTimbau: pretrained BERT models for Brazilian Portuguese. In Brazilian Conference on Intelligent Systems, pages 403–417. Springer.
Xiong, G., Jin, Q., Lu, Z., and Zhang, A. (2024a). Benchmarking retrieval-augmented generation for medicine. In Findings of the Association for Computational Linguistics ACL 2024, pages 6233–6251.
Xiong, G., Jin, Q., Wang, X., Zhang, M., Lu, Z., and Zhang, A. (2024b). Improving retrieval-augmented generation in medicine with iterative follow-up questions. In Biocomputing 2025: Proceedings of the Pacific Symposium, pages 199–214. World Scientific.
Zhu, Y., Ren, C., Wang, Z., Zheng, X., Xie, S., Feng, J., Zhu, X., Li, Z., Ma, L., and Pan, C. (2024). Emerge: Enhancing multimodal electronic health records predictive modeling with retrieval-augmented generation. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, pages 3549–3559.
Ziletti, A. and D’Ambrosi, L. (2024). Retrieval augmented text-to-SQL generation for epidemiological question answering using electronic health records. arXiv preprint arXiv:2403.09226.
Published
29/09/2025
How to Cite
FREITAS, Christian; RABONATO, Ricardo Trainotti; BERTON, Lilian. Enhancing Epidemiological Insights with RAG for SIREVA-SUS Reports. In: ENCONTRO NACIONAL DE INTELIGÊNCIA ARTIFICIAL E COMPUTACIONAL (ENIAC), 22., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 1364-1375. ISSN 2763-9061. DOI: https://doi.org/10.5753/eniac.2025.11796.
