Analysis of the Chunking Process in RAG Architectures
Abstract
This paper studies how text segmentation affects Retrieval-Augmented Generation (RAG). We compare heuristic, semantic, and recursive chunking strategies on a Portuguese institutional corpus, holding the rest of the pipeline fixed: bge-base-en-v1.5 embeddings, a 500-token target chunk size, and about 10% overlap. In the semantic variant, chunk boundaries are detected from changes in similarity between sentence embeddings, using a 95th-percentile threshold. Evaluation covers intrinsic coherence and extrinsic metrics such as Factual Accuracy, Precision, Recall, F1-score, and Mean Reciprocal Rank (MRR).
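The semantic variant described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that "change in similarity" is measured as the cosine distance between consecutive sentence embeddings and that the 95th percentile is taken over those distances within a single document. The function names (`cosine_distance`, `percentile`, `semantic_breakpoints`, `split_by_breakpoints`) are hypothetical, and toy 2-D vectors stand in for real bge-base-en-v1.5 embeddings. The 500-token target size and the overlap are not enforced here.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def percentile(values: Sequence[float], q: float) -> float:
    """q-th percentile (0-100) with linear interpolation between ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def semantic_breakpoints(embeddings: Sequence[Sequence[float]],
                         q: float = 95.0) -> List[int]:
    """Indices of sentences that start a new chunk.

    Assumption: a boundary is placed wherever the distance between
    consecutive sentences exceeds the q-th percentile of all such distances.
    """
    distances = [cosine_distance(embeddings[i], embeddings[i + 1])
                 for i in range(len(embeddings) - 1)]
    if not distances:
        return []
    threshold = percentile(distances, q)
    return [i + 1 for i, d in enumerate(distances) if d > threshold]


def split_by_breakpoints(sentences: Sequence[str],
                         breakpoints: Sequence[int]) -> List[List[str]]:
    """Group sentences into chunks, cutting before each breakpoint index."""
    chunks, start = [], 0
    for bp in breakpoints:
        chunks.append(list(sentences[start:bp]))
        start = bp
    chunks.append(list(sentences[start:]))
    return chunks


if __name__ == "__main__":
    # Toy embeddings: three sentences on one topic, then three on another.
    toy_embeddings = [[1.0, 0.0], [0.99, 0.1], [0.98, 0.15],
                      [0.1, 0.99], [0.12, 0.98], [0.15, 0.97]]
    toy_sentences = [f"s{i}" for i in range(len(toy_embeddings))]
    cuts = semantic_breakpoints(toy_embeddings)
    print(cuts)
    print(split_by_breakpoints(toy_sentences, cuts))
```

Because the threshold is a percentile of the document's own distances, the number of boundaries scales with document length, which is one reason such a scheme is usually combined with a size cap like the 500-token target mentioned in the abstract.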
Published
12/11/2025
How to Cite
AMARAL, Vitor Mateus R. do; ESPINOSA, Luiza; LUNARDI, Gabriel M.; OLIVEIRA, Adriano Q. de; SILVEIRA, Thiago Lopes T. da; EMMENDORFER, Leonardo R. Analysis of the Chunking Process in RAG Architectures. In: ESCOLA REGIONAL DE APRENDIZADO DE MÁQUINA E INTELIGÊNCIA ARTIFICIAL DA REGIÃO SUL (ERAMIA-RS), 1., 2025, Porto Alegre/RS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 404-407. DOI: https://doi.org/10.5753/eramiars.2025.16765.