Preliminary Ranking-Based Selection for Optimized Retriever Configuration in RAG Systems

Abstract

Retrieval-Augmented Generation (RAG) systems rely on efficient retrievers to fetch relevant documents, and their performance is influenced by factors such as chunking methods and embedding models. These components determine how documents are segmented and semantically represented, directly affecting retrieval effectiveness. To enhance retriever performance, this study explores the construction of an optimizer capable of selecting the best configuration for document retrieval within a predefined solution space. Selecting these technologies is a critical step that must account for project constraints, the nature of the queries, and domain-specific requirements; a key challenge is filtering out unsuitable technologies while ensuring optimal performance.

Keywords: Retrieval-Augmented Generation (RAG), Retriever Optimization, Embedding Models, Text Chunking, Information Retrieval Systems, AI Component Optimization, Recall@k, MRR (Mean Reciprocal Rank)
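The keywords above name Recall@k and MRR as the retrieval metrics used to compare configurations. As a minimal sketch of how a ranking-based selector could score and order candidate retriever configurations (chunking method plus embedding model) on a labelled query set, consider the following. All names here (`rank_configs`, the toy run format) are illustrative assumptions, not the paper's actual implementation.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of relevant documents found in the top-k ranked results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)

def mrr_single(ranked, relevant):
    """Reciprocal rank of the first relevant document (0 if none retrieved)."""
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

def score_config(runs, qrels, k=5):
    """Average Recall@k and MRR over all queries.

    runs:  {query_id: [doc_id, ...]} ranked retrieval results
    qrels: {query_id: {relevant_doc_id, ...}} relevance judgments
    """
    recall = sum(recall_at_k(runs[q], qrels[q], k) for q in qrels) / len(qrels)
    mrr = sum(mrr_single(runs[q], qrels[q]) for q in qrels) / len(qrels)
    return recall, mrr

def rank_configs(results, qrels, k=5):
    """Rank candidate configurations by mean Recall@k, breaking ties by MRR.

    results: {config_name: runs}, one retrieval run per candidate configuration.
    Returns a list of (config_name, (recall, mrr)) sorted best-first.
    """
    scored = {name: score_config(runs, qrels, k) for name, runs in results.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)
```

In this sketch each candidate configuration is treated as a black box that has already produced a ranked run per query; the selector then reduces each run to a (Recall@k, MRR) pair and sorts, which is one simple way to realize a preliminary ranking-based selection over a predefined solution space.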

Published
29/09/2025
LUDOVICO PARANHOS, Salvador; TOMAZINI, Jonatas Novais; OLIVEIRA, Sávio Teles de; CAMILO JUNIOR, Celso; DE OLIVEIRA, Sávio Salvarino Teles. Preliminary Ranking-Based Selection for Optimized Retriever Configuration in RAG Systems. In: SIMPÓSIO BRASILEIRO DE BANCO DE DADOS (SBBD), 40., 2025, Fortaleza/CE. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 725-738. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2025.247492.