Enhancing Retrieval-Augmented Generation through Sequential Fine-Tuning of Small Language Models

Abstract


Language Models (LMs) excel at general knowledge but often struggle in specialized domains, where complexity and constant evolution pose additional obstacles. This study enhances the performance of Retrieval-Augmented Generation (RAG) systems on the Question Answering (QA) task through sequential fine-tuning of the RAG components, employing Small Language Models (SLMs). Our approach adapts both the embedding model and the generative model using minimal computational resources and improves overall effectiveness compared to a vanilla RAG baseline. The proposed methodology is scalable and cost-effective, enabling the practical application of RAG systems across different domains and tasks.
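To make the two-stage idea concrete, the sketch below illustrates one plausible way to fine-tune a retriever and then a generative SLM with LoRA, assuming the Hugging Face sentence-transformers, transformers, and peft libraries. The model names (BAAI/bge-small-en-v1.5, microsoft/phi-2), hyperparameters, and toy training pair are illustrative assumptions, not the paper's actual configuration.

```python
# A minimal sketch of sequential fine-tuning for a RAG pipeline, under the
# assumptions stated above; it is not the authors' implementation.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Stage 1: adapt the embedding model (retriever) on in-domain
# (question, passage) pairs with in-batch negatives.
embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")  # assumed small embedder
pairs = [  # toy example; real training would use in-domain QA/passage pairs
    InputExample(texts=["What does Release 18 cover?",
                        "Release 18 specifies enhancements to ..."]),
]
loader = DataLoader(pairs, shuffle=True, batch_size=1)
loss = losses.MultipleNegativesRankingLoss(embedder)
embedder.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=0)
embedder.save("domain-embedder")  # the adapted embedder then re-indexes the corpus

# Stage 2: adapt the generative SLM with LoRA adapters, so that only a
# small fraction of the parameters is updated.
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,  # illustrative hyperparameters
    target_modules=["q_proj", "k_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of base weights
# Training then proceeds on (retrieved context + question -> answer) text
# with a standard causal-LM loss, e.g., via transformers.Trainer.
```

Running the two stages in sequence (retriever first, generator second) lets the generative model be tuned against contexts produced by the already-adapted retriever, which is the property that distinguishes sequential from independent fine-tuning.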

Keywords: Retrieval-Augmented Generation, Fine-Tuning, LoRA, Small Language Models, Information Retrieval, Specialized Domains

Published: 2025-09-29
VEGA CENTENO OLIVERA, Ronaldinho; SANTOS, Frances A.; DOS REIS, Julio Cesar; DE SOUZA, Allan M. Enhancing Retrieval-Augmented Generation through Sequential Fine-Tuning of Small Language Models. In: BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 40., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 250-263. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2025.247070.