BERT vs. LLM2Vec: A Comparative Study of Embedding Models for Semantic Information Retrieval
Abstract
Semantic-based Information Retrieval (IR) has benefited significantly from advances in language models and embedding techniques. This work investigates the impact of different embedding strategies on the effectiveness of semantic retrieval, using a 1-NN classification protocol with the F1-score as the evaluation metric. We evaluate two model families: BERT variants and the novel LLM2Vec approach. Experiments conducted on six diverse datasets show that LLM2Vec models consistently outperform BERT-based ones across all metrics, with the Mistral-7B-Instruct-v2 model in its unsupervised configuration achieving the highest scores. Additionally, we demonstrate that LLM2Vec performance is robust to prompt variations, highlighting its practical applicability in IR systems.
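To make the evaluation protocol concrete, the sketch below shows how embedding quality can be scored with 1-NN classification and the F1-score, assuming scikit-learn and Sentence-Transformers are available; the checkpoint name, cosine metric, 10-fold cross-validation, and macro averaging are illustrative assumptions, not the authors' exact pipeline.

# Hypothetical sketch of the evaluation protocol described in the abstract:
# embed each document, predict its label from its single nearest neighbor,
# and score the predictions with the F1 measure. The encoder checkpoint,
# cross-validation scheme, and macro averaging are illustrative assumptions.
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sentence_transformers import SentenceTransformer

def evaluate_embeddings(texts, labels, model_name="bert-base-uncased"):
    # Encode the corpus into dense vectors; any encoder could be swapped in
    # here, e.g. an LLM2Vec-converted decoder instead of a BERT checkpoint.
    encoder = SentenceTransformer(model_name)
    X = encoder.encode(texts, normalize_embeddings=True)  # unit vectors
    y = np.asarray(labels)

    # 1-NN classification: each held-out document takes the label of its
    # closest neighbor under cosine similarity.
    knn = KNeighborsClassifier(n_neighbors=1, metric="cosine")
    preds = cross_val_predict(knn, X, y, cv=10)
    return f1_score(y, preds, average="macro")

Because the classifier is fixed, any difference in F1 between two runs of evaluate_embeddings is attributable to the embedding model alone, which is the comparison axis the paper studies.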
References
Abbasiantaeb, Z. and Momtazi, S. (2021). Text-based question answering from information retrieval and deep neural network perspectives: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 11(6):e1412.
BehnamGhader, P., Adlakha, V., Mosbach, M., Bahdanau, D., Chapados, N., and Reddy, S. (2024). Llm2vec: Large language models are secretly powerful text encoders. ArXiv, abs/2404.05961.
Bhopale, A. P. and Tiwari, A. (2024). Transformer based contextual text representation framework for intelligent information retrieval. Expert Systems with Applications, 238:121629.
Caspari, L., Dastidar, K. G., Zerhoudi, S., Mitrović, J., and Granitzer, M. (2024). Beyond benchmarks: Evaluating embedding model similarity for retrieval augmented generation systems. ArXiv, abs/2407.08275.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186.
Ding, M., Zhou, C., Yang, H., and Tang, J. (2020). Cogltx: Applying bert to long texts. Advances in Neural Information Processing Systems, 33:12792–12804.
Gao, S., Alawad, M., Young, M. T., Gounley, J., Schaefferkoetter, N., Yoon, H. J., Wu, X.-C., Durbin, E. B., Doherty, J., Stroup, A., et al. (2021). Limitations of transformers on clinical text classification. IEEE journal of biomedical and health informatics, 25(9):3596–3607.
Gomathi, D. S. and Lavanya, D. M. (2021). A survey on application of information retrieval models using nlp. Int. J. of Aquatic Science, 12(3):2129–2138.
Hambarde, K. A. and Proenca, H. (2023). Information retrieval: recent advances and beyond. IEEE Access, 11:76581–76604.
He, J., Rungta, M., Koleczek, D., Sekhon, A., Wang, F. X., and Hasan, S. (2024). Does prompt formatting have any impact on llm performance? arXiv preprint arXiv:2411.10541.
Li, X., Jin, J., Zhou, Y., Zhang, Y., Zhang, P., Zhu, Y., and Dou, Z. (2025). From matching to generation: A survey on generative information retrieval. ACM Transactions on Information Systems, 43(3):1–62.
Liu, Y.-A., Zhang, R., Guo, J., and de Rijke, M. (2025). Robust information retrieval. In Proceedings of the Eighteenth ACM International Conference on Web Search and Data Mining, pages 1008–1011.
MacAvaney, S., Yates, A., Cohan, A., and Goharian, N. (2019). Cedr: Contextualized embeddings for document ranking. In Proceedings of the 42nd international ACM SIGIR conference on research and development in information retrieval, pages 1101–1104.
Manning, C. D., Raghavan, P., and Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
Minaee, S., Mikolov, T., Nikzad, N., Chenaghlu, M. A., Socher, R., Amatriain, X., and Gao, J. (2024). Large language models: A survey. ArXiv, abs/2402.06196.
Rossi, R. G., Marcacini, R. M., and Rezende, S. O. (2013). Benchmarking text collections for classification and clustering tasks.
Roy, D., Ganguly, D., Bhatia, S., Bedathur, S., and Mitra, M. (2018). Using word embeddings for information retrieval: How collection and term normalization choices affect performance. In Proceedings of the 27th ACM international conference on information and knowledge management, pages 1835–1838.
Wang, J., Huang, J. X., Tu, X., Wang, J., Huang, A. J., Laskar, M. T. R., and Bhuiyan, A. (2024). Utilizing bert for information retrieval: Survey, applications, resources, and challenges. ACM Computing Surveys, 56(7):1–33.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q. V., and Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A., editors, Advances in Neural Information Processing Systems, volume 35, pages 24824–24837. Curran Associates, Inc.
Wu, Y., Zhang, H., and Huang, H. (2022). Retrievalguard: Provably robust 1-nearest neighbor image retrieval. In International Conference on Machine Learning, pages 24266–24279. PMLR.
Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., et al. (2023). A survey of large language models. arXiv preprint arXiv:2303.18223.
Published
2025-09-29
How to Cite
UTINO, Matheus Yasuo Ribeiro; MARCACINI, Ricardo Marcondes. BERT vs. LLM2Vec: A Comparative Study of Embedding Models for Semantic Information Retrieval. In: NATIONAL MEETING ON ARTIFICIAL AND COMPUTATIONAL INTELLIGENCE (ENIAC), 22., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 427–438. ISSN 2763-9061. DOI: https://doi.org/10.5753/eniac.2025.13224.
