Leveraging Large Language Models for Author Name Disambiguation in Portuguese Contexts

  • Samuel G. dos Santos (UEG)
  • Vitória M. Diniz (UEG)
  • Bartolomeu S. Gusella (UEG)
  • Natan de S. Rodrigues (UEG)

Abstract


Author Name Disambiguation (AND) is a fundamental task in digital libraries and repositories, especially in Portuguese contexts where metadata often lacks persistent identifiers and shows frequent inconsistencies. This study presents an unsupervised approach that integrates Large Language Models (LLMs) for semantic normalization, MiniLM embeddings for similarity modeling, and automatic clustering followed by a post-merging heuristic. Experiments on the BDBComp dataset show competitive cluster cohesion (K = 0.907) compared to baselines, while the pairwise F1 score (pF1 = 0.448) highlights the difficulty posed by highly ambiguous surnames. Future work will refine LLM summaries and clustering thresholds to improve accuracy while preserving cluster consistency.
Keywords: Author Name Disambiguation, Large Language Models, LLMs, Unsupervised Learning, Portuguese Contexts
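
The two scores quoted in the abstract can diverge, and a small worked example helps show why. The sketch below assumes the definitions commonly used in the author name disambiguation literature: K as the geometric mean of average cluster purity (ACP) and average author purity (AAP), and pF1 as the harmonic mean of pairwise precision and recall. The abstract does not spell out its formulas, so these definitions, the function names, and the toy labels are illustrative assumptions, not the authors' evaluation code.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def k_metric(predicted, truth):
    """K = sqrt(ACP * AAP), assuming the usual purity-based definition."""
    n = len(predicted)
    joint = Counter(zip(predicted, truth))  # records shared by cluster i and author j
    cluster_sizes = Counter(predicted)
    author_sizes = Counter(truth)
    # ACP: how pure each predicted cluster is with respect to the true authors.
    acp = sum(c * c / cluster_sizes[p] for (p, _), c in joint.items()) / n
    # AAP: how little each true author is fragmented across predicted clusters.
    aap = sum(c * c / author_sizes[t] for (_, t), c in joint.items()) / n
    return sqrt(acp * aap)


def pairwise_f1(predicted, truth):
    """Harmonic mean of pairwise precision and recall over all record pairs."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(predicted)), 2):
        same_cluster = predicted[i] == predicted[j]
        same_author = truth[i] == truth[j]
        if same_cluster and same_author:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_author:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): six citation records, three true authors.
truth = ["a", "a", "a", "b", "b", "c"]
predicted = [0, 0, 1, 1, 1, 2]  # one record of author "a" lands in author "b"'s cluster

k_score = k_metric(predicted, truth)
pf1_score = pairwise_f1(predicted, truth)
print(f"K = {k_score:.3f}, pF1 = {pf1_score:.3f}")
```

On this toy input a single misplaced record leaves K well above pF1, because pF1 counts every wrongly joined or wrongly split pair while K averages purity per record. The same effect, at larger scale, is one plausible reading of the gap between K = 0.907 and pF1 = 0.448 reported above.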

Published
04/12/2025
SANTOS, Samuel G. dos; DINIZ, Vitória M.; GUSELLA, Bartolomeu S.; RODRIGUES, Natan de S. Leveraging Large Language Models for Author Name Disambiguation in Portuguese Contexts. In: ESCOLA REGIONAL DE INFORMÁTICA DE GOIÁS (ERI-GO), 13., 2025, Luziânia/GO. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 349-353. DOI: https://doi.org/10.5753/erigo.2025.16983.