A Hybrid Machine Learning Method to Author Name Disambiguation

Natan S. Rodrigues; Celia G. Ralha

doi:10.5753/stil.2024.245440

Natan S. Rodrigues UEG / UnB http://orcid.org/0000-0002-0785-4397
Celia G. Ralha UnB https://orcid.org/0000-0002-2983-2180

Abstract

Digital bibliographic repositories, including publications, authors, and research fields, are essential for sharing scientific information. However, author name ambiguity undermines the efficiency of information retrieval, extraction, and classification in such archives. This paper addresses the Author Name Disambiguation (AND) problem by proposing a hybrid machine learning method that integrates Bidirectional Encoder Representations from Transformers (BERT), a Graph Convolutional Network (GCN), and Graph Enhanced Hierarchical Agglomerative Clustering (GHAC). The BERT model extracts textual data from scientific documents, the GCN structures global data from academic graphs, and GHAC considers the global context of heterogeneous networks to identify scientific collaboration patterns. We compare the hybrid method with state-of-the-art AND work on a publicly accessible data set of 7,886 documents, 137 unique authors, and 14 groups of ambiguous authors, using recognized validation metrics. The method achieved a precision of 93.8%, recall of 96.3%, F1-measure of 95%, Average Cluster Purity (ACP) of 96.5%, Average Author Purity (AAP) of 97.4%, and K-Metric of 96.9%. These results are better than those of the AND baseline approach, which indicates that the hybrid method is promising.

Keywords: AND, BERT, Digital Bibliographic Repositories, GCN, GHAC
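
The abstract reports ACP, AAP, K-Metric, and F1 without giving their formulas. The sketch below assumes the definitions commonly used in the AND literature: ACP and AAP as size-weighted purity of predicted clusters and of true authors, the K-Metric as their geometric mean, and F1 as the harmonic mean of precision and recall. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import isclose, sqrt


def cluster_metrics(true_authors, predicted_clusters):
    """Return (ACP, AAP, K) for two parallel label lists, one entry per document.

    Assumed definitions (not stated in the abstract):
      ACP = (1/N) * sum over clusters i, authors j of n_ij^2 / n_i
      AAP = (1/N) * sum over authors j, clusters i of n_ij^2 / n_j
      K   = sqrt(ACP * AAP)
    """
    n = len(true_authors)
    cluster_sizes = Counter(predicted_clusters)   # n_i
    author_sizes = Counter(true_authors)          # n_j
    overlap = Counter(zip(predicted_clusters, true_authors))  # n_ij
    acp = sum(c * c / cluster_sizes[cl] for (cl, _), c in overlap.items()) / n
    aap = sum(c * c / author_sizes[au] for (_, au), c in overlap.items()) / n
    return acp, aap, sqrt(acp * aap)


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: six documents, three true authors, three clusters.
true_authors = ["a", "a", "a", "b", "b", "c"]
predicted = [0, 0, 1, 1, 1, 2]
acp, aap, k = cluster_metrics(true_authors, predicted)

# Consistency check against the figures quoted in the abstract.
k_reported = sqrt(0.965 * 0.974)
f1_reported = f1_score(0.938, 0.963)
```

Under these assumed definitions, the reported ACP and AAP give a K-Metric that rounds to 96.9%, and the reported precision and recall give an F1 that rounds to 95%, so the abstract's figures are mutually consistent.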
