A Hybrid Machine Learning Method to Author Name Disambiguation
Abstract
Digital bibliographic repositories, which index publications, authors, and research fields, are essential for sharing scientific information. Nevertheless, author name ambiguity undermines the efficiency of information retrieval, extraction, and classification in such archives. This paper addresses the Author Name Disambiguation (AND) problem by proposing a hybrid machine learning method that integrates Bidirectional Encoder Representations from Transformers (BERT), a Graph Convolutional Network (GCN), and Graph Enhanced Hierarchical Agglomerative Clustering (GHAC). The BERT model extracts textual data from scientific documents, the GCN structures global data from academic graphs, and GHAC considers the global context of heterogeneous networks to identify scientific collaboration patterns. We compare the hybrid method with state-of-the-art AND work using recognized validation metrics and a publicly accessible data set consisting of 7,886 documents, 137 unique authors, and 14 groups of ambiguous authors. The hybrid method achieved a precision of 93.8%, recall of 96.3%, F1-measure of 95%, Average Cluster Purity (ACP) of 96.5%, Average Author Purity (AAP) of 97.4%, and K-Metric of 96.9%. These results are better than those of the AND baseline approach, which indicates that the hybrid method is promising.
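As an illustration of the cluster-level validation metrics named above, the following minimal sketch computes ACP, AAP, and the K-Metric using their standard definitions from the AND literature: ACP averages how pure each predicted cluster is, AAP averages how well each true author is kept together, and the K-Metric is their geometric mean. The function name `cluster_metrics` and the toy labels are assumptions made for this example; this is not the authors' evaluation code.

```python
from collections import Counter
from math import sqrt


def cluster_metrics(true_authors, predicted_clusters):
    """Return (ACP, AAP, K) for two parallel label lists, one entry per document.

    Hypothetical helper written for illustration only.
    """
    if len(true_authors) != len(predicted_clusters):
        raise ValueError("label lists must have the same length")
    n = len(true_authors)
    if n == 0:
        raise ValueError("at least one document is required")

    cluster_sizes = Counter(predicted_clusters)  # n_i: documents in cluster i
    author_sizes = Counter(true_authors)         # n_j: documents by author j
    # n_ij: documents of author j that were placed in cluster i
    overlap = Counter(zip(predicted_clusters, true_authors))

    # ACP = (1/N) * sum_i sum_j n_ij^2 / n_i
    acp = sum(c * c / cluster_sizes[cl] for (cl, _), c in overlap.items()) / n
    # AAP = (1/N) * sum_j sum_i n_ij^2 / n_j
    aap = sum(c * c / author_sizes[au] for (_, au), c in overlap.items()) / n
    # K-Metric = geometric mean of ACP and AAP
    return acp, aap, sqrt(acp * aap)


# Toy example: five documents, two true authors, two predicted clusters.
true_authors = ["a", "a", "a", "b", "b"]
predicted_clusters = [0, 0, 1, 1, 1]
acp, aap, k = cluster_metrics(true_authors, predicted_clusters)
print(f"ACP={acp:.3f} AAP={aap:.3f} K={k:.3f}")
```

Under these definitions the reported figures are mutually consistent: the geometric mean of an ACP of 96.5% and an AAP of 97.4% is about 96.9%, and the harmonic mean of a precision of 93.8% and a recall of 96.3% is about 95%.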