Evaluation of text similarity metrics to remove name ambiguity of authors
Abstract
Name disambiguation consists of identifying the different names in a bibliographic database that refer to the same author. One primitive for tackling this problem is a string similarity metric applied to pairs of author names. This article evaluates the performance of three string similarity measures (Levenshtein, LCS, TLSH) using a real database containing more than 10,000 authors who appear under more than one name, within a universe of 7.3 million distinct names. A methodology based on ranking names by their distance is applied to compare the similarity metrics more accurately. Results clearly indicate that LCS outperforms the other two measures, yet it still fails to identify the synonymous names in a large fraction of cases.
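The rank-based comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It makes several assumptions: LCS is read as longest common subsequence and turned into a distance as len(a) + len(b) - 2 * LCS(a, b); the rank of a synonym is one plus the number of candidate names strictly closer to the query; and the names are invented for illustration. The helper names (levenshtein, lcs_length, lcs_distance, synonym_rank) are hypothetical. TLSH is left out because it needs a third-party library.

```python
from typing import Callable, Iterable


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_distance(a: str, b: str) -> int:
    """Assumed LCS-based distance: characters of a and b outside the LCS."""
    return len(a) + len(b) - 2 * lcs_length(a, b)


def synonym_rank(query: str, synonym: str, candidates: Iterable[str],
                 distance: Callable[[str, str], int]) -> int:
    """1-based rank of the known synonym when candidates are ordered by
    distance to the query (ties resolved in favour of the synonym)."""
    d_syn = distance(query, synonym)
    closer = sum(1 for c in candidates
                 if c != synonym and distance(query, c) < d_syn)
    return closer + 1


if __name__ == "__main__":
    # Hypothetical names; the first candidate is the known synonym.
    query = "maria silva"
    candidates = ["maria s. silva", "mario silva", "marta silvia"]
    for name, metric in [("Levenshtein", levenshtein), ("LCS", lcs_distance)]:
        rank = synonym_rank(query, candidates[0], candidates, metric)
        print(f"{name}: synonym ranked {rank} of {len(candidates)}")
```

A rank of 1 means the metric places the true synonym closest to the query. Collecting these ranks over many known synonym pairs gives a distribution that can be compared across metrics without choosing a distance threshold.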
References
Ferreira, A. A., Gonçalves, M. A., and Laender, A. H. (2012). A brief survey of automatic methods for author name disambiguation. SIGMOD Rec., 41(2):15–26.
Gomide, J., Kling, H., and Figueiredo, D. (2021). Consolidating identities in anonymous ego-centred collaboration networks. Journal of Complex Networks, 9(1).
Han, H., Giles, L., Zha, H., Li, C., and Tsioutsiouliklis, K. (2004). Two supervised learning approaches for name disambiguation in author citations. In ACM/IEEE Conference on Digital Libraries, pages 296–305.
Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8):707–710.
Ley, M. (2002). The dblp computer science bibliography: Evolution, research issues, perspectives. In International Symposium on String Processing and Information Retrieval, pages 1–10.
Oliver, J., Cheng, C., and Chen, Y. (2013). TLSH – a locality sensitive hash. In 2013 Fourth Cybercrime and Trustworthy Computing Workshop, pages 7–13.
Sanyal, D. K., Bhowmick, P. K., and Das, P. P. (2021). A review of author name disambiguation techniques for the PubMed bibliographic database. Journal of Information Science, 47(2):227–254.
Yang, K.-H., Peng, H.-T., Jiang, J.-Y., Lee, H.-M., and Ho, J.-M. (2008). Author name disambiguation for citations using topic and web correlation. In International Conference on Theory and Practice of Digital Libraries, pages 185–196.
Zhou, G., Zhang, J., Su, J., Shen, D., and Tan, C. (2004). Recognizing names in biomedical texts: a machine learning approach. Bioinformatics, 20(7):1178–1190.
