Applying Data Augmentation for Disambiguating Author Names

Luciano V. B. Espiridião; Laura L. Dias; Anderson A. Ferreira

doi:10.5753/sbbd.2021.17870

Luciano V. B. Espiridião Instituto Federal de Minas Gerais (IFMG) / Universidade Federal de Ouro Preto (UFOP)
Laura L. Dias Universidade Federal de Ouro Preto (UFOP)
Anderson A. Ferreira Universidade Federal de Ouro Preto (UFOP)

Abstract

Author name ambiguity is one of the most challenging issues that can compromise information quality in a scholarly digital library. For years, researchers have searched for solutions to this problem. Despite the many methods already proposed, the question remains open. In this study, we address the issue of producing a more accurate disambiguation function by applying data augmentation to the training data set. We propose a SyGAR-based data augmentation approach and evaluate it on three collections commonly used in work on the author name disambiguation task. The experimental results show scenarios in which improvements in the author name disambiguation task are possible. The proposed data augmentation approach outperforms another data augmentation approach and also improves some machine learning techniques that were not specifically designed for the author name disambiguation task.

Keywords: Author Disambiguation, Data Augmentation, Machine Learning
