An Entity Resolution Approach Based on Word Embeddings and Knowledge Bases for Microblog Texts

Luan Souza; Anderson Ferreira

Luan Souza, UFOP
Anderson Ferreira, UFOP

Abstract

In data management for information systems, entity resolution proposals usually operate on structured data or on long texts that contain contextual information. In short texts, such as microblog posts, the lack of context makes it harder to disambiguate the named entities they mention. Word embeddings, in turn, have shown promise both for enriching contextual information and for estimating similarity. In this work, we therefore propose an approach for disambiguating named entities gathered from short texts by linking them to documents in a knowledge base, using word embeddings and three strategies to find the correct document. Strategy 1 relies on the other entity names in the short text. Strategy 2 exploits the categories of the candidate documents to be linked to the names. Strategy 3 relies on the similarity between the documents associated with the other named entities in the text and the candidate documents to be linked to the target named entity. In our experimental evaluation, the proposed approach outperforms other approaches commonly used for the entity resolution task.

Keywords: entity resolution, word embedding, named entity, information system
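
The abstract does not say how the similarity between a short text and a candidate document is computed. As a rough illustration of the general idea behind Strategy 1 (using the other entity names in the text to choose among candidate documents), the sketch below assumes that each side is represented by the average of its word vectors and compared with cosine similarity. The toy embeddings, the candidate documents, and the function names `mean_vector`, `cosine`, and `rank_candidates` are all hypothetical and are not taken from the paper.

```python
import math

# Toy 3-dimensional embeddings, invented for illustration only.
# A real system would load pretrained vectors instead.
TOY_EMBEDDINGS = {
    "iphone": [0.8, 0.2, 0.1],
    "company": [0.7, 0.3, 0.0],
    "technology": [0.8, 0.1, 0.1],
    "fruit": [0.0, 0.9, 0.2],
    "tree": [0.1, 0.8, 0.3],
    "food": [0.1, 0.9, 0.1],
}


def mean_vector(words, embeddings=TOY_EMBEDDINGS):
    """Average the vectors of the known words; return None if none is known."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(context_words, candidates):
    """Rank candidate documents by similarity to the context words.

    `candidates` maps a document title to a list of words describing it.
    Returns (title, score) pairs sorted from most to least similar.
    """
    context_vec = mean_vector(context_words)
    scored = []
    for title, doc_words in candidates.items():
        doc_vec = mean_vector(doc_words)
        if context_vec is None or doc_vec is None:
            score = 0.0  # nothing to compare against
        else:
            score = cosine(context_vec, doc_vec)
        scored.append((title, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical candidates for the ambiguous mention "Apple" in a short text
# that also mentions "iPhone".
CANDIDATES = {
    "Apple Inc.": ["company", "technology"],
    "Apple (fruit)": ["fruit", "tree", "food"],
}
ranking = rank_candidates(["iphone"], CANDIDATES)
best_title = ranking[0][0]
```

With these made-up vectors, the context word "iphone" lies closer to the words describing the company than to those describing the fruit, so the company document is ranked first. Strategies 2 and 3 could plausibly reuse the same ranking function with category words or with the text of documents linked to the other entities as input, but that is an assumption rather than something the abstract states.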
