Improving researcher's area of expertise identification using TF-IDF Characters N-grams

  • Felipe Penhorate Carvalho da Fonseca Universidade de São Paulo
  • Luciano Antonio Digiampietri Universidade de São Paulo


As academic information has become broadly available online through academic social networks and academic profiles, its use in tasks such as discovering specialists in a given area, identifying potential scholarship holders, and suggesting collaborators has grown in importance and relevance. The Brazilian government created the Lattes Platform, an academic social network, to manage academic data on Brazilian researchers and to support the evaluation of researchers and research groups. However, before Lattes Platform information can be used for the aforementioned tasks, the quality of the data must be checked, because most of it is self-declared by users and is not verified, especially the declared main area of expertise. This article therefore explores machine learning techniques for recognizing researchers' main areas of expertise, using several numerical representations of the titles of their scientific production as the data source for the algorithms. Using a TF-IDF character n-gram representation of the title text, we surpass the current state-of-the-art results for this problem, achieving an accuracy of 95.91%.
Keywords: Text Mining, Information extraction, Feature representation
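
The idea of representing publication titles as TF-IDF character n-grams can be illustrated with a minimal sketch. The abstract does not state the n-gram range, the classifier, or the preprocessing used, so everything below is an assumption made for illustration: n-grams of 2 to 4 characters, a smoothed IDF, a nearest-neighbour rule standing in for the learning algorithm, and four invented toy titles with hypothetical area labels.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=2, n_max=4):
    """Return all character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]


def fit_idf(documents):
    """Smoothed inverse document frequency for every n-gram seen in training."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(char_ngrams(doc)))
    total = len(documents)
    return {g: math.log((1 + total) / (1 + df)) + 1 for g, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """L2-normalised TF-IDF vector as a sparse dict; unseen n-grams are ignored."""
    counts = Counter(g for g in char_ngrams(text) if g in idf)
    vec = {g: tf * idf[g] for g, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {g: w / norm for g, w in vec.items()}


def cosine(a, b):
    """Dot product of two sparse vectors (cosine similarity for unit vectors)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(g, 0.0) for g, w in a.items())


def predict_area(title, labelled_titles, idf):
    """Stand-in classifier (an assumption): label of the most similar training title."""
    query = tfidf_vector(title, idf)
    best = max(labelled_titles,
               key=lambda pair: cosine(query, tfidf_vector(pair[0], idf)))
    return best[1]


# Hypothetical toy data; real input would be titles from researcher profiles.
training = [
    ("Deep learning for image classification", "Computer Science"),
    ("A graph algorithm for social network analysis", "Computer Science"),
    ("Clinical trial of a new hypertension drug", "Medicine"),
    ("Surgical treatment outcomes in cardiac patients", "Medicine"),
]
idf = fit_idf([title for title, _ in training])
predicted = predict_area("Classifying images with neural networks", training, idf)
print(predicted)
```

Character n-grams let "classifying" and "classification" share most of their features without stemming, which is one plausible reason such a representation suits short, morphologically varied titles. The sketch does not reproduce the reported accuracy and is not the authors' pipeline.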


How to Cite

DA FONSECA, Felipe Penhorate Carvalho; DIGIAMPIETRI, Luciano Antonio. Improving researcher's area of expertise identification using TF-IDF Characters N-grams. In: SIMPÓSIO BRASILEIRO DE SISTEMAS DE INFORMAÇÃO (SBSI), 17., 2021, Uberlândia. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2021.