Universal Dependencies for Tweets in Brazilian Portuguese: Tokenization and Part of Speech Tagging

  • Emanuel Huber da Silva (USP)
  • Thiago Alexandre Salgueiro Pardo (USP)
  • Norton Trevisan Roman (USP)
  • Ariani Di Fellipo (UFSCar)


Automatically processing natural-language User-Generated Content (UGC) is a challenging and important task, given the amount of information available on the web. In this paper, we present an effort to build tokenization and Part of Speech (PoS) tagging systems for tweets in Brazilian Portuguese, following the guidelines of the Universal Dependencies (UD) project. We propose a rule-based tokenizer and customize current state-of-the-art UD-based tagging strategies for Portuguese, achieving a 98% F-score for tokenization and a 95% F-score for PoS tagging. We also introduce DANTEStocks, the corpus of stock market tweets on which we base our work, and present preliminary evidence of the multi-genre capacity of our PoS tagger.


SILVA, Emanuel Huber da; PARDO, Thiago Alexandre Salgueiro; ROMAN, Norton Trevisan; FELLIPO, Ariani Di. Universal Dependencies for Tweets in Brazilian Portuguese: Tokenization and Part of Speech Tagging. In: ENCONTRO NACIONAL DE INTELIGÊNCIA ARTIFICIAL E COMPUTACIONAL (ENIAC), 18., 2021, Evento Online. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2021. p. 434-445. ISSN 2763-9061. DOI: https://doi.org/10.5753/eniac.2021.18273.

