Disambiguation of Universal Dependencies Part-of-Speech Tags of Closed Class Words in Portuguese

  • Conference paper
  • In: Intelligent Systems (BRACIS 2023)

Abstract

This paper explores methods to disambiguate Part-of-Speech (PoS) tags of closed class words in Brazilian Portuguese corpora annotated according to the Universal Dependencies annotation model. We evaluate disambiguation methods from different paradigms, namely a Markov-based method, a widely adopted parsing tool, and a BERT-based language modeling method. We compare their performance against two baselines and observe a significant improvement of more than 10% over the baselines for all proposed methods. We also show that, while the BERT-based model outperforms the others, reaching 98% accuracy in the best case when predicting the correct PoS tag, combining the three methods into an ensemble yields more stable results, as indicated by the smaller variance across our experiments.
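As a rough illustration of the ensemble idea mentioned in the abstract, the sketch below combines the token-level predictions of three taggers by majority vote, falling back to one designated tagger when all disagree. The tagger signatures, the tie-breaking rule, and the toy tags are assumptions made for illustration, not the paper's actual implementation.

```python
from collections import Counter
from typing import Callable, List

# A tagger maps a tokenized sentence to one UD PoS tag per token
# (placeholder signature; the paper's methods are not reproduced here).
Tagger = Callable[[List[str]], List[str]]

def ensemble_tag(tokens: List[str], taggers: List[Tagger], fallback: int = 0) -> List[str]:
    """Majority vote over the taggers' predictions, token by token.

    When all taggers disagree, keep the prediction of the tagger at
    index `fallback` (e.g. the BERT-based one).
    """
    predictions = [tagger(tokens) for tagger in taggers]
    voted = []
    for i in range(len(tokens)):
        votes = Counter(pred[i] for pred in predictions)
        tag, count = votes.most_common(1)[0]
        voted.append(tag if count > 1 else predictions[fallback][i])
    return voted

# Toy usage with dummy taggers standing in for the Markov-based method,
# the parsing tool, and the BERT-based model:
markov = lambda toks: ["SCONJ"] * len(toks)
parser = lambda toks: ["PRON"] * len(toks)
bert   = lambda toks: ["SCONJ"] * len(toks)
print(ensemble_tag(["que"], [markov, parser, bert]))  # ['SCONJ']
```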

Notes

  1. Similarly to English, where adjectives can be turned into adverbs with the ending “-ly”, in Portuguese adjectives can be turned into adverbs by adding “-mente” at the end. We disregard those derived adverbs, as only the primitive adverbs form a closed class.

  2. The verbs “ser” and “estar” (“to be” in English) are always annotated as AUX, either as true auxiliary verbs or as copula verbs. The verbs “ir”, “haver”, and “ter” (“to go”, “to exist”, and “to have” in English) are sometimes annotated as VERB and sometimes as AUX (as in “going to” and “have” + past participle in English).

  3. While adjectives are not a closed class, the adjectives that are ordinal numbers are considered to belong to a closed subset of the class ADJ.

  4. https://github.com/nilc-nlp/pos-disambiguation.

  5. The values stated as averages are the macro averages over the folds; since the folds have approximately the same size, the micro and macro averages are practically identical (less than 0.01% difference), as illustrated in the sketch after these notes.

  6. For reproducibility purposes, all data (including fold splits) and implementations of all methods are available at https://sites.google.com/icmc.usp.br/poetisa/publications.
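To make Note 5 concrete, here is a minimal sketch, with made-up fold counts rather than the paper's data, contrasting the macro average (mean of per-fold accuracies) with the micro average (pooled accuracy over all folds):

```python
# Hypothetical per-fold results (illustrative counts, not the paper's data):
# each pair is (correctly tagged tokens, total tokens in the fold).
folds = [(9810, 10000), (9790, 10000), (9805, 10000),
         (9798, 10000), (9802, 10000)]

# Macro average: mean of the per-fold accuracies (the value reported in the paper).
macro = sum(c / n for c, n in folds) / len(folds)

# Micro average: pooled accuracy over all folds.
micro = sum(c for c, _ in folds) / sum(n for _, n in folds)

print(f"macro = {macro:.4f}  micro = {micro:.4f}")
# With equally sized folds the two coincide exactly; with nearly equal
# folds they differ by a negligible amount, matching the note above.
```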

Acknowledgements

This work was carried out at the Center for Artificial Intelligence (C4AI-USP), with support from the São Paulo Research Foundation (FAPESP grant number 2019/07665-4) and from the IBM Corporation. The project was also supported by the Ministry of Science, Technology and Innovation, with resources from Law N. 8.248 of October 23, 1991, within the scope of PPI-SOFTEX, coordinated by Softex and published as Residence in TIC 13, DOU 01245.010222/2022-44.

Author information

Corresponding author

Correspondence to Lucelene Lopes.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Lopes, L., Fernandes, P., Inacio, M.L., Duran, M.S., Pardo, T.A.S. (2023). Disambiguation of Universal Dependencies Part-of-Speech Tags of Closed Class Words in Portuguese. In: Naldi, M.C., Bianchi, R.A.C. (eds) Intelligent Systems. BRACIS 2023. Lecture Notes in Computer Science, vol. 14197. Springer, Cham. https://doi.org/10.1007/978-3-031-45392-2_16

  • DOI: https://doi.org/10.1007/978-3-031-45392-2_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-45391-5

  • Online ISBN: 978-3-031-45392-2

  • eBook Packages: Computer Science, Computer Science (R0)
