Porttinari - a Large Multi-genre Treebank for Brazilian Portuguese

  • Thiago Alexandre Salgueiro Pardo USP
  • Magali Sanches Duran USP
  • Lucelene Lopes USP
  • Ariani Di Felippo UFSCar
  • Norton Trevisan Roman USP
  • Maria das Graças Volpe Nunes USP

Abstract


This paper presents the project to build Porttinari, a large multi-genre treebank for Brazilian Portuguese. We address the main research questions raised by its construction and annotation and report the work completed so far. The treebank follows the widely adopted international "Universal Dependencies" model. It is intended to serve as the basis for developing state-of-the-art tagging and parsing systems for Portuguese, and for linguistic studies of the morphosyntax and syntax of the language.

Published
29/11/2021
PARDO, Thiago Alexandre Salgueiro; DURAN, Magali Sanches; LOPES, Lucelene; FELIPPO, Ariani Di; ROMAN, Norton Trevisan; NUNES, Maria das Graças Volpe. Porttinari - a Large Multi-genre Treebank for Brazilian Portuguese. In: SIMPÓSIO BRASILEIRO DE TECNOLOGIA DA INFORMAÇÃO E DA LINGUAGEM HUMANA (STIL), 13., 2021, Evento Online. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2021. p. 1-10. DOI: https://doi.org/10.5753/stil.2021.17778.