Porttinari - a Large Multi-genre Treebank for Brazilian Portuguese

doi:10.5753/stil.2021.17778

Thiago Alexandre Salgueiro Pardo USP
Magali Sanches Duran USP
Lucelene Lopes USP
Ariani Di Felippo UFSCar
Norton Trevisan Roman USP
Maria das Graças Volpe Nunes USP

Abstract

This paper presents the project to build Porttinari, a large multi-genre treebank for Brazilian Portuguese. We address the main research questions raised by its construction and annotation, and report the work completed so far. The treebank follows the Universal Dependencies international model, which is widely adopted in the area. It is intended to serve as the basis for developing state-of-the-art tagging and parsing systems for Portuguese, and for linguistic studies of the morphosyntax and syntax of the language.
