An NLP approach to impersonal –se in Brazilian Portuguese
Abstract
This paper introduces an annotation proposal for the reflexive pronoun –se in Brazilian Portuguese with a view to classifying different strategies for impersonalization through the use of one supercategory. We carried out experiments on a gold standard treebank for Portuguese in the Universal Dependencies project and verified that implementing our proposal yields a morphosyntactic annotation model whose accuracy on syntactic dependencies is 1.27 percentage points higher. Moreover, a detailed evaluation showed an increase of up to 6.34 points in accuracy in the annotation of verb arguments, one of the most important classes for various Natural Language Processing tasks, highlighting the importance of informed linguistic modeling decisions for practical NLP results.