Genipapo - A Multigenre Dependency Parser for Brazilian Portuguese

Ariani Di Felippo; Norton T. Roman; Bryan K. S. Barbosa; Thiago A. S. Pardo

doi:10.5753/stil.2024.245415

Ariani Di Felippo USP / UFSCar http://orcid.org/0000-0002-4566-9352
Norton T. Roman USP https://orcid.org/0000-0002-0563-2045
Bryan K. S. Barbosa USP / UFSCar https://orcid.org/0000-0002-4637-6498
Thiago A. S. Pardo USP https://orcid.org/0000-0003-2111-1319

Abstract

This paper presents a pioneering effort to develop a multigenre parsing model for Brazilian Portuguese. Following the Universal Dependencies project, we trained one of the state-of-the-art models on three gold-standard corpora of different text genres (journalistic, academic, and user-generated content, namely posts from X). The experiments show that our multigenre parsing model produces results that are better than or competitive with those of single-genre models.

Keywords: dependency parser, multigenre, Universal Dependencies, Brazilian Portuguese
