Exploring variations in the tagset and in the Universal Dependencies (UD) annotation for Portuguese: Possibilities and results based on the PetroGold treebank

Elvis de Souza; Cláudia Freitas

doi:10.5753/stil.2023.233980

Elvis de Souza PUC-Rio http://orcid.org/0000-0001-9373-7412
Cláudia Freitas PUC-Rio https://orcid.org/0000-0001-6807-8558

Abstract

This paper analyzes annotation variations in PetroGold, a gold-standard treebank developed for natural language processing (NLP). The results show that two changes yield models that perform better on some metrics: annotating every word of a multiword expression with the part of speech of the expression as a whole, and simplifying the treebank's syntactic tagset. These findings highlight the importance of linguistic modeling during annotation for achieving adequate results in NLP. The datasets used in the study are available in a dedicated repository and can be modified further to train better language models.
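
As a rough illustration of the two kinds of variation described above, the sketch below rewrites the token rows of one CoNLL-U sentence. It is not the authors' procedure. It assumes, purely for illustration, that the class of a fixed multiword expression can be read from the UPOS of the token that heads its `fixed` relations, and that tagset simplification amounts to dropping relation subtypes and applying a small merge map. The names `MERGE`, `simplify_deprel`, and `transform_sentence`, as well as the `iobj` to `obj` merge, are hypothetical.

```python
# Hypothetical sketch: two tagset variations applied to CoNLL-U token rows.
# Assumptions (not taken from the paper): the MWE-level class is the UPOS of
# the head of each `fixed` group, and MERGE is an illustrative mapping only.

MERGE = {"iobj": "obj"}  # illustrative; the paper's actual mapping may differ

# Column indices of the 10-column CoNLL-U format.
ID, UPOS, HEAD, DEPREL = 0, 3, 6, 7


def simplify_deprel(deprel, merge=MERGE):
    """Drop any subtype ('obl:arg' -> 'obl'), then apply the merge map."""
    base = deprel.split(":", 1)[0]
    return merge.get(base, base)


def transform_sentence(rows):
    """Return a modified copy of `rows` (lists of 10 strings, one per token)."""
    upos_by_id = {r[ID]: r[UPOS] for r in rows}
    out = []
    for r in rows:
        r = list(r)  # copy so the input is left untouched
        if r[DEPREL] == "fixed":
            # Give the MWE member the UPOS of the word that heads the MWE.
            r[UPOS] = upos_by_id.get(r[HEAD], r[UPOS])
        r[DEPREL] = simplify_deprel(r[DEPREL])
        out.append(r)
    return out


# Fragment: "de acordo com regras" ("according to rules").
sentence = [
    ["1", "de", "de", "ADP", "_", "_", "4", "case", "_", "_"],
    ["2", "acordo", "acordo", "NOUN", "_", "_", "1", "fixed", "_", "_"],
    ["3", "com", "com", "ADP", "_", "_", "1", "fixed", "_", "_"],
    ["4", "regras", "regra", "NOUN", "_", "_", "0", "root", "_", "_"],
]
modified = transform_sentence(sentence)
```

In this fragment, "acordo" would be retagged from NOUN to ADP because it attaches to "de" through `fixed`; all other columns are left unchanged. Each variation could equally be applied on its own to produce separate training datasets.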

Keywords: PetroGold, treebank annotation, Portuguese dataset, linguistic representation for NLP, Universal Dependencies for Portuguese, Natural Language Processing
