Yauti: A Tool for Morphosyntactic Analysis of Nheengatu within the Universal Dependencies Framework

doi:10.5753/stil.2023.234131

Leonel Figueiredo de Alencar UFC https://orcid.org/0000-0001-8148-6994

Abstract

This paper reports on Yauti, a rule-based morphosyntactic analyzer for Nheengatu, an endangered Brazilian indigenous language. Its goal is to generate analyses in the CoNLL-U format of the Universal Dependencies (UD) framework. It has been developed in parallel with the construction of the Nheengatu treebank of the UD collection. For sentences consisting only of known, unambiguous words, the tool generally delivers good results. It obtained a labeled attachment score (LAS) of 73.2% on a version of the Nheengatu UD treebank in which all 1022 sentences were automatically provided with XPOS tags and a special annotation to handle non-lexicalized words.
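To make the reported metric concrete, the following minimal sketch shows how LAS can be computed from two CoNLL-U analyses of the same text. It is not Yauti's code or the evaluation script used for the paper. The function names `read_arcs` and `attachment_scores`, the helper `row`, and the placeholder tokens `w1` to `w4` are all hypothetical, and the sketch assumes identical tokenization in both files and compares only the universal part of each relation label (subtypes after `:` are ignored).

```python
def row(idx, form, upos, head, deprel):
    """Build one 10-column CoNLL-U line; unused columns are '_' (toy data only)."""
    return "\t".join([str(idx), form, "_", upos, "_", "_", str(head), deprel, "_", "_"])


def read_arcs(conllu_text):
    """Return a (HEAD, DEPREL) pair for every syntactic word, in file order."""
    arcs = []
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        # Assumption: keep only the universal relation, dropping any subtype.
        arcs.append((cols[6], cols[7].split(":")[0]))
    return arcs


def attachment_scores(gold_text, system_text):
    """Return (UAS, LAS) as fractions, assuming both texts have the same words."""
    gold = read_arcs(gold_text)
    system = read_arcs(system_text)
    if not gold or len(gold) != len(system):
        raise ValueError("gold and system analyses must contain the same words")
    # UAS counts correct heads; LAS additionally requires the correct label.
    uas_hits = sum(g[0] == s[0] for g, s in zip(gold, system))
    las_hits = sum(g == s for g, s in zip(gold, system))
    return uas_hits / len(gold), las_hits / len(gold)


# Hypothetical four-word sentence with placeholder forms, not real Nheengatu data.
GOLD = "\n".join([
    row(1, "w1", "NOUN", 2, "nsubj"),
    row(2, "w2", "VERB", 0, "root"),
    row(3, "w3", "NOUN", 2, "obj"),
    row(4, "w4", "ADV", 2, "advmod"),
])
SYSTEM = "\n".join([
    row(1, "w1", "NOUN", 2, "nsubj"),
    row(2, "w2", "VERB", 0, "root"),
    row(3, "w3", "NOUN", 2, "obl"),     # correct head, wrong label
    row(4, "w4", "ADV", 3, "advmod"),   # wrong head
])

uas, las = attachment_scores(GOLD, SYSTEM)
print(f"UAS = {uas:.2%}, LAS = {las:.2%}")
```

In this toy case three of four heads are correct and two of four words have both the correct head and the correct label, so the sketch reports a UAS of 75% and a LAS of 50%.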

Keywords: Universal Dependencies, Treebank, Corpus Annotation, Dependency Parsing, Morphological Generator, Syntactic Parsing, Morphological Parsing, Automatic Morphosyntactic Analysis, Part-of-Speech Tagging, Low-resource Language, Nheengatu, Tupian
