Yauti: A Tool for Morphosyntactic Analysis of Nheengatu within the Universal Dependencies Framework
Abstract
This paper reports on Yauti, a rule-based morphosyntactic analyzer for Nheengatu, an endangered indigenous language of Brazil. Its goal is to generate analyses in the CoNLL-U format of the Universal Dependencies (UD) framework. It has been developed in tandem with the construction of the Nheengatu treebank of the UD collection. For sentences consisting only of known and unambiguous words, the tool generally delivers good results. It obtained a labeled attachment score (LAS) of 73.2% on a version of the Nheengatu UD treebank in which all 1022 sentences were automatically provided with XPOS tags and a special annotation to handle non-lexicalized words.
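For readers unfamiliar with the metric, the following is a minimal sketch of how LAS can be computed from two CoNLL-U analyses of the same text. It is not the evaluation procedure used for Yauti: the function names `read_arcs` and `las` are hypothetical, the tokens `w1` to `w4` are placeholders rather than Nheengatu data, and the sketch assumes that gold and predicted files share the same tokenization and that full dependency labels are compared.

```python
def conllu_line(idx, form, head, deprel):
    """Build one 10-column CoNLL-U line; unused columns are left as '_'."""
    return "\t".join([str(idx), form, "_", "_", "_", "_", str(head), deprel, "_", "_"])


def read_arcs(conllu):
    """Return the (HEAD, DEPREL) pair of every syntactic word in a CoNLL-U string."""
    arcs = []
    for line in conllu.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank separator or sentence-level comment
        cols = line.split("\t")
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        arcs.append((cols[6], cols[7]))
    return arcs


def las(gold, pred):
    """Percentage of words whose head AND dependency label both match the gold analysis."""
    gold_arcs = read_arcs(gold)
    pred_arcs = read_arcs(pred)
    if len(gold_arcs) != len(pred_arcs):
        raise ValueError("gold and predicted analyses must share the same tokenization")
    if not gold_arcs:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_arcs, pred_arcs))
    return 100.0 * correct / len(gold_arcs)


# Placeholder sentence of four words; the prediction mislabels word 3.
GOLD = "\n".join([
    "# sent_id = toy-1",
    conllu_line(1, "w1", 2, "nsubj"),
    conllu_line(2, "w2", 0, "root"),
    conllu_line(3, "w3", 2, "obj"),
    conllu_line(4, "w4", 2, "punct"),
])
PRED = "\n".join([
    "# sent_id = toy-1",
    conllu_line(1, "w1", 2, "nsubj"),
    conllu_line(2, "w2", 0, "root"),
    conllu_line(3, "w3", 2, "obl"),
    conllu_line(4, "w4", 2, "punct"),
])

print(las(GOLD, PRED))  # 3 of 4 words are fully correct
```

In this toy case three of the four words have both the correct head and the correct label, so the score is 75.0. Evaluation scripts used in practice may additionally align differing tokenizations or treat relation subtypes differently, which this sketch does not attempt.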