Analysis of a Brazilian Indigenous corpus using machine learning methods
Abstract
In Brazil, several minority languages face a serious risk of extinction. Appropriate documentation of these languages is a fundamental step toward preventing that. However, for some of them, only a small amount of text is digitally accessible. Meanwhile, many open issues remain in the identification of indigenous languages, a task that may help reveal key similarities among them and connect related languages and dialects. Therefore, this paper proposes to study and automatically classify 26 neglected Brazilian native languages, using a small amount of training data, in both supervised and unsupervised settings. Our findings indicate that applying machine learning models to the analysis of Brazilian Indigenous corpora is promising, and we hope this work encourages more research on this topic in the coming years.