Tackling neural machine translation in low-resource settings: a Portuguese case study

  • Arthur T. Estrella UFRJ
  • João B. O. Souza Filho UFRJ

Abstract


Neural machine translation (NMT) requires ever larger amounts of data and computational power, so succeeding at this task with limited data and a single GPU can be challenging. Strategies such as pre-trained word embeddings, subword embeddings, and data augmentation can potentially address some of the issues faced in low-resource experimental settings, but their impact on translation quality is unclear. This work evaluates some of these strategies in two low-resource experiments and goes beyond reporting BLEU: errors on the Portuguese-English pair are categorized with the help of a translator, considering semantic and syntactic aspects. The BPE subword approach proved to be the most effective solution, allowing a BLEU increase of 59% p.p. compared to the standard Transformer.
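As a rough illustration of the BPE subword idea mentioned above, the sketch below learns a few byte-pair merges from a toy word list and uses them to segment an unseen word. It is a minimal sketch and not the authors' implementation: the function names (`learn_bpe`, `merge_pair`, `segment`), the four-word Portuguese corpus, and the tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merge operations from a list of words."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs in the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself
        # (an arbitrary but deterministic choice made for this sketch).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Apply the learned merges, in order, to split a word into subword units."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus: Portuguese words sharing the stem "trad".
toy_corpus = ["tradução", "tradutor", "traduzir", "trazer"]
merges = learn_bpe(toy_corpus, num_merges=3)
pieces = segment("traduz", merges)
```

Even with three merges, the unseen word "traduz" is split into a shared stem plus single characters instead of being mapped to an unknown token, which is the property that makes subword vocabularies attractive when training data is scarce.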

Published
29/11/2021
ESTRELLA, Arthur T.; SOUZA FILHO, João B. O. Tackling neural machine translation in low-resource settings: a Portuguese case study. In: SIMPÓSIO BRASILEIRO DE TECNOLOGIA DA INFORMAÇÃO E DA LINGUAGEM HUMANA (STIL), 13., 2021, Online Event. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2021. p. 275-282. DOI: https://doi.org/10.5753/stil.2021.17807.