Large-scale Translation to Enable Response Selection in Low Resource Languages: A COVID-19 Chatbot Experiment

  • Lucas Almeida Aguiar Universidade Estadual do Ceará (UECE)
  • Lívia Almada Cruz Universidade Federal do Ceará (UFC)
  • Ticiana L. Coelho da Silva Universidade Federal do Ceará (UFC)
  • Rafael Augusto Ferreira do Carmo Universidade Federal do Ceará (UFC)
  • Matheus Henrique Esteves Paixao Universidade Estadual do Ceará (UECE)


Natural Language Processing for low-resource languages is challenging: the lack of large-scale datasets degrades the performance of data-hungry algorithms. To overcome this, we employ data augmentation to enlarge the training data for the task of response selection in multi-turn retrieval-based chatbots. We automatically translated a large-scale English dataset into Brazilian Portuguese (PT_BR) and used it to train a deep neural network. For a COVID-19 chatbot system, our results show that training on the translated dataset and then fine-tuning on the context-specific dataset yields the best recall for all studied models. In addition, we make the translated large-scale PT_BR dataset available.

Keywords: natural language processing, automatic translation, multi-turn retrieval-based chatbot, low-resource language


