Proposal and Linguistic Assessment of Data Augmentation Techniques
Abstract
In Natural Language Processing (NLP), data augmentation consists of creating artificial training data for machine learning through textual transformations, with the aim of improving a model's generalization and its performance on a range of downstream NLP tasks. Most studies of data augmentation techniques evaluate them solely by the target-task performance of models trained on the artificially generated texts, without linguistically assessing the quality of those texts. In this study, we propose two data augmentation techniques, evaluate the linguistic quality of the transformed texts, and show that the generated texts are linguistically well formed.
