Proposal and Linguistic Evaluation of Data Augmentation Techniques
Abstract
In Natural Language Processing (NLP), data augmentation consists of artificially creating training data for machine learning models by applying transformations to texts, with the goal of improving the models' ability to generalize and their performance on a variety of NLP tasks. Most studies of data augmentation techniques evaluate them by measuring the performance, directly on the target task, of a model trained on the artificial texts, without assessing the linguistic quality of the texts created. In this study, we propose two data augmentation techniques, evaluate the linguistic quality of the transformed texts, and show that the texts are linguistically well formed.