Análise de Performance dos Modelos Gerais de Aprendizado de Máquina Pré-Treinados: BERT vs DistilBERT

Rafael Silva Barbon; Ademar Takeo Akabane

doi:10.5753/sbrc_estendido.2022.223391

Rafael Silva Barbon PUC-Campinas
Ademar Takeo Akabane PUC-Campinas

DOI: https://doi.org/10.5753/sbrc_estendido.2022.223391

Resumo

Modelos de aprendizado de máquina (AM) vêm sendo amplamente utilizados devido à elevada quantidade de dados produzidos diariamente. Dentre eles, destaca-se os modelos pré-treinados devido a sua eficácia, porém estes normalmente demandam um elevado custo computacional na execução de sua tarefa. A fim de contornar esse problema, técnicas de compressão de redes neurais vem sendo aplicadas para produzir modelos pré-treinados menores sem comprometer a acurácia. Com isso, neste trabalho foram utilizados dois diferentes modelos pré-treinados de AM: BERT e DistilBERT na classificação de texto. Os resultados apontam que modelos menores apresentam bons resultados quando comparados com seus equivalentes maiores.

Referências

Bucilua, C., Caruana, R., and Niculescu-Mizil, A. (2006). Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 535–541.

Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. (2021). Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.

Dale, R. (2021). Gpt-3: What’s it good for? Natural Language Engineering, 27(1):113–118.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Greene, D. and Cunningham, P. (2006). Practical solutions to the problem of diagonal dominance in kernel document clustering. In Proc. 23rd International Conference on Machine learning (ICML’06), pages 377–384. ACM Press.

Hinton, G., Vinyals, O., Dean, J., et al. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2(7).

Luong, M.-T., Pham, H., and Manning, C. D. (2015). Effective approaches to attentionbased neural machine translation. arXiv preprint arXiv:1508.04025.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Refaeilzadeh, P., Tang, L., and Liu, H. (2009). Cross-validation. Encyclopedia of database systems, 5:532–538.

Sanh, V., Debut, L., Chaumond, J., and Wolf, T. (2019). Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

Simaki, V., Paradis, C., Skeppstedt, M., Sahlgren, M., Kucher, K., and Kerren, A. (2020). Annotating speaker stance in discourse: the brexit blog corpus. Corpus Linguistics and Linguistic Theory, 16(2):215–248.

Statista Research Department (2018). Amount of data created, consumed, and stored 2010-2025. https://www.statista.com/statistics/871513/worldwide-data-created/, Último acesso: 08/04/2022.