Detection and Censorship of Offensive Language in Extended Texts in Portuguese
Abstract
This article addresses the detection and censorship of offensive language in long Brazilian Portuguese texts on the web. It proposes a pipeline for classifying and censoring such texts, focusing on comments, posts, and articles, using natural language processing (NLP) techniques. The contributions are an in-depth review of current methods for offensive content classification in Portuguese and the implementation of a BERTimbau-based pipeline for offense detection. The work contributes to the state of the art in Portuguese NLP and is intended to promote safer and more respectful online environments for users, especially children.