Anomaly Detection in Text Data: A Semi-Supervised Approach Applied to the Portuguese Domain

Abstract


Anomaly detection, driven by advancements in machine learning and deep learning, has gained significant importance across various fields. However, its application to unstructured textual data, particularly in Portuguese, remains underexplored. In textual analysis, these techniques are crucial for detecting deviations within text collections. This paper investigates state-of-the-art methods for anomaly detection in Portuguese text corpora and introduces a new, flexible loss function designed to enhance detection across different contamination levels. By evaluating these methods on benchmark datasets, specifically in the contexts of hate speech detection and sentiment analysis, we address existing challenges and contribute to the development of more effective anomaly detection techniques for Portuguese text data.

Keywords: Anomaly detection, Textual anomaly, Transformers, Pre-trained models, Natural Language Processing

Published
17 November 2024
MAIA, Fabio Masaracchia; COSTA, Anna Helena Reali. Anomaly Detection in Text Data: A Semi-Supervised Approach Applied to the Portuguese Domain. In: SIMPÓSIO BRASILEIRO DE TECNOLOGIA DA INFORMAÇÃO E DA LINGUAGEM HUMANA (STIL), 15., 2024, Belém/PA. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 288-293. DOI: https://doi.org/10.5753/stil.2024.245357.