Learning with Few: A Comparative Study of Multilingual Text Anomaly Detection

Fabio Masaracchia Maia; Anna Helena Reali Costa

doi:10.5753/stil.2025.37828

Fabio Masaracchia Maia (USP)
Anna Helena Reali Costa (USP)

Abstract

Detecting anomalies in textual data is a critical task in domains such as content moderation, fraud detection, and risk monitoring. The task remains challenging, however, because of the semantic complexity of language and the scarcity of labeled anomalies in real-world scenarios. This paper presents a benchmark study that integrates three perspectives: representation strategies, learning paradigms, and linguistic diversity. We evaluate unsupervised and semi-supervised models, including deep learning approaches, across datasets in both Portuguese and English. We also assess the impact of sentence embeddings by comparing multilingual encoders with language-specific models. Our findings show that representation choice and limited supervision strongly influence model performance in few-shot settings.
