Characterizing and understanding ensemble-based anomaly-detection

Gustavo de P. Avelar; Guilherme O. Campos; Wagner Meira Jr.

doi:10.5753/kdmile.2021.17473

Gustavo de P. Avelar UFMG http://orcid.org/0000-0002-6208-6159
Guilherme O. Campos UFMG http://orcid.org/0000-0003-1199-9529
Wagner Meira Jr. UFMG http://orcid.org/0000-0002-2614-2723

Abstract

Anomaly Detection (AD) has grown in importance in recent years as a result of the increasing digitalization of services and data storage, and detecting abnormal behavior has become a key task. However, discovering abnormal records mixed into the huge amount of data available is a daunting problem, and the efficacy of current methods depends on a wide range of assumptions. One effective strategy for detecting anomalies is to combine multiple models into so-called "ensembles", but the factors that drive their performance are often hard to identify, which makes their calibration and improvement a challenging task. In this paper, we address these problems with a four-step method for characterizing and understanding the ensemble-based anomaly-detection task. We start by characterizing several datasets and analyzing the factors that make their anomalies hard to detect. We then evaluate to what extent existing algorithms are able to detect anomalies in the same datasets. On the basis of both analyses, we propose a stacking-based ensemble that outperformed a state-of-the-art baseline, Isolation Forest. Finally, we examine the benefits and drawbacks of our proposal.

Keywords: anomaly detection, data mining, ensembles, machine learning, interpretability
