Fast Distance-based Outlier Detection using GPUs

  • Fernando Mussel UFMG
  • Carlos Teixeira UFMG
  • Wagner Meira Jr. UFMG

Resumo


Outlier detection is an important area of data mining with many practical applications, such as credit card and insurance fraud detection, network intrusion detection, etc. Distance-based detection methods, such as ORCA and DIODE, have stood out due to their parametric-free nature and good scalability on large and high dimensional datasets. In this paper we propose, a new parallel algorithm based on ORCA, which is designed to run efficiently on GPUs (Graphical Process Units). Then we discuss the main challenges pertaining its implementation and how we addressed them, in order to take full advantage of the GPU’s parallel hardware. In our experimental analysis, we show that our algorithm not only is able to efficiently prune unnecessary distance computations, but can also achieve up to 162X speedup compared to state-of-the-art anomaly detection algorithms.

Referências

Ahmed Shamsul Arefin, Carlos Riveros, R. B. and Moscato, P. (2012). Gpu-fs-knn: A software tool for fast and scalable knn computation using gpus.

Alshawabkeh, M., Jang, B., and Kaeli, D. (2010). Accelerating the local outlier factor algorithm on a gpu for intrusion detection systems. In Proceedings of the 3rdWorkshop on General-Purpose Computation on Graphics Processing Units, GPGPU ’10, pages 104–110, New York, NY, USA. ACM.

Angiulli, F., Basta, S., Lodi, S., and Sartori, C. (2013). Fast outlier detection using a gpu. In High Performance Computing and Simulation (HPCS), 2013 International Conference on, pages 143–150. IEEE.

Angiulli, F., Basta, S., and Pizzuti, C. (2006). Distance-based detection and prediction of outliers. Knowledge and Data Engineering, IEEE Transactions on, 18(2):145–160.

Bay, S. D. and Schwabacher, M. (2003). Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD’03, pages 29–38, New York, NY, USA. ACM.

Garcia, V., Debreuve, E., Nielsen, F., and Barlaud, M. (2010). K-nearest neighbor search: Fast gpu-based implementations and application to high-dimensional feature matching. In Image Processing (ICIP), 2010 17th IEEE International Conference on, pages 3757–3760. IEEE.

Ghoting, A., Parthasarathy, S., and Otey, M. E. (2008). Fast mining of distance-based outliers in high-dimensional datasets. Data Min. Knowl. Discov., 16(3):349–364.

Gupta, M., Gao, J., Sun, Y., and Han, J. (2012). Integrating community matching and outlier detection for mining evolutionary community outliers. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’12, pages 859–867, New York, NY, USA. ACM.

Li, Y., Zhao, K., Chu, X., and Liu, J. (2010). Speeding up k-means algorithm by gpus. In Computer and Information Technology (CIT), 2010 IEEE 10th International Conference on, pages 115–122.

Orair, G. H., Teixeira, C. H. C., Meira, Jr., W., Wang, Y., and Parthasarathy, S. (2010). Distance-based outlier detection: consolidation and renewed bearing. Proc. VLDB Endow., 3(1-2):1469–1480.

Ramaswamy, S., Rastogi, R., and Shim, K. (2000). Efficient algorithms for mining outliers from large data sets. SIGMOD Rec., 29(2):427–438.

Wasif, M. and Narayanan, P. (2011). Scalable clustering using multiple gpus. In High Performance Computing (HiPC), 2011 18th International Conference on, pages 1–10.

Yankov, D., Keogh, E., and Rebbapragada, U. (2007). Disk aware discord discovery: Finding unusual time series in terabyte sized datasets. In Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on, pages 381–390.
Publicado
20/07/2015
MUSSEL, Fernando; TEIXEIRA, Carlos; MEIRA JR., Wagner. Fast Distance-based Outlier Detection using GPUs. In: CONCURSO DE TRABALHOS DE INICIAÇÃO CIENTÍFICA DA SBC (CTIC-SBC), 34. , 2015, Recife. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2015 . p. 41-50.