A biased sampling method for applying DBSCAN
Abstract
The DBSCAN algorithm is a traditional density-based clustering method. This algorithm allows to identify groups of different topological formats and also to discard isolated noises. The DBSCAN usually presents good results, however, it performs several calculations of distances in the clustering process, causing low efficiency, and its application in large datasets is not recommended. This work presents a new sampling method that makes it possible to apply DBSCAN to a reduced set of examples, in order to approximate the original result of the algorithm applied to the entire dataset. Therefore, the method makes it possible to execute large datasets with results similar to DBSCAN on entire dataset, however, with greater computational efficiency.
References
El-Sonbaty, Yasser, M. A. I. and Farouk, M. (2004). An efficient density based clustering algorithm for large databases. 16th IEEE international conference on tools with artificial intelligence. IEEE.
ESTER, M. e. a. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. Kdd, pages 226–231.
Hartigan, J. A. (1975). Clustering algorithms. John Wiley & Sons, Inc.
HUBERT, Lawrence; ARABIE, P. (1985). Comparing partitions. Journal of classification, 2(1):193–218.
KDD. 2014 sigkdd test of time award. Disponı́vel em http://www.kdd.org/News/ view/2014-sigkdd-test-of-time-award (07/07/2020).
LUCHI, Diego; RODRIGUES, A. L. V. F. M. (2019). Sampling approaches for applying dbscan to large datasets. Pattern Recognition Letters, 117:90–96.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, 1(14):281–297.
Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical association 66.336, pages 846–850.
ROS, Frédéric; GUILLAUME, S. (2016). Dendis: A new density-based sampling for clustering algorithm. Expert Systems with Applications, 56:349–359.
ROS, Frédéric; GUILLAUME, S. (2017). Dides: a fast and effective sampling for clusring algorithm. Knowledge and information systems, 50(2):543–568.
ROS, Frédéric; GUILLAUME, S. (2018). Protras: a probabilistic traversing sampling algorithm. Expert Systems with Applications, 105:65–76.
SIBSON, R. (1973). Slink: an optimally efficient algorithm for the single-link cluster method. The computer journal, 16(1):30–34.
VISWANATH, P.; BABU, V. S. (2009). Rough-dbscan: A fast hybrid density based clustering method for large data sets. Pattern Recognition Letters, 30(16):1477–1488.
Wang, X. and Hamilton, H. J. (2003). Dbrs: a density-based spatial clustering method with random sampling. Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, Berlin, Heidelberg.
Xu R, W. D. (2005). Survey of clustering algorithms. IEEE Transactions on neural networks, 16(3):645–78.
ZHANG, Tian; RAMAKRISHNAN, R. L. M. (1996). Birch: an efficient data clustering method for very large databases. ACM Sigmod Record, 25(2):103–114.
Chen, Yewang, e. a. (2018). A fast clustering algorithm based on pruning unnecessary distance computations in dbscan for high-dimensional data. Pattern Recognition 83, pages 375–387.
El-Sonbaty, Yasser, M. A. I. and Farouk, M. (2004). An efficient density based clustering algorithm for large databases. 16th IEEE international conference on tools with artificial intelligence. IEEE.
ESTER, M. e. a. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. Kdd, pages 226–231.
Hartigan, J. A. (1975). Clustering algorithms. John Wiley & Sons, Inc.
HUBERT, Lawrence; ARABIE, P. (1985). Comparing partitions. Journal of classification, 2(1):193–218.
KDD. 2014 sigkdd test of time award. Disponı́vel em http://www.kdd.org/News/ view/2014-sigkdd-test-of-time-award (07/07/2020).
LUCHI, Diego; RODRIGUES, A. L. V. F. M. (2019). Sampling approaches for applying dbscan to large datasets. Pattern Recognition Letters, 117:90–96.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, 1(14):281–297.
Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical association 66.336, pages 846–850.
ROS, Frédéric; GUILLAUME, S. (2016). Dendis: A new density-based sampling for clustering algorithm. Expert Systems with Applications, 56:349–359.
ROS, Frédéric; GUILLAUME, S. (2017). Dides: a fast and effective sampling for clusring algorithm. Knowledge and information systems, 50(2):543–568.
ROS, Frédéric; GUILLAUME, S. (2018). Protras: a probabilistic traversing sampling algorithm. Expert Systems with Applications, 105:65–76.
SIBSON, R. (1973). Slink: an optimally efficient algorithm for the single-link cluster method. The computer journal, 16(1):30–34.
VISWANATH, P.; BABU, V. S. (2009). Rough-dbscan: A fast hybrid density based clustering method for large data sets. Pattern Recognition Letters, 30(16):1477–1488.
Wang, X. and Hamilton, H. J. (2003). Dbrs: a density-based spatial clustering method with random sampling. Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, Berlin, Heidelberg.
Xu R, W. D. (2005). Survey of clustering algorithms. IEEE Transactions on neural networks, 16(3):645–78.
ZHANG, Tian; RAMAKRISHNAN, R. L. M. (1996). Birch: an efficient data clustering method for very large databases. ACM Sigmod Record, 25(2):103–114.
