Um método de amostragem tendenciosa para aplicação do DBSCAN

Igor Ventorim; Diego Luchi; Flávio Varejão

doi:10.5753/eniac.2020.12129

Igor Ventorim UFES
Diego Luchi UFES
Flávio Varejão UFES

DOI: https://doi.org/10.5753/eniac.2020.12129

Resumo

O algoritmo DBSCAN é um método de agrupamento baseado em densidade. Permite identificar grupos de diferentes formatos topológicos e também descartar ruídos isolados. O DBSCAN normalmente apresenta bons resultados, no entanto, ele realiza diversos cálculos de distâncias no processo de agrupamento, possuindo baixa eficiência em grandes conjuntos de dados. Neste trabalho é apresentado um novo método amostral que possibilita aplicar o DBSCAN em um conjunto reduzido de exemplos, aproximando-se do resultado original do algoritmo aplicado sobre todo o conjunto de dados. Portanto, o método torna possível a execução de grandes conjuntos de dados com resultados semelhantes ao do DBSCAN na base de dados completa, porém, com maior eficiência.

Palavras-chave: Amostragem, análise de agrupamento, aprendizado não super- visionado.

Referências

Chen, Yewang, e. a. (2018). A fast clustering algorithm based on pruning unnecessary distance computations in dbscan for high-dimensional data. Pattern Recognition 83, pages 375–387.

El-Sonbaty, Yasser, M. A. I. and Farouk, M. (2004). An efficient density based clustering algorithm for large databases. 16th IEEE international conference on tools with artificial intelligence. IEEE.

ESTER, M. e. a. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. Kdd, pages 226–231.

Hartigan, J. A. (1975). Clustering algorithms. John Wiley & Sons, Inc.

HUBERT, Lawrence; ARABIE, P. (1985). Comparing partitions. Journal of classification, 2(1):193–218.

KDD. 2014 sigkdd test of time award. Disponı́vel em http://www.kdd.org/News/ view/2014-sigkdd-test-of-time-award (07/07/2020).

LUCHI, Diego; RODRIGUES, A. L. V. F. M. (2019). Sampling approaches for applying dbscan to large datasets. Pattern Recognition Letters, 117:90–96.

MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, 1(14):281–297.

Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical association 66.336, pages 846–850.

ROS, Frédéric; GUILLAUME, S. (2016). Dendis: A new density-based sampling for clustering algorithm. Expert Systems with Applications, 56:349–359.

ROS, Frédéric; GUILLAUME, S. (2017). Dides: a fast and effective sampling for clusring algorithm. Knowledge and information systems, 50(2):543–568.

ROS, Frédéric; GUILLAUME, S. (2018). Protras: a probabilistic traversing sampling algorithm. Expert Systems with Applications, 105:65–76.

SIBSON, R. (1973). Slink: an optimally efficient algorithm for the single-link cluster method. The computer journal, 16(1):30–34.

VISWANATH, P.; BABU, V. S. (2009). Rough-dbscan: A fast hybrid density based clustering method for large data sets. Pattern Recognition Letters, 30(16):1477–1488.

Wang, X. and Hamilton, H. J. (2003). Dbrs: a density-based spatial clustering method with random sampling. Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, Berlin, Heidelberg.

Xu R, W. D. (2005). Survey of clustering algorithms. IEEE Transactions on neural networks, 16(3):645–78.

ZHANG, Tian; RAMAKRISHNAN, R. L. M. (1996). Birch: an efficient data clustering method for very large databases. ACM Sigmod Record, 25(2):103–114.

Chen, Yewang, e. a. (2018). A fast clustering algorithm based on pruning unnecessary distance computations in dbscan for high-dimensional data. Pattern Recognition 83, pages 375–387.

El-Sonbaty, Yasser, M. A. I. and Farouk, M. (2004). An efficient density based clustering algorithm for large databases. 16th IEEE international conference on tools with artificial intelligence. IEEE.

ESTER, M. e. a. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. Kdd, pages 226–231.

Hartigan, J. A. (1975). Clustering algorithms. John Wiley & Sons, Inc.

HUBERT, Lawrence; ARABIE, P. (1985). Comparing partitions. Journal of classification, 2(1):193–218.

KDD. 2014 sigkdd test of time award. Disponı́vel em http://www.kdd.org/News/ view/2014-sigkdd-test-of-time-award (07/07/2020).

LUCHI, Diego; RODRIGUES, A. L. V. F. M. (2019). Sampling approaches for applying dbscan to large datasets. Pattern Recognition Letters, 117:90–96.

MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, 1(14):281–297.

Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical association 66.336, pages 846–850.

ROS, Frédéric; GUILLAUME, S. (2016). Dendis: A new density-based sampling for clustering algorithm. Expert Systems with Applications, 56:349–359.

ROS, Frédéric; GUILLAUME, S. (2017). Dides: a fast and effective sampling for clusring algorithm. Knowledge and information systems, 50(2):543–568.

ROS, Frédéric; GUILLAUME, S. (2018). Protras: a probabilistic traversing sampling algorithm. Expert Systems with Applications, 105:65–76.

SIBSON, R. (1973). Slink: an optimally efficient algorithm for the single-link cluster method. The computer journal, 16(1):30–34.

VISWANATH, P.; BABU, V. S. (2009). Rough-dbscan: A fast hybrid density based clustering method for large data sets. Pattern Recognition Letters, 30(16):1477–1488.

Wang, X. and Hamilton, H. J. (2003). Dbrs: a density-based spatial clustering method with random sampling. Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, Berlin, Heidelberg.

Xu R, W. D. (2005). Survey of clustering algorithms. IEEE Transactions on neural networks, 16(3):645–78.

ZHANG, Tian; RAMAKRISHNAN, R. L. M. (1996). Birch: an efficient data clustering method for very large databases. ACM Sigmod Record, 25(2):103–114.