Efficient Density-Based Models for Multiple Machine Learning Solutions over Large Datasets


Unsupervised and semi-supervised machine learning are highly advantageous in data-intensive applications. Density-based hierarchical clustering uses density functions to obtain a detailed description of the structure of clusters and outliers in a dataset. The hierarchy produced by these algorithms can be derived from a minimum spanning tree whose edges quantify the maximum density required for the connected data to characterize clusters, given a minimum number of objects, MinPts, in a given neighborhood. CORE-SG is a powerful spanning graph from which multiple hierarchical solutions with different densities can be derived, with computational performance far superior to that of its predecessors. However, density-based algorithms rely on pairwise similarity calculations, which gives them an asymptotic complexity of O(n^2) for a dataset of n objects and makes them impractical for large amounts of data. This article makes density-based hierarchical machine learning models feasible in such scenarios by reducing their computational cost with the help of Data Bubbles, focusing on clustering and outlier detection. It presents a study of how data summarization affects the quality of unsupervised models with multiple densities, and of the resulting gain in computational performance. We provide scalability for several machine learning methods based on these models, so that they can handle large volumes of data without a significant loss in the resulting quality, enabling potential new applications such as density-based data stream clustering.
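To make the quadratic cost concrete, the following is a minimal sketch of how a density-based hierarchy is commonly obtained for a single MinPts value. It assumes the standard mutual reachability formulation (the core distance of an object is the distance to its MinPts-th nearest neighbor, counting the object itself) and builds the minimum spanning tree with Prim's algorithm. It is an illustration only, not the CORE-SG or Data Bubbles method described in the article; the function names `core_distances` and `mutual_reachability_mst` are hypothetical.

```python
import math


def core_distances(points, min_pts):
    """Distance from each point to its min_pts-th nearest neighbor.

    Assumption: the point itself counts as its own first neighbor.
    Every point is compared with every other point, hence O(n^2) distances.
    """
    cores = []
    for p in points:
        dists = sorted(math.dist(p, q) for q in points)
        cores.append(dists[min_pts - 1])
    return cores


def mutual_reachability_mst(points, min_pts):
    """Prim's algorithm on the complete mutual reachability graph.

    Returns a list of (parent, child, weight) edges. Lower weights connect
    objects in denser regions; higher weights connect sparse objects.
    """
    n = len(points)
    core = core_distances(points, min_pts)

    def mreach(i, j):
        # Mutual reachability distance between objects i and j.
        return max(core[i], core[j], math.dist(points[i], points[j]))

    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known edge linking each node to the tree
    parent = [-1] * n
    best[0] = 0.0           # start the tree at the first object
    edges = []
    for _ in range(n):
        # Pick the node outside the tree with the cheapest connecting edge.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, best[u]))
        # Relax edges from u to every node still outside the tree.
        for v in range(n):
            if not in_tree[v]:
                w = mreach(u, v)
                if w < best[v]:
                    best[v] = w
                    parent[v] = u
    return edges


if __name__ == "__main__":
    # Four points in a unit square plus one distant point (a likely outlier).
    data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
    for a, b, w in mutual_reachability_mst(data, min_pts=2):
        print(a, b, round(w, 3))
```

In this sketch both the core distances and the tree construction touch every pair of objects, which is the O(n^2) behavior noted above. Summarizing the data into far fewer representatives before this step, as the article does with Data Bubbles, shrinks n and therefore the number of pairs.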

BATISTA, Natanael F. Dacioli; NUNES, Bruno Leonel; NALDI, Murilo Coelho. Efficient Density-Based Models for Multiple Machine Learning Solutions over Large Datasets. In: BRAZILIAN CONFERENCE ON INTELLIGENT SYSTEMS (BRACIS), 12., 2023, Belo Horizonte/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 48-62. ISSN 2643-6264.