Dual-Metric Clustering for Multivariate Time Series: KMeans with DTW and QuadTree with Entropy

  • Samuel R. Torres, Laboratório Nacional de Computação Científica (LNCC)
  • Raphael Saldanha, Institut national de recherche en sciences et technologies du numérique (INRIA)
  • Rocío Zorrilla, Laboratório Nacional de Computação Científica (LNCC)
  • Vitor Ribeiro, Laboratório Nacional de Computação Científica (LNCC)
  • Eduardo H. M. Pena, Universidade Tecnológica Federal do Paraná (UTFPR)
  • Fábio Porto, Laboratório Nacional de Computação Científica (LNCC)

Abstract


The efficacy of machine learning models is contingent on both input data quality and model selection. In this work, we highlight the importance of data quality, particularly in identifying regions of the input space that exhibit similar behavior. Clustering, which groups similar data, is explored for its potential to enhance model performance by identifying these regions. The aim of this paper is to provide insights into the effectiveness of using clustering to improve machine learning model performance.

Keywords: Time-series, clustering, k-means, quadtree, DTW

Published
14/10/2024
TORRES, Samuel R.; SALDANHA, Raphael; ZORRILLA, Rocío; RIBEIRO, Vitor; PENA, Eduardo H. M.; PORTO, Fábio. Dual-Metric Clustering for Multivariate Time Series: KMeans with DTW and QuadTree with Entropy. In: SIMPÓSIO BRASILEIRO DE BANCO DE DADOS (SBBD), 39., 2024, Florianópolis/SC. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 736-742. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2024.243131.