Dual-Metric Clustering for Multivariate Time Series: KMeans with DTW and QuadTree with Entropy

  • Samuel R. Torres, National Laboratory for Scientific Computing (LNCC)
  • Raphael Saldanha, National Institute for Research in Digital Science and Technology (INRIA)
  • Rocío Zorrilla, National Laboratory for Scientific Computing (LNCC)
  • Vitor Ribeiro, National Laboratory for Scientific Computing (LNCC)
  • Eduardo H. M. Pena, Federal University of Technology - Paraná (UTFPR)
  • Fábio Porto, National Laboratory for Scientific Computing (LNCC)

Abstract


The efficacy of machine learning models is contingent on both input data quality and model selection. In this work, we highlight the importance of data quality, particularly the identification of regions within the input space that exhibit similar behavior. Clustering groups similar data, and we explore its potential to enhance model performance by identifying these regions. The aim of this paper is to provide insights into the effectiveness of using clustering to improve machine learning model performance.

Keywords: Time-series, clustering, k-means, quadtree, DTW

Published
2024-10-14
TORRES, Samuel R.; SALDANHA, Raphael; ZORRILLA, Rocío; RIBEIRO, Vitor; PENA, Eduardo H. M.; PORTO, Fábio. Dual-Metric Clustering for Multivariate Time Series: KMeans with DTW and QuadTree with Entropy. In: BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 39., 2024, Florianópolis/SC. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 736-742. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2024.243131.