Subdomain Identification Strategies for Efficient Machine Learning Models
Abstract
The performance of a machine learning model depends on both the quality of its data and the choice of model. This work highlights the importance of identifying regions of the input space that exhibit similar behavior in order to improve predictive accuracy. Clustering techniques can partition multivariate time series (MTS) into meaningful subsets, on which specialized models can then be trained. We employ k-Medoids and quadtree-based clustering, both using dynamic time warping (DTW) as the similarity measure; the quadtree additionally uses entropy as a partitioning criterion. Long Short-Term Memory (LSTM) networks are trained on the resulting clusters and compared with a global model trained on the entire dataset. The results support the subset modeling hypothesis: models trained on clusters can achieve performance comparable to that of the global model. The approach is thus a viable alternative that balances prediction accuracy with computational and interpretability advantages.
Keywords:
multivariate time series, machine learning, subset modeling
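To make the clustering step in the abstract concrete, the following is a minimal sketch of k-medoids over a DTW distance matrix for multivariate series. It is not the authors' implementation: the function names, the Euclidean per-step cost, the farthest-first initialization, and the simple alternating update (rather than PAM) are all assumptions chosen for brevity, and the toy data is synthetic.

```python
import numpy as np


def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """DTW between two series of shape (time, features).

    Assumption: the per-step cost is the Euclidean distance between
    the feature vectors at the two aligned time steps.
    """
    n, m = len(a), len(b)
    # cost[i, j] holds the cheapest alignment of a[:i] with b[:j].
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = step + min(
                cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1]
            )
    return float(cost[n, m])


def pairwise_dtw(series: list) -> np.ndarray:
    """Symmetric matrix of DTW distances between all pairs of series."""
    k = len(series)
    dist = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            dist[i, j] = dist[j, i] = dtw_distance(series[i], series[j])
    return dist


def init_medoids(dist: np.ndarray, k: int) -> np.ndarray:
    """Hypothetical farthest-first initialization (assumes >= k distinct points)."""
    medoids = [int(np.argmin(dist.sum(axis=1)))]
    while len(medoids) < k:
        nearest = dist[:, medoids].min(axis=1)
        medoids.append(int(np.argmax(nearest)))
    return np.array(medoids)


def k_medoids(dist: np.ndarray, k: int, n_iter: int = 100):
    """Alternating k-medoids on a precomputed distance matrix."""
    medoids = init_medoids(dist, k)
    for _ in range(n_iter):
        # Assign every series to its nearest medoid.
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue  # keep the old medoid for an empty cluster
            # New medoid: the member with the smallest total distance
            # to the other members of its cluster.
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(dist[:, medoids], axis=1)
    return medoids, labels


# Toy data: two well-separated groups of 2-feature series.
t = np.linspace(0.0, 2.0 * np.pi, 20)
low = [np.stack([np.sin(t + s), np.cos(t + s)], axis=1) for s in (0.0, 0.1, 0.2)]
high = [x + 5.0 for x in low]
series = low + high

dist = pairwise_dtw(series)
medoids, labels = k_medoids(dist, k=2)
print("medoids:", medoids, "labels:", labels)
```

Each resulting label identifies a subset on which a specialized model could be trained. The quadratic cost of the pairwise DTW matrix is the main expense in this sketch, so larger datasets would likely need a windowed or approximate DTW.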
References
Angelo, A. (2016). A brief introduction to quadtrees and their applications. In Style file from the 28th Canadian Conference on Computational Geometry.
Basu, D. and Sengupta, S. (2015). A novel quad tree based data clustering technique. In 2015 IEEE International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 157–162. IEEE.
Cao, D. and Liu, J. (2016). Research on dynamic time warping multivariate time series similarity matching based on shape feature and inclination angle. Journal of Cloud Computing, 5(1):11.
Finkel, R. and Bentley, J. (1974). Quad trees: A data structure for retrieval on composite keys. Acta Inf., 4:1–9.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.
Instituto Nacional de Meteorologia (2024). Dados históricos - inmet. Accessed: 2024-06.
Kaufman, L. and Rousseeuw, P. J. (2009). Finding groups in data: an introduction to cluster analysis. John Wiley & Sons.
MacQueen, J. et al. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1, pages 281–297. Oakland, CA, USA.
Montero-Manso, P. and Hyndman, R. J. (2021). Principles and algorithms for forecasting groups of time series: Locality and globality. International Journal of Forecasting, 37(4):1632–1653.
Park, H.-S. and Jun, C.-H. (2009). A simple and fast algorithm for k-medoids clustering. Expert systems with applications, 36(2):3336–3341.
Ribeiro, V., Pena, E. H., Saldanha, R., Akbarinia, R., Valduriez, P., Khan, F. A., Stoyanovich, J., and Porto, F. (2023). Subset modelling: A domain partitioning strategy for data-efficient machine-learning. In Anais do XXXVIII Simpósio Brasileiro de Bancos de Dados, pages 318–323. SBC.
Sakoe, H. and Chiba, S. (1978). Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 26(1):43–49.
Singhal, A. and Seborg, D. E. (2005). Clustering multivariate time-series data. Journal of Chemometrics: A Journal of the Chemometrics Society, 19(8):427–438.
Zorrilla Coz, R. M. (2021). A Spatial-Temporal Aware Model Selection for Time Series Analysis. PhD thesis, Laboratório Nacional de Computação Científica, Petrópolis, RJ, Brasil. Thesis for the degree of Doctor of Sciences in Computational Modeling.
Published
2025-09-29
How to Cite
TORRES, Samuel R.; ZORRILLA, Rocio; SALDANHA, Raphael; RIBEIRO, Victor; PENA, Eduardo H. M.; PORTO, Fabio.
Subdomain Identification Strategies for Efficient Machine Learning Models. In: BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 40., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 879-885.
ISSN 2763-8979.
DOI: https://doi.org/10.5753/sbbd.2025.247781.
