Subset Modelling: A Domain Partitioning Strategy for Data-efficient Machine-Learning

  • Vitor Ribeiro Laboratório Nacional de Computação Científica
  • Eduardo H. M. Pena Universidade Tecnológica Federal do Paraná
  • Raphael Saldanha Institut national de recherche en sciences et technologies du numérique
  • Reza Akbarinia Institut national de recherche en sciences et technologies du numérique
  • Patrick Valduriez Institut national de recherche en sciences et technologies du numérique
  • Falaah Arif Khan New York University
  • Julia Stoyanovich New York University
  • Fabio Porto Laboratório Nacional de Computação Científica

Abstract


The success of machine learning (ML) systems depends on data availability, volume, and quality, as well as on efficient computing resources. A challenge in this context is to reduce computational costs while maintaining adequate model accuracy. This paper presents a framework to address this challenge. The idea is to identify “subdomains” within the input space and, instead of training a single global model on the full dataset, train local models that produce better predictions for samples from their specific subdomain. We experimentally evaluate our approach on two real-world datasets. Our results indicate that subset modelling (i) improves predictive performance compared to a single global model and (ii) allows data-efficient training.
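The contrast between a single global model and per-subdomain local models can be illustrated with a minimal sketch. The abstract does not say how subdomains are identified or which model family is used, so everything below is an assumption made for illustration: the synthetic piecewise-linear data, the tiny 1-D k-means partitioner (`kmeans_1d`), the least-squares line fit (`fit_line`), and the choice of two subdomains. It does not reproduce the paper's framework or its results.

```python
import random
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b on 1-D inputs."""
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var if var else 0.0
    return a, my - a * mx

def nearest(centers, x):
    """Index of the subdomain whose center is closest to x."""
    return min(range(len(centers)), key=lambda i: abs(x - centers[i]))

def kmeans_1d(xs, k, iters=20):
    """Tiny 1-D k-means; a stand-in partitioner, not the paper's method."""
    lo, hi = min(xs), max(xs)
    centers = [lo + (i + 0.5) * (hi - lo) / k for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[nearest(centers, x)].append(x)
        # Keep the old center if a group ends up empty.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers

def mse(model_for, xs, ys):
    """Mean squared error, where model_for(x) returns the (a, b) to use at x."""
    errors = []
    for x, y in zip(xs, ys):
        a, b = model_for(x)
        errors.append((a * x + b - y) ** 2)
    return mean(errors)

def make_data(n, rng):
    """Hypothetical data: the input-output relation differs on each side of 0."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [(-x if x < 0 else 2 * x) + rng.gauss(0.0, 0.05) for x in xs]
    return xs, ys

rng = random.Random(0)
train_x, train_y = make_data(400, rng)
test_x, test_y = make_data(200, rng)

# Baseline: one global model trained on the full training set.
global_model = fit_line(train_x, train_y)

# Subset modelling sketch: partition the input space, then fit one model per part.
centers = kmeans_1d(train_x, k=2)
local_models = []
for i in range(len(centers)):
    pts = [(x, y) for x, y in zip(train_x, train_y) if nearest(centers, x) == i]
    # Fall back to the global model if a subdomain has too few samples.
    local_models.append(fit_line(*zip(*pts)) if len(pts) >= 2 else global_model)

global_mse = mse(lambda x: global_model, test_x, test_y)
local_mse = mse(lambda x: local_models[nearest(centers, x)], test_x, test_y)
print(f"global MSE: {global_mse:.4f}  local MSE: {local_mse:.4f}")
```

On this toy data each local model sees only about half of the training samples yet fits its own region more closely than the global line does, which is the intuition behind claims (i) and (ii). Whether the same holds on real data depends on how well the chosen partitioning matches the structure of the input space.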

Keywords: machine learning systems, machine learning models, model training, data selection, model management

Published
25 September 2023
RIBEIRO, Vitor; PENA, Eduardo H. M.; SALDANHA, Raphael; AKBARINIA, Reza; VALDURIEZ, Patrick; KHAN, Falaah Arif; STOYANOVICH, Julia; PORTO, Fabio. Subset Modelling: A Domain Partitioning Strategy for Data-efficient Machine-Learning. In: SIMPÓSIO BRASILEIRO DE BANCO DE DADOS (SBBD), 38., 2023, Belo Horizonte/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 318-323. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2023.232829.