Subset Modelling: A Domain Partitioning Strategy for Data-efficient Machine-Learning

Vitor Ribeiro; Eduardo H. M. Pena; Raphael Saldanha; Reza Akbarinia; Patrick Valduriez; Falaah Arif Khan; Julia Stoyanovich; Fabio Porto

doi:10.5753/sbbd.2023.232829

Vitor Ribeiro, Laboratório Nacional de Computação Científica
Eduardo H. M. Pena, Universidade Tecnológica Federal do Paraná
Raphael Saldanha, Institut national de recherche en sciences et technologies du numérique
Reza Akbarinia, Institut national de recherche en sciences et technologies du numérique
Patrick Valduriez, Institut national de recherche en sciences et technologies du numérique
Falaah Arif Khan, New York University
Julia Stoyanovich, New York University
Fabio Porto, Laboratório Nacional de Computação Científica


Abstract

The success of machine learning (ML) systems depends on data availability, volume, and quality, and on efficient computing resources. A challenge in this context is to reduce computational costs while maintaining adequate model accuracy. This paper presents a framework to address this challenge. The idea is to identify “subdomains” within the input space and train local models that produce better predictions for samples from their specific subdomain, instead of training a single global model on the full dataset. We experimentally evaluate our approach on two real-world datasets. Our results indicate that subset modelling (i) improves predictive performance compared to a single global model and (ii) allows data-efficient training.
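The general idea of routing samples to subdomain-specific models can be illustrated with a minimal sketch. This is not the framework's actual procedure: the abstract does not say how subdomains are identified or which model family is used, so the sketch assumes fixed, hand-picked subdomain centers, nearest-center assignment, and ordinary least-squares local models. All function names and the synthetic data are hypothetical, and the errors are measured on the training data only.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit of y ~ X @ w + b; returns [w..., b]."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    """Apply a fitted linear model to the rows of X."""
    A = np.column_stack([X, np.ones(len(X))])
    return A @ coef

def assign_subdomain(X, centers):
    """Index of the nearest center for each row of X (assumed partitioning rule)."""
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def fit_subset_models(X, y, centers):
    """Train one local model per non-empty subdomain."""
    labels = assign_subdomain(X, centers)
    models = {}
    for k in range(len(centers)):
        mask = labels == k
        if mask.any():
            models[k] = fit_linear(X[mask], y[mask])
    return models

def predict_subset_models(models, global_coef, X, centers):
    """Route each sample to its subdomain's model; fall back to the global model."""
    labels = assign_subdomain(X, centers)
    y_hat = predict_linear(global_coef, X)
    for k, coef in models.items():
        mask = labels == k
        if mask.any():
            y_hat[mask] = predict_linear(coef, X[mask])
    return y_hat

# Synthetic data (an assumption): two regimes with different slopes.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(400, 1))
y = np.where(X[:, 0] < 0, -2.0 * X[:, 0], 3.0 * X[:, 0]) + rng.normal(0, 0.05, 400)
centers = np.array([[-0.5], [0.5]])  # hand-picked subdomain centers

global_coef = fit_linear(X, y)
models = fit_subset_models(X, y, centers)

mse_global = float(np.mean((predict_linear(global_coef, X) - y) ** 2))
mse_local = float(np.mean((predict_subset_models(models, global_coef, X, centers) - y) ** 2))
print(f"global MSE: {mse_global:.4f}, subset-model MSE: {mse_local:.4f}")
```

On data of this kind, where the input space splits into regimes with different behaviour, each local model sees only the samples from its own subdomain, which is the intuition behind both claims in the abstract: better local fit and smaller training sets per model.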

Keywords: machine learning systems, machine learning models, model training, data selection, model management
