Subset Modelling: A Domain Partitioning Strategy for Data-efficient Machine-Learning

  • Vitor Ribeiro Laboratório Nacional de Computação Científica
  • Eduardo H. M. Pena Universidade Tecnológica Federal do Paraná
  • Raphael Saldanha Institut national de recherche en sciences et technologies du numérique
  • Reza Akbarinia Institut national de recherche en sciences et technologies du numérique
  • Patrick Valduriez Institut national de recherche en sciences et technologies du numérique
  • Falaah Arif Khan New York University
  • Julia Stoyanovich New York University
  • Fabio Porto Laboratório Nacional de Computação Científica


The success of machine learning (ML) systems depends on the availability, volume, and quality of data, as well as on efficient computing resources. A challenge in this context is to reduce computational costs while maintaining adequate model accuracy. This paper presents a framework to address this challenge. The idea is to identify “subdomains” within the input space and, instead of training a single global model on the full dataset, train local models that produce better predictions for samples from their own subdomain. We experimentally evaluate our approach on two real-world datasets. Our results indicate that subset modelling (i) improves predictive performance compared to a single global model and (ii) allows data-efficient training.
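The contrast between one global model and several local models can be illustrated with a minimal sketch. It is not the framework evaluated in the paper: the subdomains here are fixed intervals of a one-dimensional input chosen by hand, the models are least-squares lines, the data are synthetic, and every name (`fit_line`, `subdomain_of`, `fit_local_models`, `predict_local`, `boundaries`) is hypothetical.

```python
import random


def fit_line(points):
    """Ordinary least-squares fit of y = a*x + b to a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = cov_xy / var_x if var_x > 0 else 0.0
    return a, mean_y - a * mean_x


def subdomain_of(x, boundaries):
    """Index of the interval containing x, given sorted interval boundaries."""
    return sum(x >= b for b in boundaries)


def fit_local_models(points, boundaries):
    """Group samples by subdomain and fit one line per non-empty subdomain."""
    groups = {}
    for x, y in points:
        groups.setdefault(subdomain_of(x, boundaries), []).append((x, y))
    return {k: fit_line(g) for k, g in groups.items()}


def predict_local(x, models, boundaries, fallback):
    """Predict with the subdomain's model, or the fallback if none was fitted."""
    a, b = models.get(subdomain_of(x, boundaries), fallback)
    return a * x + b


def mse(points, predict):
    """Mean squared error of a prediction function over (x, y) pairs."""
    return sum((predict(x) - y) ** 2 for x, y in points) / len(points)


def target(x):
    """Synthetic ground truth (an assumption): a different regime on each side of 0."""
    return 2.0 * x if x < 0 else 1.0 - 3.0 * x


rng = random.Random(0)


def sample(n):
    """Draw n noisy samples of the synthetic target on [-1, 1]."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, target(x) + rng.gauss(0.0, 0.05)) for x in xs]


train, test = sample(200), sample(200)
boundaries = [0.0]  # hand-picked subdomain boundary, assumed known in this sketch

global_model = fit_line(train)
local_models = fit_local_models(train, boundaries)

global_mse = mse(test, lambda x: global_model[0] * x + global_model[1])
local_mse = mse(
    test, lambda x: predict_local(x, local_models, boundaries, global_model)
)
print(f"global MSE: {global_mse:.4f}, local MSE: {local_mse:.4f}")
```

In this toy setting each local model is fitted on only the samples of its own subdomain, which is where the potential for data-efficient training comes from. How subdomains are identified in practice is the subject of the framework itself and is not reproduced here.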

Keywords: machine learning systems, machine learning models, model training, data selection, model management


