Embedding Representations for AutoML Pipelines

Camila Santana Braz; Matheus Cândido Teixeira; Gisele Lobo Pappa

Camila Santana Braz UFMG
Matheus Cândido Teixeira UFMG
Gisele Lobo Pappa UFMG

Abstract

The area of Automated Machine Learning (AutoML) emerged to automate the tedious process of manually testing different combinations of algorithms, hyperparameters, and other data engineering tasks involved in solving a machine learning (ML) problem. While many researchers focus on developing and refining techniques, there have been few advances in understanding the models and how the optimization works. One way to tackle this problem is to investigate the fitness landscape, which requires calculating distances between solutions in the search space. This is a problem because, in a diverse range of AutoML methods, machine learning pipelines are represented by a tree structure, for which distances are computationally expensive to calculate and do not account for the semantics of the solutions. In this direction, this paper proposes a new way to represent ML pipelines using embeddings. We use a Transformer model to generate embeddings of machine learning pipelines and then evaluate the embeddings using the correlation between the distances calculated under the two representations. We also perform qualitative and visual analyses to compare both representations. Developing this representation allows researchers to improve current AutoML methods by providing a better understanding of how difficult it is to search for ML pipelines.
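The evaluation idea described above, correlating pairwise distances under two representations, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the pipelines are toy lists of step names, a sequence edit distance stands in for a tree-based distance, the embedding vectors are made-up numbers rather than outputs of a Transformer, and Pearson correlation is used because the abstract does not name a specific coefficient.

```python
import math
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between two sequences of pipeline step names.

    Used here only as a simple stand-in for a tree-based distance.
    """
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def euclidean(u, v):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(u, v)))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def correlate_distances(pipelines, embeddings):
    """Correlate pairwise structural distances with pairwise embedding distances."""
    pairs = list(combinations(range(len(pipelines)), 2))
    structural = [edit_distance(pipelines[i], pipelines[j]) for i, j in pairs]
    embedded = [euclidean(embeddings[i], embeddings[j]) for i, j in pairs]
    return pearson(structural, embedded)


# Hypothetical toy pipelines (step names only) and made-up embedding vectors.
pipelines = [
    ["StandardScaler", "PCA", "RandomForest"],
    ["StandardScaler", "PCA", "SVC"],
    ["MinMaxScaler", "SVC"],
    ["RandomForest"],
]
embeddings = [
    [0.9, 0.1, 0.0],
    [0.8, 0.3, 0.1],
    [0.2, 0.7, 0.3],
    [0.5, 0.0, 0.9],
]

r = correlate_distances(pipelines, embeddings)
print(f"Pearson r between the two distance sets: {r:.3f}")
```

A coefficient close to 1 would suggest that the embedding space preserves the neighborhood structure of the original representation. A lower value may indicate either a loss of structure or that the embeddings capture information, such as semantics, that the structural distance ignores.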