Selecting efficient VM types to train deep learning models on Amazon SageMaker

Rafael Keller Tesser; Alvaro Marques; Edson Borin

Rafael Keller Tesser UNICAMP
Alvaro Marques UNICAMP
Edson Borin UNICAMP

Abstract

The cloud has become a popular environment for running Deep Learning (DL) applications. Public cloud providers charge for the amount of time the resources are actually used, with the hourly price depending on the configuration of the chosen cloud instance. Instances are usually provided as a VM that gives access to a certain hardware configuration and may also come with a pre-configured software environment. More advanced, and theoretically faster, VMs are usually more expensive, but they do not necessarily provide the best performance for every application. Therefore, to choose the best instance (or VM type), users must consider the relative performance (and consequent cost) of different VMs when running their specific target application. Taking this into account, we propose a model to estimate the relative performance and cost of training deep learning applications on different VM instances. This model is built upon observations derived from the performance profiles of executions of three different DL applications on 12 different public cloud instances. We argue that this model is a valuable tool for cloud users looking for optimal VM types to train their deep learning applications in the cloud.

Keywords: Deep learning, Training, Cloud computing, Costs, Computational modeling, High performance computing, Conferences, machine learning, performance prediction, cost prediction
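
The cost reasoning in the abstract (billing by time used, at an hourly price that depends on the instance) can be illustrated with a minimal sketch. This is not the model proposed in the paper; the function names, instance names, and all numbers below are hypothetical and chosen only to show that the fastest VM is not necessarily the cheapest one for a given training job.

```python
def training_cost(hours: float, price_per_hour: float) -> float:
    """Cost of one training run when billing is time used times hourly price."""
    return hours * price_per_hour


def cheapest(instances: dict[str, tuple[float, float]]) -> str:
    """Return the name of the instance with the lowest training cost.

    `instances` maps an instance name to (training hours, price per hour).
    """
    return min(instances, key=lambda name: training_cost(*instances[name]))


# Hypothetical values for illustration only (not measurements from the paper).
candidates = {
    "vm_small": (10.0, 1.00),  # slower, cheaper per hour
    "vm_large": (4.0, 3.00),   # faster, more expensive per hour
}

for name, (hours, price) in candidates.items():
    print(f"{name}: {hours} h at {price}/h costs {training_cost(hours, price)}")

print("cheapest:", cheapest(candidates))
```

With these assumed values the faster instance finishes sooner but costs more overall, which is why an estimate of relative performance is needed before picking a VM type.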