Reinforcement Learning with Utility-Based Semantic for Goals

Christian Delgado Polar; Karina Valdivia Delgado; Valdinei Freire

Christian Delgado Polar, USP / San Pablo Catholic University
Karina Valdivia Delgado, USP
Valdinei Freire, USP

Abstract

Stochastic shortest path problems (SSPs) are Markov decision processes with goal states, where the objective is to find a policy that reaches the goal with the lowest possible expected cost. Such problems commonly contain states from which the goal cannot be reached; these states are called dead-ends. In this case, it is important to have methods that consider not only the cost but also the probability of reaching the goal. In Reinforcement Learning (RL), the trade-off between cost to goal and probability to goal has received little attention, and the common strategy for dealing with SSPs with dead-ends is to use discounts. In some works, penalties are applied during exploration in SSPs with dead-ends. However, discounts and penalties can lead to errors in finding a policy with the desired trade-off. The GUBS criterion (Goals with Utility-Based Semantic) handles this type of trade-off without discounts or penalties, has well-founded semantics based on expected utility theory, and has been used to solve SSPs with dead-end states in the planning area. In this work, the GUBS criterion is used to propose the first two RL algorithms that trade off probability to goal against cost to goal without using discounts or penalties: Q-learning-GUBS and Q-learning-eGUBS+Cmax. Theoretical and experimental results show that the proposed algorithms make this trade-off according to the configuration of the GUBS parameters (code available at https://github.com/QlearningGubs/code).