Monte Carlo Tree Search Algorithm for SSPs Under the GUBS Criterion

  • Gabriel Crispino, Universidade de São Paulo
  • Valdinei Freire, Universidade de São Paulo
  • Karina Valdivia Delgado, Universidade de São Paulo

Abstract


In probabilistic planning, a common problem is the Stochastic Shortest Path (SSP) problem in a Markov Decision Process (MDP), where the objective is to find policies that reach a goal with minimum expected accumulated cost. This criterion is not well-defined in states from which no goal can be reached, so extensions and alternative formulations, such as the GUBS criterion, have been proposed for this case. Although algorithms exist to solve SSPs under this criterion, they need to sweep through the entire state space. With that in mind, this work proposes UCT-GUBS: an online planning algorithm based on UCT that solves SSPs under the GUBS criterion. The experimental results show empirically that the algorithm makes the desired trade-off between cost and probability of reaching the goal as its parameters are varied, and that it can make a decision in less or more time depending on the time limit it is given, an important characteristic of UCT.
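UCT, the algorithm UCT-GUBS builds on, selects actions inside the search tree with the UCB1 bandit rule (Auer et al., 2002; Kocsis and Szepesvári, 2006): it balances an action's running average return against an exploration bonus that shrinks as the action is sampled. The sketch below is only an illustration of that selection rule on a toy two-action node; the class, parameter values, and toy returns are assumptions, not code from the paper.

```python
import math

class Node:
    """One decision node of a UCT search tree (illustrative sketch)."""

    def __init__(self, actions):
        self.visits = 0
        # Per-action statistics: sample count n and running mean return q.
        self.stats = {a: {"n": 0, "q": 0.0} for a in actions}

    def select_action(self, c=1.4):
        """UCB1: pick argmax of mean return plus exploration bonus."""
        for a, s in self.stats.items():
            if s["n"] == 0:  # sample every action once before using UCB1
                return a
        return max(
            self.stats,
            key=lambda a: self.stats[a]["q"]
            + c * math.sqrt(math.log(self.visits) / self.stats[a]["n"]),
        )

    def update(self, action, value):
        """Back up a sampled return into the running action average."""
        self.visits += 1
        s = self.stats[action]
        s["n"] += 1
        s["q"] += (value - s["q"]) / s["n"]

# Toy usage: action "a" always returns 1, "b" always returns 0.
# Over repeated simulations, UCB1 concentrates visits on "a" while
# still occasionally re-checking "b".
node = Node(["a", "b"])
for _ in range(200):
    act = node.select_action()
    node.update(act, 1.0 if act == "a" else 0.0)
```

In UCT-GUBS one would expect the backed-up `value` to be the GUBS evaluation of the simulated trajectory (combining accumulated cost and goal attainment) rather than a plain reward, but the selection mechanics above are the generic UCT part.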

Keywords: Probabilistic Planning, Markov Decision Processes, Stochastic Shortest Path, GUBS

References

Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256.

Bellman, R. (1957). Dynamic Programming. Rand Corporation research study. Princeton University Press.

Bertsekas, D. (1995). Dynamic Programming and Optimal Control. Athena Scientific, Belmont, Mass.

Bonet, B. and Geffner, H. (2003). Labeled RTDP: Improving the convergence of real-time dynamic programming. In ICAPS, volume 3, pages 12–21.

Freire, V. and Delgado, K. V. (2017). GUBS: a utility-based semantic for goal-directed Markov decision processes. In Proceedings of the 16th Conference on Autonomous Agents and Multiagent Systems, pages 741–749.

Freire, V., Delgado, K. V., and Reis, W. A. S. (2019). An exact algorithm to make a trade-off between cost and probability in SSPs. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 29, pages 146–154.

Grzes, M., Hoey, J., and Sanner, S. (2014). IPPC discrete track results. In International Conference on Automated Planning and Scheduling. Accessed in July 2020.

Hansen, E. A. and Zilberstein, S. (2001). LAO*: A heuristic search algorithm that finds solutions with loops. Artificial Intelligence, 129(1-2):35–62.

Keller, T. (2018). International planning competition 2018 - probabilistic tracks. In International Conference on Automated Planning and Scheduling. Accessed in July 2020.

Keller, T. and Eyerich, P. (2012). PROST: Probabilistic planning based on UCT. In Proceedings of the Twenty-Second International Conference on Automated Planning and Scheduling, pages 119–127.

Kocsis, L. and Szepesvári, C. (2006). Bandit based Monte-Carlo planning. In European conference on machine learning, pages 282–293. Springer.

Kolobov, A., Weld, D., et al. (2012). A theory of goal-oriented MDPs with dead ends. In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, pages 438–447.

Kolobov, A., Weld, D. S., and Geffner, H. (2011). Heuristic search for generalized stochastic shortest path MDPs. In Proceedings of the Twenty-First International Conference on Automated Planning and Scheduling, pages 130–137.

Puterman, M. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, New York.

Sanner, S. and Yoon, S. (2011). IPPC results presentation. In International Conference on Automated Planning and Scheduling. Accessed in July 2020.

Teichteil-Königsbuch, F. (2012). Stochastic safest and shortest path problems. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, pages 1825–1831.

Published: 2020-10-20

CRISPINO, Gabriel; FREIRE, Valdinei; VALDIVIA DELGADO, Karina. Monte Carlo Tree Search Algorithm for SSPs Under the GUBS Criterion. In: NATIONAL MEETING ON ARTIFICIAL AND COMPUTATIONAL INTELLIGENCE (ENIAC), 17., 2020, Evento Online. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2020. p. 402-413. ISSN 2763-9061. DOI: https://doi.org/10.5753/eniac.2020.12146.