A Monte Carlo Tree Search Algorithm for SSPs under the GUBS Criterion
Abstract
In probabilistic planning, a common problem is the stochastic shortest path (SSP) problem, in which the objective is to find policies that reach a goal state with minimum expected accumulated cost. The presence of states from which no goal can be reached, the dead ends, leaves this criterion undefined at those states. For this reason, extensions and alternative formulations have been proposed to handle this case, such as the GUBS criterion. Although there are algorithms that solve SSPs under this criterion, all of them need to iterate over the complete state space. With this in mind, this work proposes UCT-GUBS, an online planning algorithm based on UCT for solving SSPs under the GUBS criterion. The experimental results show empirically that the algorithm trades off cost against probability of reaching the goal as its parameters vary, and that it can make a decision in less or more time depending on the time limit it is given, an important characteristic inherited from UCT.
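The abstract only names the main ingredients of UCT-GUBS, so the fragment below is a minimal, hypothetical sketch (in Python) of the anytime decision step that a UCT-style planner performs at the current state: candidate actions are sampled according to the UCB1 rule, and when the time limit expires the action with the best estimated return is committed to. The names (uct_decide, simulate, evaluate_return) and the flat, root-level bandit structure are illustrative assumptions and do not reproduce the actual UCT-GUBS implementation; in particular, how a sampled trajectory is scored under the GUBS criterion is abstracted behind evaluate_return.

import math
import time


def uct_decide(state, actions, simulate, evaluate_return, time_limit, c=1.4):
    """Choose an action for `state` within `time_limit` seconds.

    `actions` is a non-empty sequence of applicable actions;
    `simulate(state, action)` samples a trajectory from the model;
    `evaluate_return(trajectory)` maps it to a scalar utility (under GUBS,
    this would combine the accumulated cost and whether a goal was reached).
    """
    counts = {a: 0 for a in actions}    # number of rollouts per action
    values = {a: 0.0 for a in actions}  # running mean return per action
    total = 0
    deadline = time.monotonic() + time_limit

    while time.monotonic() < deadline:
        if total < len(actions):
            action = actions[total]  # try every action once before using UCB1
        else:
            # UCB1: exploit the best mean, but keep exploring rarely tried actions.
            action = max(
                actions,
                key=lambda a: values[a] + c * math.sqrt(math.log(total) / counts[a]),
            )
        ret = evaluate_return(simulate(state, action))  # one Monte Carlo rollout
        counts[action] += 1
        total += 1
        values[action] += (ret - values[action]) / counts[action]

    # Anytime behaviour: whatever the budget, return the best estimate found so far.
    return max(actions, key=lambda a: values[a])

Because the loop simply stops at the deadline and returns the best action found so far, a larger time limit yields a better-informed decision, which is the anytime characteristic of UCT mentioned above.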
References
Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256.
Bellman, R. (1957). Dynamic Programming. Princeton University Press.
Bertsekas, D. (1995). Dynamic Programming and Optimal Control. Athena Scientific, Belmont, Mass.
Bonet, B. and Geffner, H. (2003). Labeled RTDP: Improving the convergence of real-time dynamic programming. In ICAPS, volume 3, pages 12–21.
Freire, V. and Delgado, K. V. (2017). GUBS: a utility-based semantic for goal-directed Markov decision processes. In Proceedings of the 16th Conference on Autonomous Agents and Multiagent Systems, pages 741–749.
Freire, V., Delgado, K. V., and Reis, W. A. S. (2019). An exact algorithm to make a trade-off between cost and probability in SSPs. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 29, pages 146–154.
Grzes, M., Hoey, J., and Sanner, S. (2014). IPPC discrete track results. In International Conference on Automated Planning and Scheduling. Accessed July 2020.
Hansen, E. A. and Zilberstein, S. (2001). LAO*: A heuristic search algorithm that finds solutions with loops. Artificial Intelligence, 129(1-2):35–62.
Keller, T. (2018). International planning competition 2018 - probabilistic tracks. In International Conference on Automated Planning and Scheduling. Accessed July 2020.
Keller, T. and Eyerich, P. (2012). PROST: Probabilistic planning based on UCT. In Proceedings of the Twenty-Second International Conference on Automated Planning and Scheduling, pages 119–127.
Kocsis, L. and Szepesvári, C. (2006). Bandit based Monte-Carlo planning. In European conference on machine learning, pages 282–293. Springer.
Kolobov, A., Weld, D., et al. (2012). A theory of goal-oriented MDPs with dead ends. In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, pages 438–447.
Kolobov, A., Weld, D. S., and Geffner, H. (2011). Heuristic search for generalized stochastic shortest path MDPs. In Proceedings of the Twenty-First International Conference on Automated Planning and Scheduling, pages 130–137.
Puterman, M. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, New York.
Sanner, S. and Yoon, S. (2011). IPPC results presentation. In International Conference on Automated Planning and Scheduling. Accessed July 2020.
Teichteil-Königsbuch, F. (2012). Stochastic safest and shortest path problems. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, pages 1825–1831.