Analysis of the Risk Sensitive Value Iteration Algorithm

  • Igor Oliveira Borges USP
  • Karina Valdivia Delgado USP
  • Valdinei Freire USP

Abstract


This paper presents an empirical study of the Risk Sensitive Value Iteration algorithm proposed by Mihatsch and Neuneier (2002). This approach uses a risk factor that, combined with a discount factor, allows dealing with different types of risk attitude (risk-prone, risk-neutral, or risk-averse). We report experiments in the Crossing the River domain under two different scenarios and analyze the influence of the discount factor and the risk factor on two aspects: the optimal policy and the processing time until convergence. We observed that: (i) the processing cost of extreme risk policies is high for both risk-averse and risk-prone attitudes; (ii) a high discount factor increases the time to convergence and reinforces the chosen risk attitude; and (iii) policies with intermediate risk factor values have a low computational cost and show a certain sensitivity to risk depending on the discount factor.
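Mihatsch and Neuneier's scheme weights positive and negative temporal differences asymmetrically through a risk factor κ ∈ (−1, 1): κ > 0 overweights bad outcomes (risk-averse), κ < 0 overweights good ones (risk-prone), and κ = 0 recovers the risk-neutral case. A minimal sketch of a value-iteration-style variant of this idea is below; the function names, the `(A, S, S)` MDP encoding, and the toy two-state MDP are illustrative assumptions, not the paper's experimental setup or the Crossing the River domain.

```python
import numpy as np

def chi(delta, kappa):
    """Risk transform: scale positive TD terms by (1 - kappa) and
    negative ones by (1 + kappa), so kappa > 0 is risk-averse."""
    return np.where(delta > 0, (1 - kappa) * delta, (1 + kappa) * delta)

def risk_sensitive_vi(P, R, gamma, kappa, alpha=0.1, tol=1e-6, max_iter=100_000):
    """Sketch of risk-sensitive value iteration.
    P: (A, S, S) transition probabilities, R: (A, S, S) rewards."""
    A, S, _ = P.shape
    V = np.zeros(S)
    for _ in range(max_iter):
        # delta[a, s, s'] = r(s, a, s') + gamma * V(s') - V(s)
        delta = R + gamma * V[None, None, :] - V[None, :, None]
        # expected transformed temporal difference per (action, state)
        q = np.sum(P * chi(delta, kappa), axis=2)
        update = alpha * q.max(axis=0)
        V = V + update
        if np.abs(update).max() < tol:
            break
    return V, q.argmax(axis=0)

# Toy 2-state, 2-action MDP: action 0 stays put (reward 0.5 in state 0),
# action 1 moves uniformly between states (reward 1 when leaving state 0).
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.5, 0.5], [0.5, 0.5]]])
R = np.array([[[0.5, 0.0], [0.0, 0.0]],
              [[1.0, 0.0], [0.0, 0.0]]])
V_neutral, policy = risk_sensitive_vi(P, R, gamma=0.9, kappa=0.0)
```

With κ = 0 the transform is the identity, so the fixed point satisfies the standard Bellman optimality equation and the sketch reduces to damped value iteration; varying κ and γ then shifts which policy is preferred, in line with the risk attitudes discussed in the abstract.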

References


[Bellman 1957] Bellman, R. (1957). A Markovian decision process. Indiana Univ. Math. J., 6:679–684.

[Chung and Sobel 1987] Chung, K.-J. and Sobel, M. J. (1987). Discounted mdp’s: distribution functions and exponential utility maximization. SIAM J. Control Optim., 25:49–62.

[Denardo and Rothblum 1979] Denardo, E. V. and Rothblum, U. G. (1979). Optimal stopping, exponential utility, and linear programming. Mathematical Programming, 16(1):228–244.

[Filar et al. 1989] Filar, J. A., Kallenberg, L. C. M., and Lee, H.-M. (1989). Variance-penalized Markov decision processes. Mathematics of Operations Research, 14(1):147–161.

[Filar et al. 1995] Filar, J. A., Krass, D., and Ross, K. W. (1995). Percentile performance criteria for limiting average Markov decision processes. IEEE Transactions on Automatic Control, 40(1):2–10.

[Freire 2016] Freire, V. (2016). The role of discount factor in risk sensitive Markov decision processes. In 2016 5th Brazilian Conference on Intelligent Systems (BRACIS), pages 480–485.

[Freire and Delgado 2017] Freire, V. and Delgado, K. V. (2017). GUBS: a utility-based semantic for Goal-Directed Markov Decision Processes. In Sixteenth International Conference on Autonomous Agents & Multiagent Systems, pages 741–749.

[García and Fernández 2015] García, J. and Fernández, F. (2015). A comprehensive survey on safe reinforcement learning. J. Mach. Learn. Res., 16(1):1437–1480.

[Hou et al. 2014] Hou, P., Yeoh, W., and Varakantham, P. (2014). Revisiting risk-sensitive MDPs: New algorithms and results. In Proceedings of the Twenty-Fourth International Conference on Automated Planning and Scheduling, ICAPS 2014, Portsmouth, New Hampshire, USA, June 21-26, 2014.

[Hou et al. 2016] Hou, P., Yeoh, W., and Varakantham, P. (2016). Solving risk-sensitive POMDPs with and without cost observations. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, pages 3138–3144.

[Howard and Matheson 1972] Howard, R. A. and Matheson, J. E. (1972). Risk-sensitive Markov decision processes. Management science, 18(7):356–369.

[Jaquette 1976] Jaquette, S. C. (1976). A utility criterion for Markov decision processes. Management Science, 23(1):43–49.

[Mihatsch and Neuneier 2002] Mihatsch, O. and Neuneier, R. (2002). Risk-sensitive reinforcement learning. Machine Learning, 49(2):267–290.

[Patek 2001] Patek, S. D. (2001). On terminating Markov decision processes with a risk-averse objective function. Automatica, 37(9):1379–1386.

[Puterman 1994] Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1st edition.

[Rothblum 1984] Rothblum, U. G. (1984). Multiplicative Markov decision chains. Mathematics of Operations Research, 9(1):6–24.

[Shen et al. 2014] Shen, Y., Tobia, M. J., Sommer, T., and Obermayer, K. (2014). Risk-sensitive reinforcement learning. Neural computation, 26(7):1298–1328.

[Sobel 1982] Sobel, M. J. (1982). The variance of discounted Markov decision processes. Journal of Applied Probability, 19(4):794–802.

[Yu et al. 1998] Yu, S. X., Lin, Y., and Yan, P. (1998). Optimization models for the first arrival target distribution function in discrete time. Journal of Mathematical Analysis and Applications, 225(1):193–223.

Published
22/10/2018
BORGES, Igor Oliveira; DELGADO, Karina Valdivia; FREIRE, Valdinei. Analysis of the Risk Sensitive Value Iteration Algorithm. In: ENCONTRO NACIONAL DE INTELIGÊNCIA ARTIFICIAL E COMPUTACIONAL (ENIAC), 15., 2018, São Paulo. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2018. p. 365-376. ISSN 2763-9061. DOI: https://doi.org/10.5753/eniac.2018.4431.
