Analysis of the Risk Sensitive Value Iteration Algorithm
Abstract
This paper presents an empirical study of the Risk Sensitive Value Iteration algorithm proposed by Mihatsch and Neuneier (2002). The approach uses a risk factor to model different risk attitudes (risk-prone, risk-neutral, or risk-averse), combined with a discount factor. We report experiments in the Crossing the River domain under two different scenarios and analyze the influence of the discount factor and the risk factor on two aspects: the optimal policy and the processing time to convergence. We observed that: (i) the processing cost of extreme risk policies is high under both risk-averse and risk-prone attitudes; (ii) a high discount factor increases the time to convergence and reinforces the chosen risk attitude; and (iii) policies with intermediate risk factor values have a low computational cost and exhibit a risk sensitivity that depends on the discount factor.
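For concreteness, the update behind this algorithm can be sketched as follows. This is a minimal illustration, assuming a tabular MDP given as transition and reward arrays; the function name, array shapes, and variable names are our own assumptions, not notation from the paper. Following Mihatsch and Neuneier (2002), temporal differences are reweighted by a risk factor kappa in (-1, 1), and kappa = 0 recovers standard value iteration.

```python
import numpy as np

def risk_sensitive_value_iteration(P, R, gamma, kappa, tol=1e-6):
    """Sketch of risk-sensitive value iteration (Mihatsch and Neuneier, 2002).

    P: transition probabilities, shape (S, A, S); R: rewards, shape (S, A, S);
    gamma in (0, 1): discount factor; kappa in (-1, 1): risk factor, where
    kappa > 0 is risk-averse, kappa < 0 is risk-prone, and kappa = 0 is
    risk-neutral. Shapes and names are illustrative assumptions.
    """
    num_states = P.shape[0]
    V = np.zeros(num_states)
    while True:
        # Temporal difference for every (s, a, s') triple.
        td = R + gamma * V[None, None, :] - V[:, None, None]
        # Risk transform: with kappa > 0, negative surprises are overweighted
        # (pessimism); with kappa < 0, positive surprises are overweighted.
        chi = np.where(td > 0, (1.0 - kappa) * td, (1.0 + kappa) * td)
        # Expected transformed TD per (s, a), then greedy maximization.
        V_new = V + np.max(np.sum(P * chi, axis=2), axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```

Because the transition probabilities sum to one, this update reduces to the ordinary Bellman backup when kappa = 0, while nonzero kappa skews the backup toward pessimistic or optimistic outcomes.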
References
[Bellman 1957] Bellman, R. (1957). A Markovian decision process. Indiana Univ. Math. J., 6:679–684.
[Chung and Sobel 1987] Chung, K.-J. and Sobel, M. J. (1987). Discounted MDPs: Distribution functions and exponential utility maximization. SIAM J. Control Optim., 25:49–62.
[Denardo and Rothblum 1979] Denardo, E. V. and Rothblum, U. G. (1979). Optimal stopping, exponential utility, and linear programming. Mathematical Programming, 16(1):228–244.
[Filar et al. 1989] Filar, J. A., Kallenberg, L. C. M., and Lee, H.-M. (1989). Variance-penalized Markov decision processes. Mathematics of Operations Research, 14(1):147–161.
[Filar et al. 1995] Filar, J. A., Krass, D., and Ross, K. W. (1995). Percentile performance criteria for limiting average Markov decision processes. IEEE Transactions on Automatic Control, 40(1):2–10.
[Freire 2016] Freire, V. (2016). The role of discount factor in risk sensitive Markov decision processes. In 2016 5th Brazilian Conference on Intelligent Systems (BRACIS), pages 480–485.
[Freire and Delgado 2017] Freire, V. and Delgado, K. V. (2017). GUBS: a utility-based semantic for Goal-Directed Markov Decision Processes. In Sixteenth International Conference on Autonomous Agents & Multiagent Systems, pages 741–749.
[García and Fernández 2015] García, J. and Fernández, F. (2015). A comprehensive survey on safe reinforcement learning. J. Mach. Learn. Res., 16(1):1437–1480.
[Hou et al. 2014] Hou, P., Yeoh, W., and Varakantham, P. (2014). Revisiting risk-sensitive MDPs: New algorithms and results. In Proceedings of the Twenty-Fourth International Conference on Automated Planning and Scheduling, ICAPS 2014, Portsmouth, New Hampshire, USA, June 21-26, 2014.
[Hou et al. 2016] Hou, P., Yeoh, W., and Varakantham, P. (2016). Solving risk-sensitive POMDPs with and without cost observations. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, pages 3138–3144.
[Howard and Matheson 1972] Howard, R. A. and Matheson, J. E. (1972). Risk-sensitive Markov decision processes. Management Science, 18(7):356–369.
[Jaquette 1976] Jaquette, S. C. (1976). A utility criterion for Markov decision processes. Management Science, 23(1):43–49.
[Mihatsch and Neuneier 2002] Mihatsch, O. and Neuneier, R. (2002). Risk-sensitive reinforcement learning. Machine Learning, 49(2):267–290.
[Patek 2001] Patek, S. D. (2001). On terminating Markov decision processes with a risk-averse objective function. Automatica, 37(9):1379–1386.
[Puterman 1994] Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1st edition.
[Rothblum 1984] Rothblum, U. G. (1984). Multiplicative Markov decision chains. Mathematics of Operations Research, 9(1):6–24.
[Shen et al. 2014] Shen, Y., Tobia, M. J., Sommer, T., and Obermayer, K. (2014). Risk-sensitive reinforcement learning. Neural Computation, 26(7):1298–1328.
[Sobel 1982] Sobel, M. J. (1982). The variance of discounted Markov decision processes. Journal of Applied Probability, 19(4):794–802.
[Yu et al. 1998] Yu, S. X., Lin, Y., and Yan, P. (1998). Optimization models for the first arrival target distribution function in discrete time. Journal of Mathematical Analysis and Applications, 225(1):193–223.