Exact Algorithm to Evaluate Stationary Policies for CVaR MDP

  • Denis Pais, Universidade de São Paulo
  • Karina Delgado, Universidade de São Paulo
  • Valdinei Freire, Universidade de São Paulo

Abstract


Markov Decision Processes (MDPs) are widely used to model sequential decision problems. The most common performance criterion in MDPs is the minimization of the total expected cost. However, this criterion does not take variation around the mean into account, so very different outcomes can be evaluated equally; MDPs that address this issue are called risk-sensitive MDPs. A special type of risk-sensitive MDP is the CVaR MDP, which is based on the CVaR (Conditional Value-at-Risk) metric commonly used in finance. One algorithm that finds an optimal policy for CVaR MDPs is the Linear Interpolation Value Iteration algorithm, called CVaRVILI. To obtain an approximately optimal policy, CVaRVILI must solve linear programming problems many times, which makes it computationally expensive. In this work, we propose the PECVaR algorithm, which exactly evaluates stationary policies for CVaR MDPs under constant cost and does not need to solve linear programming problems. Although PECVaR cannot be used to improve a policy, we hypothesize that the CVaR value of an arbitrary policy is a better initialization for the CVaRVILI algorithm than the expected value of the same policy. We evaluate this hypothesis empirically, and the results show that PECVaR initialization decreases CVaRVILI's convergence time in most cases.
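To make the CVaR metric mentioned above concrete, the sketch below computes an empirical CVaR as the mean of the worst alpha-fraction of sampled costs. This is one common cost-based convention for illustration only; the paper's formal definition over MDP trajectories may differ in detail, and the function name `empirical_cvar` is hypothetical.

```python
import math

def empirical_cvar(costs, alpha):
    """Mean of the worst alpha-fraction of sampled costs (empirical CVaR).

    Illustrative only: one common convention for costs; the paper's exact
    trajectory-level definition for CVaR MDPs may differ.
    """
    ordered = sorted(costs, reverse=True)  # worst (largest) costs first
    k = max(1, math.ceil(alpha * len(ordered)))  # size of the worst tail
    return sum(ordered[:k]) / k
```

Note that with `alpha = 1.0` the tail covers every sample, so the value reduces to the ordinary expected cost, i.e., the risk-neutral criterion the abstract contrasts with.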

Keywords: MDP, Risk, CVaR, Sequential Decision

Published: 2019-10-15

PAIS, Denis; DELGADO, Karina; FREIRE, Valdinei. Exact Algorithm to Evaluate Stationary Policies for CVaR MDP. In: NATIONAL MEETING ON ARTIFICIAL AND COMPUTATIONAL INTELLIGENCE (ENIAC), 16., 2019, Salvador. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2019. p. 868-879. ISSN 2763-9061. DOI: https://doi.org/10.5753/eniac.2019.9341.