Flexible Deliberation Costs via Inversely Proportional Decay in the Option-Critic Architecture
Abstract
We propose a flexible deliberation cost for the option-critic architecture, defined as a function that is inversely proportional to option duration. This adaptive cost reduces hyperparameter sensitivity, improves the specialization and stability of options, prevents degeneration into primitive actions, and promotes more coherent behavior.
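The abstract's core idea, a deliberation cost that decays inversely with how long an option has run, can be sketched in a few lines. The exact functional form, constant `c`, and how the cost enters the termination update are assumptions here (the paper's precise formulation is not reproduced on this page); the sketch follows the spirit of Harb et al. (2018), where a margin is added to the advantage in the termination gradient, but replaces the fixed margin with one that shrinks as the option persists.

```python
def deliberation_cost(duration: int, c: float = 1.0) -> float:
    """Deliberation cost that decays inversely with option duration.

    Young options (small duration) pay a large termination penalty,
    discouraging degenerate one-step, primitive-action-like options;
    long-running options pay almost nothing, so they remain free to
    terminate when genuinely finished.
    NOTE: the form c / duration and the constant c are illustrative
    assumptions, not necessarily the paper's exact formulation.
    """
    if duration < 1:
        raise ValueError("duration must be at least 1 step")
    return c / duration


def termination_advantage(q_omega: float, v: float,
                          duration: int, c: float = 1.0) -> float:
    """Advantage term used in the option-critic termination gradient,
    with the duration-dependent cost added as a margin (hypothetical
    stand-in for the fixed margin of Harb et al., 2018)."""
    return (q_omega - v) + deliberation_cost(duration, c)
```

Because the penalty vanishes as `duration` grows, a single cost scale `c` behaves reasonably across tasks with very different option lengths, which is the claimed source of reduced hyperparameter sensitivity.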
References
Bacon, P.-L., Harb, J., and Precup, D. (2017). The option-critic architecture. In Proceedings of the AAAI Conference on Artificial Intelligence, 31(1).
Barto, A. G., Sutton, R. S., and Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13(5):834–846.
Harb, J., Bacon, P.-L., Klissarov, M., and Precup, D. (2018). When waiting is not an option: Learning options with a deliberation cost. In Proceedings of the AAAI Conference on Artificial Intelligence, 32(1).
Harutyunyan, A., Dabney, W., Borsa, D., Heess, N., Munos, R., and Precup, D. (2019). The termination critic. In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, pages 2231–2240.
Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. (2018). High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438.
Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press, Cambridge, 2nd edition.
Sutton, R. S., Precup, D., and Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1):181–211.
Szepesvári, C. (2010). Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Springer Cham, 1st edition.
Wawrzyński, P. (2009). A cat-like robot real-time learning to run. In Adaptive and Natural Computing Algorithms, pages 380–390, Berlin. Springer Berlin Heidelberg.
Published
12/11/2025
How to Cite
LEAL, Augusto Antônio Fontanive; MELCHIADES, Mateus Begnini; RAMOS, Gabriel de Oliveira. Flexible Deliberation Costs via Inversely Proportional Decay in the Option-Critic Architecture. In: ESCOLA REGIONAL DE APRENDIZADO DE MÁQUINA E INTELIGÊNCIA ARTIFICIAL DA REGIÃO SUL (ERAMIA-RS), 1., 2025, Porto Alegre/RS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 33-36. DOI: https://doi.org/10.5753/eramiars.2025.16624.