Sample-Efficient Multi-Task and Multi-Objective Reinforcement Learning by Combining Multiple Behaviors

Lucas N. Alegre; Ana L. C. Bazzan; Bruno C. da Silva

doi:10.5753/ctd.2026.19119

Lucas N. Alegre UFRGS https://orcid.org/0000-0001-5465-4390
Ana L. C. Bazzan UFRGS https://orcid.org/0000-0002-2803-9607
Bruno C. da Silva University of Massachusetts https://orcid.org/0000-0002-3708-5728

Abstract

One of the main challenges in artificial intelligence, and in reinforcement learning (RL) in particular, is developing generalist, flexible agents capable of solving multiple tasks, each of which may require the agent to learn a new, specialized behavior. Tackling this challenge requires agents to learn behaviors that may involve optimizing a single objective or trading off multiple conflicting objectives. In this thesis, we study how to design flexible RL agents that can adapt their behavior, in a sample-efficient manner, to solve any given task, where each task is defined by multiple (possibly conflicting) objectives. We introduce new multi-policy methods that enable RL agents to (i) carefully learn multiple behaviors, each specialized in a particular task; and (ii) combine previously learned behaviors to efficiently identify solutions to novel tasks, which, importantly, may require the agent to assign different preferences to each of its new objectives. The methods we introduce have strong theoretical guarantees regarding the optimality of the set of behaviors learned by agents and their ability to solve new tasks in a zero-shot manner, even in the presence of function approximation errors. We evaluate the proposed methods on several challenging multi-task and multi-objective RL problems and show that our algorithms outperform current state-of-the-art methods in domains with both discrete and continuous state and action spaces.
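
As a rough illustration of item (ii), the sketch below shows one well-known way to combine previously learned behaviors for a new preference vector: generalized policy improvement (GPI) over successor features. The abstract does not state which mechanism the thesis uses, so treating GPI as the combination rule is an assumption here, and the function `gpi_action`, the table `psi`, and every number are hypothetical values chosen for the example.

```python
# Hypothetical sketch: zero-shot combination of learned policies via
# generalized policy improvement (GPI) over successor features.
# All names and numbers are illustrative assumptions, not thesis results.

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def gpi_action(psi, state, w):
    """Return (action, value) maximizing max_i psi_i(state, a) . w.

    psi[i][state][a] is the (assumed, already learned) successor-feature
    vector of policy i for action a in `state`; w is the preference
    vector weighting the objectives of the new task.
    """
    n_actions = len(psi[0][state])
    best_action, best_value = None, float("-inf")
    for a in range(n_actions):
        # Value of action a under the best previously learned policy.
        value = max(dot(policy[state][a], w) for policy in psi)
        if value > best_value:
            best_action, best_value = a, value
    return best_action, best_value


# Two learned policies, one state (index 0), two actions, two objectives.
# Policy 0 specialized in objective 0, policy 1 in objective 1.
psi = [
    [[[1.0, 0.0], [0.2, 0.1]]],  # policy 0
    [[[0.0, 0.3], [0.1, 1.0]]],  # policy 1
]

action_a, value_a = gpi_action(psi, 0, [0.9, 0.1])  # mostly objective 0
action_b, value_b = gpi_action(psi, 0, [0.1, 0.9])  # mostly objective 1
print(action_a, action_b)
```

Changing only the preference vector `w` changes the selected action without any further learning, which is the sense of "zero-shot" in this toy setting.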
