Visual Theory of Mind for Human-Agent Collaboration in Smart Environments
Abstract
Smart environments increasingly rely on intelligent agents to support people in daily activities, particularly in assisted living scenarios. For such systems to support collaboration with humans, agents must move beyond reactive perception and develop an explicit understanding of human states, intentions, and needs. This paper presents an approach for human-agent collaboration in smart environments based on Visual Theory of Mind (VToM), enabling an assistive agent to infer human intentions and contextual beliefs from visual inputs. These visual inferences are integrated with additional multimodal signals, including contextual and interaction-based inputs, to support proactive and adaptive assistance. By combining visual reasoning with multimodal information, the proposed approach allows the agent to align its behavior with human goals and ongoing activities, fostering effective collaboration rather than unilateral automation. We discuss the agent architecture, inference mechanisms, and collaborative interaction scenarios in smart homes and assisted living environments, highlighting how VToM contributes to human-centered assistance in shared environments.
Keywords:
Visual Theory of Mind, Human-Agent Collaboration, Smart Environments
Published
08/06/2026
How to Cite
HOFFMANN, Sandy; SARKADI, Stefan; PANISSON, Alison R. Visual Theory of Mind for Human-Agent Collaboration in Smart Environments. In: SIMPÓSIO BRASILEIRO DE SISTEMAS COLABORATIVOS (SBSC), 21., 2026, Porto Alegre/RS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2026. p. 54-67. ISSN 2326-2842. DOI: https://doi.org/10.5753/sbsc.2026.20063.
