Automated Attention Guidance in Virtual Reality Videos
Abstract
Despite their immersive nature, 360° virtual reality (VR) videos often lack effective attention guidance, leading to user disorientation and missed information. This work proposes a novel method that integrates computer vision with natural language processing to automatically guide user attention in 360° VR videos. It leverages natural language roadmaps to identify and track key elements, applying dynamic visual effects to highlight them. A comparative evaluation identified Grounding DINO as a particularly suitable detector, while DAM4SAM and Segment Anything 2 (SAM 2) demonstrated strong tracking performance. Demonstrated on a 360° VR tour, the approach can significantly enhance user experience and comprehension, advancing automated attention guidance for immersive content.
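The abstract outlines a detect-then-track pipeline: a text prompt derived from the tour's natural language roadmap is grounded in the video with an open-vocabulary detector, and the resulting box seeds a video segmenter whose per-frame masks drive the visual effect. The sketch below illustrates that flow under stated assumptions only; it uses the Hugging Face port of Grounding DINO and the sam2 video predictor from facebookresearch/sam2, and the model IDs, file paths, example prompt, and compositing step are placeholders, not the authors' implementation.

```python
# Minimal sketch of a script-driven detect-then-track pipeline.
# Illustration only; checkpoints, paths, and the prompt are assumptions.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection
from sam2.build_sam import build_sam2_video_predictor  # facebookresearch/sam2 package

DETECTOR_ID = "IDEA-Research/grounding-dino-base"  # assumed checkpoint
processor = AutoProcessor.from_pretrained(DETECTOR_ID)
detector = AutoModelForZeroShotObjectDetection.from_pretrained(DETECTOR_ID)

def detect(frame: Image.Image, prompt: str) -> torch.Tensor:
    """Ground a text prompt (an object named in the tour script) and return xyxy boxes."""
    # Grounding DINO expects lowercase phrases terminated by a period.
    inputs = processor(images=frame, text=prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = detector(**inputs)
    results = processor.post_process_grounded_object_detection(
        outputs, inputs.input_ids, target_sizes=[frame.size[::-1]]
    )
    return results[0]["boxes"]

# Seed the tracker with the first detection, then propagate masks through the video.
# Config/checkpoint names and the frame directory are hypothetical.
predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_l.yaml", "checkpoints/sam2.1_hiera_large.pt"
)
state = predictor.init_state(video_path="tour_frames/")  # directory of extracted JPEG frames
first_frame = Image.open("tour_frames/00000.jpg")
box = detect(first_frame, "a bronze statue.")[0]  # prompt taken from the roadmap/script
predictor.add_new_points_or_box(state, frame_idx=0, obj_id=1, box=box.numpy())

masks_by_frame = {}
for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
    # Each per-frame mask can drive the attention cue, e.g. dimming or blurring
    # everything outside the mask before rendering the 360° frame.
    masks_by_frame[frame_idx] = (mask_logits > 0.0).cpu().numpy()
```

DAM4SAM, the other tracker highlighted in the abstract, extends SAM 2 with a distractor-aware memory, so the same seed-and-propagate pattern would apply with a different predictor; the script-to-prompt and detection stages stay unchanged.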
Keywords:
360° Videos, Virtual Reality, Attention Guidance, Deep Learning
References
P. V. S. Silva, A. R. S. Vitoria, D. F. C. Silva, and A. R. G. Filho, "Attention guidance through video script: A case study of object focusing on 360º vr video tours," in Proceedings of the 26th Symposium on Virtual and Augmented Reality, 2024, pp. 247–251.
X. Liu, Q. Xiao, V. Gopalakrishnan, B. Han, F. Qian, and M. Varvello, "360° innovations for panoramic video streaming," in Proceedings of the 16th ACM Workshop on Hot Topics in Networks, 2017, pp. 50–56.
A. MacQuarrie and A. Steed, "Cinematic virtual reality: Evaluating the effect of display type on the viewing experience for panoramic video," in 2017 IEEE Virtual Reality (VR). IEEE, 2017, pp. 45–54.
A. van Hoff, "Virtual reality and the future of immersive entertainment," in Proceedings of the 2017 ACM International Conference on Interactive Experiences for TV and Online Video, 2017, pp. 129–129.
S. Dutta, S. Dixit, and A. Khare, "Examining 360° video tourist experiences and adoption in a developing country," Qualitative Market Research: An International Journal, 2024.
H. Choi and S. Nam, "A study on attention attracting elements of 360-degree videos based on vr eye-tracking system," Multimodal Technologies and Interaction, vol. 6, no. 7, 2022. [Online]. Available: [link]
C. Maranes, D. Gutierrez, and A. Serrano, "Exploring the impact of 360° movie cuts in users' attention," in 2020 IEEE Conference on Virtual Reality and 3D User Interfaces (VR). IEEE, 2020, pp. 73–82.
M. Speicher, C. Rosenberg, D. Degraen, F. Daiber, and A. Kruger, "Exploring visual guidance in 360-degree videos," in Proceedings of the 2019 ACM International Conference on Interactive Experiences for TV and Online Video, 2019, pp. 1–12.
A. Schmitz, A. MacQuarrie, S. Julier, N. Binetti, and A. Steed, "Directing versus attracting attention: Exploring the effectiveness of central and peripheral cues in panoramic videos," in 2020 IEEE Conference on Virtual Reality and 3D User Interfaces (VR). IEEE, 2020, pp. 63–72.
P. V. S. Silva, L. L. Neves, D. F. Silva, R. T. Sousa, and A. R. Galvao Filho, "Focus360: Guiding user attention in immersive videos for vr," in 2025 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW). IEEE, 2025, pp. 1634–1635.
F. Danieau, A. Guillo, and R. Dore, "Attention guidance for immersive video content in head-mounted displays," in 2017 IEEE Virtual Reality (VR), 2017, pp. 205–206.
S. Hillaire, A. Lecuyer, R. Cozot, and G. Casiez, "Depth-of-field blur effects for first-person navigation in virtual environments," in Proceedings of the 2007 ACM Symposium on Virtual Reality Software and Technology, ser. VRST '07. New York, NY, USA: Association for Computing Machinery, 2007, pp. 203–206. [Online]. DOI: 10.1145/1315184.1315223
J. O. Wallgrun, M. M. Bagher, P. Sajjadi, and A. Klippel, "A comparison of visual attention guiding approaches for 360° image-based vr tours," in 2020 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), 2020, pp. 83–91.
J. W. Woodworth, A. Yoshimura, N. G. Lipari, and C. W. Borst, "Design and evaluation of visual cues for restoring and guiding visual attention in eye-tracked vr," in 2023 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW), 2023, pp. 442–450.
S. Hillaire, A. Lecuyer, R. Cozot, and G. Casiez, "Using an eye-tracking system to improve camera motions and depth-of-field blur effects in virtual environments," in 2008 IEEE Virtual Reality Conference, 2008, pp. 47–50.
T. Cheng, L. Song, Y. Ge, W. Liu, X. Wang, and Y. Shan, "Yolo-world: Real-time open-vocabulary object detection," 2024. [Online]. Available: [link]
J. Terven, D.-M. Cordova-Esparza, and J.-A. Romero-Gonzalez, "A comprehensive review of yolo architectures in computer vision: From yolov1 to yolov8 and yolo-nas," Machine Learning and Knowledge Extraction, vol. 5, no. 4, pp. 1680–1716, Nov. 2023. [Online]. DOI: 10.3390/make5040083
N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers," 2020. [Online]. Available: [link]
M. Caron, H. Touvron, I. Misra, H. Jegou, J. Mairal, P. Bojanowski, and A. Joulin, "Emerging properties in self-supervised vision transformers," 2021. [Online]. Available: [link]
S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, C. Li, J. Yang, H. Su, J. Zhu et al., "Grounding dino: Marrying dino with grounded pre-training for open-set object detection," arXiv preprint arXiv:2303.05499, 2023.
M. Minderer, A. Gritsenko, and N. Houlsby, "Scaling open-vocabulary object detection," Advances in Neural Information Processing Systems, vol. 36, pp. 72983–73007, 2023.
N. Ravi, V. Gabeur, Y.-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. Radle, C. Rolland, L. Gustafson et al., "Sam 2: Segment anything in images and videos," arXiv preprint arXiv:2408.00714, 2024.
A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., "Segment anything," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4015–4026.
L. Ke, M. Ye, M. Danelljan, Y.-W. Tai, C.-K. Tang, F. Yu et al., "Segment anything in high quality," Advances in Neural Information Processing Systems, vol. 36, pp. 29914–29934, 2023.
Y. Xiong, C. Zhou, X. Xiang, L. Wu, C. Zhu, Z. Liu, S. Suri, B. Varadarajan, R. Akula, F. Iandola et al., "Efficient track anything," arXiv preprint arXiv:2411.18933, 2024.
J. Videnovic, A. Lukezic, and M. Kristan, "A distractor-aware memory for visual object tracking with sam2," in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 24255–24264.
A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan et al., "The llama 3 herd of models," arXiv preprint arXiv:2407.21783, 2024.
T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick, "Microsoft coco: Common objects in context," in Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V. Springer, 2014, pp. 740–755.
S. Shao, Z. Li, T. Zhang, C. Peng, G. Yu, X. Zhang, J. Li, and J. Sun, "Objects365: A large-scale, high-quality dataset for object detection," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 8430–8439.
A. Kamath, M. Singh, Y. LeCun, G. Synnaeve, I. Misra, and N. Carion, "Mdetr-modulated detection for end-to-end multi-modal understanding," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 1780–1790.
K. He, X. Chen, S. Xie, Y. Li, P. Dollar, and R. Girshick, "Masked autoencoders are scalable vision learners," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16000–16009.
C. Ryali, Y.-T. Hu, D. Bolya, C. Wei, H. Fan, P.-Y. Huang, V. Aggarwal, A. Chowdhury, O. Poursaeed, J. Hoffman et al., "Hiera: A hierarchical vision transformer without the bells-and-whistles," in International Conference on Machine Learning. PMLR, 2023, pp. 29441–29454.
D. Bolya, C. Ryali, J. Hoffman, and C. Feichtenhofer, "Window attention is bugged: how not to interpolate position embeddings," arXiv preprint arXiv:2311.05613, 2023.
Published
30/09/2025
How to Cite
SILVA, Paulo Vitor Santana; NEVES, Lucas Lima; GOIÁS, Rafael Alves; SILVA, Diogo Fernandes Costa; SOUSA, Rafael Teixeira; FILHO, Arlindo Rodrigues Galvão. Automated Attention Guidance in Virtual Reality Videos. In: SIMPÓSIO DE REALIDADE VIRTUAL E AUMENTADA (SVR), 27., 2025, Salvador/BA. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 110-119.
