Attention Guidance through Video Script: A Case Study of Object Focusing on 360º VR Video Tours

Abstract

Within the expansive domain of virtual reality (VR), 360º VR videos immerse viewers in a spherical environment, allowing them to explore and interact with the virtual world from all angles. While this video representation offers unparalleled levels of immersion, it often lacks effective methods for guiding viewers’ attention toward specific elements within the virtual environment. This paper combines the Grounding DINO and Segment Anything (SAM) models to guide attention through object focusing driven by the video script. As a case study, this work conducts experiments on a 360º video tour of the University of Reading. The experimental results show that video scripts can improve the user experience in 360º VR video tours by helping to direct the viewer’s attention.
Keywords: 360º Videos, Attention Guidance, Deep Learning
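
The pipeline the abstract describes (a phrase from the video script drives open-set detection with Grounding DINO, whose box then prompts SAM to segment the object, and a focus effect is applied to the frame) can be sketched with the public groundingdino and segment_anything Python packages. This is a minimal illustration under stated assumptions, not the authors' implementation: the checkpoint and config paths, thresholds, the example script phrase, and the background-blur focus effect are all placeholders chosen for the sketch.

import cv2
import numpy as np
import torch
from torchvision.ops import box_convert
from groundingdino.util.inference import load_model, load_image, predict
from segment_anything import sam_model_registry, SamPredictor

# Load both models (config/checkpoint paths are placeholders).
dino = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# A phrase taken from the tour's narration script drives the detection.
script_phrase = "the clock tower"                   # hypothetical script excerpt
image_source, image = load_image("frame_0001.jpg")  # one extracted video frame

boxes, logits, phrases = predict(
    model=dino, image=image, caption=script_phrase,
    box_threshold=0.35, text_threshold=0.25,
)
if len(boxes) == 0:
    raise SystemExit(f"No object matching '{script_phrase}' detected.")

# Grounding DINO returns normalized cxcywh boxes; SAM expects pixel xyxy.
h, w, _ = image_source.shape
boxes_xyxy = box_convert(boxes * torch.tensor([w, h, w, h]),
                         in_fmt="cxcywh", out_fmt="xyxy").numpy()

# Segment the top-scoring detection, prompting SAM with its box.
predictor.set_image(image_source)
masks, _, _ = predictor.predict(box=boxes_xyxy[0], multimask_output=False)
mask = masks[0].astype(np.uint8)

# One plausible focus effect: keep the object sharp, blur the background.
blurred = cv2.GaussianBlur(image_source, (31, 31), 0)
focused = np.where(mask[..., None] == 1, image_source, blurred)
cv2.imwrite("frame_0001_focused.jpg", cv2.cvtColor(focused, cv2.COLOR_RGB2BGR))

For a full tour, the script phrase would be time-aligned with the narration and the effect applied frame by frame; the paper's actual focus effect on the 360º sphere may differ from the simple background blur shown here.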

References

Haram Choi and Sanghun Nam. 2022. A Study on Attention Attracting Elements of 360-Degree Videos Based on VR Eye-Tracking System. Multimodal Technologies and Interaction, 6(7). DOI: 10.3390/mti6070054

Fabien Danieau, Antoine Guillo, and Renaud Doré. 2017. Attention guidance for immersive video content in head-mounted displays. In 2017 IEEE Virtual Reality (VR). IEEE, 205–206.

Esther Guervós, Jaime Jesús Ruiz Alonso, Pablo Pérez García, Juan Alberto Muñoz, César Díaz Martín, and Narciso García Santos. 2019. Using 360 VR video to improve the learning experience in veterinary medicine university degree.

Romain Christian Herault, Alisa Lincke, Marcelo Milrad, Elin-Sofie Forsgärde, and Carina Elmqvist. 2018. Using 360-degrees interactive videos in patient trauma treatment education: design, development and evaluation aspects. Smart Learning Environments, 5(1), 26.

Sébastien Hillaire, Anatole Lécuyer, Rémi Cozot, and Géry Casiez. 2007. Depth-of-field blur effects for first-person navigation in virtual environments. In Proceedings of the 2007 ACM symposium on Virtual reality software and technology. 203–206.

Sébastien Hillaire, Anatole Lécuyer, Rémi Cozot, and Géry Casiez. 2008. Using an Eye-Tracking System to Improve Camera Motions and Depth-of-Field Blur Effects in Virtual Environments. In 2008 IEEE Virtual Reality Conference. 47–50. DOI: 10.1109/VR.2008.4480749

Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, and Nicolas Carion. 2021. MDETR: Modulated detection for end-to-end multi-modal understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1780–1790.

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. 2023. Segment Anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 4015–4026.

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et al. 2017. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123, 32–73.

Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, and Jianfeng Gao. 2022. Grounded Language-Image Pre-training. arXiv:2112.03857 [cs.CV].

Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. 2015. Microsoft COCO: Common Objects in Context. arXiv:1405.0312 [cs.CV].

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. 2023. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499 (2023).

Xing Liu, Qingyang Xiao, Vijay Gopalakrishnan, Bo Han, Feng Qian, and Matteo Varvello. 2017. 360 innovations for panoramic video streaming. In Proceedings of the 16th ACM Workshop on Hot Topics in Networks. 50–56.

Andrew MacQuarrie and Anthony Steed. 2017. Cinematic virtual reality: Evaluating the effect of display type on the viewing experience for panoramic video. In 2017 IEEE Virtual Reality (VR). IEEE, 45–54.

Carlos Marañes, Diego Gutierrez, and Ana Serrano. 2020. Exploring the impact of 360 movie cuts in users’ attention. In 2020 IEEE Conference on Virtual Reality and 3D User Interfaces (VR). IEEE, 73–82.

Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. 2015. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision. 2641–2649.

Yeshwanth Pulijala, Minhua Ma, and Ashraf Ayoub. 2017. VR surgery: Interactive virtual reality application for training oral and maxillofacial surgeons using oculus rift and leap motion. Serious Games and Edutainment Applications: Volume II, 187–202.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748–8763.

Anastasia Schmitz, Andrew MacQuarrie, Simon Julier, Nicola Binetti, and Anthony Steed. 2020. Directing versus attracting attention: Exploring the effectiveness of central and peripheral cues in panoramic videos. In 2020 IEEE Conference on Virtual Reality and 3D User Interfaces (VR). IEEE, 63–72.

Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. 2019. Objects365: A Large-Scale, High-Quality Dataset for Object Detection. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV). 8429–8438. DOI: 10.1109/ICCV.2019.00852

Marco Speicher, Christoph Rosenberg, Donald Degraen, Florian Daiber, and Antonio Krüger. 2019. Exploring visual guidance in 360-degree videos. In Proceedings of the 2019 ACM International Conference on Interactive Experiences for TV and Online Video. 1–12.

Arthur van Hoff. 2017. Virtual reality and the future of immersive entertainment. In Proceedings of the 2017 ACM International Conference on Interactive Experiences for TV and Online Video. 129.

Jan Oliver Wallgrün, Mahda M. Bagher, Pejman Sajjadi, and Alexander Klippel. 2020. A comparison of visual attention guiding approaches for 360 image-based VR tours. In 2020 IEEE Conference on Virtual Reality and 3D User Interfaces (VR). IEEE, 83–91.

Jason W. Woodworth, Andrew Yoshimura, Nicholas G. Lipari, and Christoph W. Borst. 2023. Design and Evaluation of Visual Cues for Restoring and Guiding Visual Attention in Eye-Tracked VR. In 2023 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW). IEEE, 442–450.

Published
30/09/2024
SILVA, Paulo Vitor Santana; VITÓRIA, Arthur Ricardo Sousa; SILVA, Diogo Fernandes Costa; GALVÃO FILHO, Arlindo Rodrigues. Attention Guidance through Video Script: A Case Study of Object Focusing on 360º VR Video Tours. In: SIMPÓSIO DE REALIDADE VIRTUAL E AUMENTADA (SVR), 26., 2024, Manaus/AM. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 247-251.