Text-driven Video Acceleration

  • Washington L. S. Ramos (UFMG)
  • Leandro Soriano Marcolino (Lancaster University)
  • Erickson R. Nascimento (UFMG)

Abstract


Since the dawn of the digital revolution, the volume of data has grown exponentially, especially in the form of images and videos. Smartphones and wearable devices with large storage and long battery life encourage continuous recording and massive uploads to social media. This rapid increase in visual data, combined with users’ limited time, demands methods that produce shorter videos conveying the same information. Semantic Fast-Forwarding reduces viewing time by adaptively accelerating videos while slowing down in relevant segments. Current methods, however, require predefined visual concepts or user supervision, which is costly and time-consuming. This work explores the use of textual data to create text-driven fast-forwarding methods that generate semantically meaningful videos without explicit user input. Our proposed approaches outperform the baselines, achieving F1 Score improvements of up to 12.8 percentage points over the best competitors. Comprehensive user and ablation studies, together with quantitative and qualitative evaluations, confirm their superiority. Visual results are available at https://youtu.be/cOYqumJQOY and https://youtu.be/u6ODTv7-9C4.
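To make the idea concrete, the sketch below illustrates, in a simplified and hypothetical form that is not the thesis's actual method, how text can drive adaptive frame skipping: each frame is scored by the cosine similarity between its embedding and an embedding of the guiding text, the skip length is made inversely proportional to that score, and the resulting selection can be compared against ground-truth relevant frames via the F1 Score. The embeddings, the `max_skip` and `min_skip` parameters, and the helper names are illustrative assumptions.

```python
import numpy as np

def text_driven_fast_forward(frame_embs, text_emb, max_skip=8, min_skip=1):
    """Adaptively skip frames: small skips where frames match the text,
    large skips elsewhere (illustrative sketch, not the thesis's method)."""
    # Cosine similarity between every frame embedding and the text embedding.
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    relevance = f @ t                                   # shape: (num_frames,)
    # Normalize to [0, 1] and map to per-frame skip lengths:
    # high relevance -> small skip (slow down), low relevance -> large skip.
    rel01 = (relevance - relevance.min()) / (np.ptp(relevance) + 1e-8)
    skips = np.round(max_skip - rel01 * (max_skip - min_skip)).astype(int)

    selected, i = [], 0
    while i < len(frame_embs):
        selected.append(i)
        i += max(int(skips[i]), 1)                      # always advance
    return selected

def f1_score(selected, relevant):
    """F1 Score of the selected frames against ground-truth relevant frames."""
    sel, rel = set(selected), set(relevant)
    tp = len(sel & rel)
    precision = tp / max(len(sel), 1)
    recall = tp / max(len(rel), 1)
    return 2 * precision * recall / max(precision + recall, 1e-8)

# Toy usage with random vectors standing in for real frame/text features.
rng = np.random.default_rng(0)
frames = rng.normal(size=(300, 512))                    # 300 frame embeddings
text = rng.normal(size=512)                             # one sentence embedding
kept = text_driven_fast_forward(frames, text)
print(f"kept {len(kept)} of {len(frames)} frames")
```

Making the skip length shrink where relevance is high mirrors the behavior described above: the output slows down in segments that match the text and accelerates elsewhere.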

Published
30/09/2024
RAMOS, Washington L. S.; MARCOLINO, Leandro Soriano; NASCIMENTO, Erickson R. Text-driven Video Acceleration. In: WORKSHOP DE TESES E DISSERTAÇÕES - CONFERENCE ON GRAPHICS, PATTERNS AND IMAGES (SIBGRAPI), 37., 2024, Manaus/AM. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 35-41. DOI: https://doi.org/10.5753/sibgrapi.est.2024.31642.
