Semantic Hyperlapse: a Sparse Coding-based and Multi-Importance Approach for First-Person Videos
Abstract
The availability of low-cost, high-quality personal wearable cameras, combined with the unlimited storage capacity of video-sharing websites, has sparked a growing interest in First-Person Videos (FPVs). Such videos are usually composed of long, unedited streams captured by a device attached to the user's body, which makes them tedious and visually unpleasant to watch. Consequently, there is a growing need to provide quick access to the information they contain. To address this need, techniques such as Hyperlapse and Semantic Hyperlapse have been developed, which aim, respectively, to create shorter, visually pleasant videos and to emphasize the semantic portions of the video. The state-of-the-art Semantic Hyperlapse method, SSFF, neglects the level of importance of the relevant information by only evaluating whether it is significant or not. Other limitations of SSFF are the number of input parameters, the lack of scalability in the number of visual features used to describe the frames, and the abrupt changes in the speed-up rate between consecutive video segments. In this dissertation, we propose a parameter-free, Sparse Coding-based methodology to adaptively fast-forward First-Person Videos that emphasizes the semantic portions through a multi-importance approach. Experimental evaluations show that the proposed method creates shorter videos that retain more semantic information, exhibit fewer abrupt speed-up transitions, and are more stable than the output of SSFF. Visual results and a graphical explanation of the methodology are available at: https://youtu.be/8uStih8P5-Y.
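To make the core idea concrete, the sketch below illustrates one common formulation of sparse-coding-based frame sampling: select a small "dictionary" of frames whose descriptors jointly reconstruct the feature vectors of all frames, so the kept frames summarize the rest. This is a minimal, hypothetical Python illustration, not the formulation used in the dissertation; the function name select_keyframes, the greedy simultaneous-OMP-style selection, and the random stand-in descriptors are all assumptions made for the example.

    import numpy as np

    def select_keyframes(features, budget):
        # features: (n_frames, dim) array of per-frame descriptors
        # budget:   number of frames to keep in the fast-forwarded video
        X = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-12)
        residual = X.copy()      # portion of the video not yet explained
        selected = []
        for _ in range(budget):
            # correlation of each candidate frame with the current residual
            scores = np.abs(residual @ X.T).sum(axis=0)
            scores[selected] = -np.inf        # never pick a frame twice
            selected.append(int(np.argmax(scores)))
            # re-express every frame using only the selected dictionary
            D = X[selected].T                 # dim x k dictionary of kept frames
            coef, *_ = np.linalg.lstsq(D, X.T, rcond=None)
            residual = X - (D @ coef).T
        return sorted(selected)

    # toy usage: 300 frames with 64-dim stand-in descriptors, keep 30 (~10x speed-up)
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(300, 64))
    print(select_keyframes(frames, 30))

In an adaptive semantic fast-forward setting, the descriptors would come from real visual features rather than random vectors, and the frame budget would vary per video segment according to its semantic importance, which is the role the multi-importance approach plays in the proposed methodology.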