Musical Hyperlapse: A Multimodal Approach to Accelerate First-Person Videos
Abstract
With advances in technology and the growth of social media, recording first-person videos has become a common habit. These videos are usually long and tiring to watch, which creates the need to speed them up. Despite recent progress in fast-forward methods, they do not consider adding background music to the videos, which could make them more enjoyable. This thesis presents a new method that creates accelerated videos and adds background music while preserving the emotion induced by the visual and acoustic modalities. Our approach is based on the automatic recognition of the emotions induced by the music and the video content, combined with an optimization algorithm that maximizes the visual quality of the output video and seeks to match the emotions of the music and the video. Quantitative results show that our method achieves the best performance in matching emotion similarity while maintaining the visual quality of the output video, compared with other methods in the literature. Visual results can be seen at: https://youtu.be/9ykQa9zhcz8.
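To make the combined objective concrete, the sketch below shows one simplified way a frame-selection step balancing emotion similarity and visual quality could look. It is a minimal illustration, not the thesis's actual algorithm: the thesis formulates this as an optimization problem, whereas the sketch uses a greedy pass, and all names (select_frames, emotion_weight) as well as the assumptions that per-frame and per-music-segment emotions arrive as (valence, arousal) pairs and that visual quality is a precomputed per-frame score are hypothetical.

```python
import numpy as np

def select_frames(frame_emotions, music_emotions, quality, speedup,
                  emotion_weight=0.5):
    """Greedy sketch: pick one frame per music segment so that the
    frame's (valence, arousal) emotion is close to the segment's
    emotion, trading off against a per-frame visual-quality score.

    frame_emotions : (N, 2) array of per-frame (valence, arousal)
    music_emotions : (M, 2) array of per-segment (valence, arousal)
    quality        : (N,) array of per-frame quality scores in [0, 1]
    speedup        : target acceleration rate
    """
    selected, pos = [], 0
    for seg in music_emotions:
        # Candidate window: frames reachable at roughly the target rate.
        lo, hi = pos + 1, min(pos + 2 * speedup, len(frame_emotions))
        if lo >= hi:
            break
        window = np.arange(lo, hi)
        # Emotion term: distance to the segment's emotion in the
        # valence-arousal plane, mapped so that closer is better.
        emo_dist = np.linalg.norm(frame_emotions[window] - seg, axis=1)
        emo_score = 1.0 - emo_dist / (emo_dist.max() + 1e-8)
        # Combined objective: emotion match vs. visual quality.
        score = (emotion_weight * emo_score
                 + (1.0 - emotion_weight) * quality[window])
        pos = int(window[int(np.argmax(score))])
        selected.append(pos)
    return selected
```

In this simplified formulation, emotion_weight trades off emotion matching against visual quality; setting it to zero reduces the sketch to a quality-only hyperlapse with no musical influence.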