ABSTRACT
Temporal segmentation of video into semantically coherent scenes is a fundamental step for improving video operations such as browsing, retrieval, and recommendation. The automatic scene segmentation methods available in the literature still fall short, in terms of efficacy, of reasonable practical application requirements. To narrow this gap, this paper presents a new scene segmentation method based on multimodal early fusion, which extends the discriminative capability of the latent semantics of classical single-modal bags-of-features to a multimodal paradigm. The approach was designed to refine the latent semantics of single-modal data by identifying and representing audiovisual patterns while still preserving single-modal visual and aural word patterns. Experiments were performed on a publicly available dataset, where the proposed method achieved higher average values for the FCO metric than previous state-of-the-art approaches.
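To make the idea of early fusion over bags-of-features concrete, the sketch below builds one descriptor per shot by concatenating a visual-word histogram and an aural-word histogram. It is only an illustration under stated assumptions: plain concatenation stands in for the fusion step and is not the paper's actual operator, the features are assumed to be already quantized into word indices, and the names `bag_of_words` and `early_fusion` are hypothetical.

```python
from collections import Counter


def bag_of_words(word_ids, vocab_size):
    """Return an L1-normalized histogram of quantized feature indices for one shot."""
    total = len(word_ids)
    if total == 0:
        # A shot with no features in this modality yields an all-zero histogram.
        return [0.0] * vocab_size
    counts = Counter(word_ids)
    return [counts.get(i, 0) / total for i in range(vocab_size)]


def early_fusion(visual_ids, aural_ids, visual_vocab, aural_vocab):
    """Concatenate the visual and aural histograms into a single shot descriptor.

    Assumption: simple concatenation is used here for illustration only;
    it is not the fusion operator proposed in the paper.
    """
    visual_hist = bag_of_words(visual_ids, visual_vocab)
    aural_hist = bag_of_words(aural_ids, aural_vocab)
    return visual_hist + aural_hist


# One shot with four visual words (vocabulary of 3) and two aural words (vocabulary of 2).
shot = early_fusion([0, 2, 2, 1], [1, 1], visual_vocab=3, aural_vocab=2)
print(shot)
```

Because each modality keeps its own block of the fused vector, the single-modal word patterns remain recoverable, while any later clustering or similarity step sees both modalities at once.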
Index Terms
- Temporal Video Scene Segmentation By Fused Bags-of-Features