A new multimodal deep-learning model to video scene segmentation

  • Tiago H. Trojahn USP-IFSP
  • Rodrigo M. Kishi USP-UFMS
  • Rudinei Goularte USP


The recent development of deep learning techniques, such as convolutional networks, has shed new light on the video (story) scene segmentation problem, bringing the potential to outperform state-of-the-art non-deep learning multimodal approaches. However, one important aspect of multimodality still needs investigation in the context of deep learning: multimodal fusion. Often, features are fed directly to a network, which may be an inadequate way to perform the underlying multimodal fusion. This paper presents an evaluation of early and late approaches to deep learning multimodal fusion. In addition, it proposes a new deep learning model for video scene segmentation, based on the feature extraction capabilities of convolutional networks and a recurrent neural network architecture. The results reopen the early versus late fusion discussion in the context of deep learning. Moreover, they show the proposed model is competitive with state-of-the-art techniques when evaluated on a public documentary video dataset, obtaining up to 64% average FCO while maintaining a lower computational cost than a related convolutional approach.
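To make the early/late distinction discussed in the abstract concrete, the following is a minimal sketch of the two fusion strategies using NumPy. All names, feature dimensions, and score values are illustrative assumptions, not the paper's actual model or data.

```python
import numpy as np

# Hypothetical per-modality features for one video segment
# (dimensions are illustrative, not taken from the paper).
visual_feat = np.random.rand(128)  # e.g., CNN-extracted visual features
audio_feat = np.random.rand(64)    # e.g., audio features

# Early fusion: concatenate modality features into a single vector
# before feeding one joint model.
early_input = np.concatenate([visual_feat, audio_feat])  # shape (192,)

# Late fusion: each modality is classified separately; the per-class
# scores are then combined (here, by simple averaging).
visual_scores = np.array([0.7, 0.3])  # hypothetical boundary/no-boundary scores
audio_scores = np.array([0.4, 0.6])
late_scores = (visual_scores + audio_scores) / 2  # -> [0.55, 0.45]
```

In early fusion a single network must learn cross-modal interactions from the joined vector, while in late fusion each modality's model is trained independently and only their decisions are merged.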
Keywords: Scene segmentation, deep learning, RNN, CNN
How to Cite

TROJAHN, Tiago H.; KISHI, Rodrigo M.; GOULARTE, Rudinei. A new multimodal deep-learning model to video scene segmentation. In: SIMPÓSIO BRASILEIRO DE SISTEMAS MULTIMÍDIA E WEB (WEBMEDIA), 24., 2018, Salvador. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2018. p. 205-212.
