Evaluating Early Fusion Operators at Mid-Level Feature Space

  • Antonio A. R. Beserra USP
  • Rodrigo Mitsuo Kishi UFMS
  • Rudinei Goularte USP

Abstract


Early fusion techniques have been proposed in video analysis tasks as a way to improve efficacy by generating compact data models that preserve the semantic clues present in multimodal data. The first attempts to fuse multimodal data applied fusion operators at the low-level feature space, losing data representativeness. This drove later research to evolve simple operators into complex operations, which became, in general, inseparable from the processing of multimodal semantic clues. In this paper, we investigate the application of early multimodal fusion operators at the mid-level feature space. Five operators (Concatenation, Sum, Gram, Average and Maximum) were employed to fuse mid-level multimodal video features. The fused data derived from each operator were then used as input for two video analysis tasks: Temporal Video Scene Segmentation and Video Classification. For each task, we performed a comparative analysis between the operators and related techniques designed for these tasks that use complex fusion operations. The efficacy reached by the operators was very close to that reached by those techniques, which is strong evidence that working in a more homogeneous feature space can reduce known drawbacks of low-level fusion. In addition, operators make data fusion separable, allowing researchers to keep the focus on developing semantic clue representations.
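The five operators named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each modality (for example, visual and aural) has already been encoded as a mid-level feature vector of the same length, and the function name `fuse` and the toy vectors are hypothetical. The abstract does not define the Gram operator, so the version below (the matrix of pairwise dot products between modality vectors, flattened) is only one plausible reading.

```python
def fuse(features, operator):
    """Fuse equal-length feature vectors (one per modality) into one vector.

    `features` is a list of lists of floats. This helper is hypothetical and
    only illustrates the operator names listed in the abstract.
    """
    if not features or len({len(f) for f in features}) != 1:
        raise ValueError("all modality vectors must have the same length")

    if operator == "concatenation":
        # Keeps every component; output length is modalities * dimensions.
        return [x for f in features for x in f]
    if operator == "sum":
        # Element-wise sum across modalities.
        return [sum(column) for column in zip(*features)]
    if operator == "average":
        # Element-wise mean across modalities.
        return [sum(column) / len(column) for column in zip(*features)]
    if operator == "maximum":
        # Element-wise maximum across modalities.
        return [max(column) for column in zip(*features)]
    if operator == "gram":
        # ASSUMPTION: Gram matrix of the modality vectors (pairwise dot
        # products), flattened row by row. The paper may define it differently.
        return [
            sum(a * b for a, b in zip(u, v))
            for u in features
            for v in features
        ]
    raise ValueError(f"unknown operator: {operator}")


# Toy mid-level vectors for two modalities (made-up values).
visual = [1.0, 2.0, 3.0]
aural = [4.0, 0.0, 6.0]

for name in ("concatenation", "sum", "average", "maximum", "gram"):
    print(name, fuse([visual, aural], name))
```

Under these assumptions, every operator except Concatenation and Gram yields a vector with the same dimensionality as a single modality, which is what makes the fusion step separable from whatever task consumes the fused features.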
Keywords: Video analysis, Fusion operators, Early fusion
Published
30/11/2020
BESERRA, Antonio A. R.; KISHI, Rodrigo Mitsuo; GOULARTE, Rudinei. Evaluating Early Fusion Operators at Mid-Level Feature Space. In: BRAZILIAN SYMPOSIUM ON MULTIMEDIA AND THE WEB (WEBMEDIA), 1., 2020, Online Event. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2020. p. 91-98.
