A mobile device framework for video captioning using multimodal neural networks

  • Rafael J. P. Damaceno USP
  • Roberto M. Cesar Jr. USP


Video captioning is a computer vision task that aims to produce textual descriptions of videos. Numerous strategies and datasets can be used to build models for this task. In this study, we devise a deep learning-based strategy that uses both audio and image content to generate captions on resource-constrained devices. The datasets used include MSR-VTT and TREC-VTT22. We also develop an application for resource-constrained devices that uses the best-performing model from our training process. Both data modalities are combined and processed by the model to generate a description of the captured data. The primary contribution of this work is an end-to-end application that leverages audio and image data and can run on a mobile device to produce descriptions autonomously.


