A mobile device framework for video captioning using multimodal neural networks

  • Rafael J. P. Damaceno USP
  • Roberto M. Cesar Jr. USP


Video captioning is a computer vision task that aims to produce textual descriptions of videos. Many strategies and datasets can be used to build models for this task. In this study, we devise a deep learning-based strategy that uses both audio and image content to generate captions on resource-constrained devices. The datasets used are MSR-VTT and TREC-VTT22. We have developed an application for resource-constrained devices that uses the best-performing model from our training process. Both modalities are combined and processed by the model to generate a description of the captured data. The primary contribution of this work is an end-to-end application that uses audio and image data and can run on a mobile device to produce descriptions autonomously.
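The abstract does not say how the audio and image modalities are combined. As a purely illustrative sketch, assuming each modality has already been encoded into per-frame (or per-window) feature vectors, one common option is to mean-pool each modality and concatenate the results before passing them to a caption decoder. The function names `mean_pool` and `fuse_modalities` and the toy feature values are hypothetical and are not taken from the paper.

```python
from statistics import fmean
from typing import List, Sequence


def mean_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average a sequence of equal-length feature vectors into one vector."""
    if not frames:
        raise ValueError("at least one feature vector is required")
    # zip(*frames) groups the values of each feature dimension together.
    return [fmean(column) for column in zip(*frames)]


def fuse_modalities(
    image_frames: Sequence[Sequence[float]],
    audio_frames: Sequence[Sequence[float]],
) -> List[float]:
    """Fuse by concatenation (an assumption; the paper's method may differ)."""
    return mean_pool(image_frames) + mean_pool(audio_frames)


# Toy inputs: two video frames with 3-dim features, two audio windows with 2-dim features.
image_features = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
audio_features = [[0.5, 1.5], [1.5, 2.5]]

fused = fuse_modalities(image_features, audio_features)
print(fused)  # [2.0, 3.0, 4.0, 1.0, 2.0]
```

In a real pipeline the fused vector would feed a caption decoder. Concatenation keeps the example small, but attention-based fusion is an equally plausible choice.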


How to Cite

DAMACENO, Rafael J. P.; CESAR JR., Roberto M. A mobile device framework for video captioning using multimodal neural networks. In: WORKSHOP DE TRABALHOS EM ANDAMENTO - CONFERENCE ON GRAPHICS, PATTERNS AND IMAGES (SIBGRAPI), 36., 2023, Rio Grande/RS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 90-94. DOI: https://doi.org/10.5753/sibgrapi.est.2023.27457.