Successful YouTube video identification using multimodal deep learning

Lucas de Souza Rodrigues; Kenzo Sakiyama; Leozitor Floro de Souza; Edson Takashi Matsubara; Bruno Nogueira

doi:10.5753/kdmile.2022.227792

Lucas de Souza Rodrigues, Universidade Federal de Mato Grosso do Sul
Kenzo Sakiyama, Universidade de São Paulo
Leozitor Floro de Souza, Universidade de São Paulo
Edson Takashi Matsubara, Universidade Federal de Mato Grosso do Sul
Bruno Nogueira, Universidade Federal de Mato Grosso do Sul

Abstract

A YouTube video offers many kinds of data: text from titles and audio transcriptions, image thumbnails, and counts of likes, dislikes, and views. Despite this variety, most standard deep learning models use only one type of data, and the simultaneous use of multiple data sources for such problems is still rare. To shed light on these problems, we empirically evaluate eight different multimodal fusion operations on embeddings extracted from the image thumbnails and titles of YouTube videos with standard deep learning models: a ResNet-based SE-Net for image feature extraction and BERT for NLP. Experimental results show that simple operations, such as summing or subtracting the embeddings, can improve model accuracy. The multimodal fusion operations achieved 81.3% accuracy on this dataset, outperforming the unimodal models by 3.86% (text) and 5.79% (video).
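
As an illustration of the two fusion operations named above, the following minimal sketch combines an image embedding and a text embedding element-wise by sum and by subtraction. It is not the authors' implementation: the function names, the toy four-dimensional vectors, and the requirement that both embeddings already share the same dimension are assumptions made for the example, and the remaining six operations evaluated in the paper are not shown because the abstract does not name them.

```python
from typing import Callable, Dict, List

Vector = List[float]


def _check_same_length(image_emb: Vector, text_emb: Vector) -> None:
    # Assumption: both embeddings were already projected to a common size.
    if len(image_emb) != len(text_emb):
        raise ValueError("embeddings must have the same dimension")


def fuse_sum(image_emb: Vector, text_emb: Vector) -> Vector:
    """Element-wise sum of the two embeddings."""
    _check_same_length(image_emb, text_emb)
    return [i + t for i, t in zip(image_emb, text_emb)]


def fuse_subtract(image_emb: Vector, text_emb: Vector) -> Vector:
    """Element-wise difference (image minus text) of the two embeddings."""
    _check_same_length(image_emb, text_emb)
    return [i - t for i, t in zip(image_emb, text_emb)]


# Hypothetical registry so each operation can be evaluated in a loop.
FUSIONS: Dict[str, Callable[[Vector, Vector], Vector]] = {
    "sum": fuse_sum,
    "subtract": fuse_subtract,
}

# Toy stand-ins for a thumbnail embedding and a title embedding.
image_emb = [0.5, -1.0, 2.0, 0.0]
text_emb = [1.5, 1.0, -2.0, 0.25]

fused = {name: op(image_emb, text_emb) for name, op in FUSIONS.items()}
for name, vector in fused.items():
    print(name, vector)
```

In a full pipeline, the fused vector would be passed to a classification head; real embeddings would typically be tensors from a deep learning framework rather than Python lists.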

Keywords: multimodal, fusion, deep learning
