Emotion Recognition in Videos: A Comparative Analysis of LSTM, CNNs, YOLO, and Vision Transformers on CMU-MOSEI

doi:10.5753/latinoware.2025.16446

Daniel Casanova, UTFPR
Pedro Luiz de Paula Filho, UTFPR
Kelyn Schenatto, UTFPR
Alessandra B. G. Hoffmann, UTFPR

Abstract

This work presents a comparative study of deep learning architectures applied to video analysis on the CMU-MOSEI dataset. Models that explicitly capture temporal dependencies, such as Long Short-Term Memory (LSTM), were compared against frame-based approaches: ResNet50, the Vision Transformer (ViT), and YOLOv11 adapted for classification. The experiments used accuracy and precision as metrics for a systematic performance analysis. The results show that ViT achieved the best performance (78%), while YOLOv11 and ResNet50 stood out as competitive alternatives in terms of efficiency and stability. The LSTM model, in contrast, had the worst result (53%), indicating that explicit temporal modeling was less effective than frame-aggregation strategies on this dataset. These findings highlight the trade-offs among precision, efficiency, and applicability in real-world video-based emotion recognition systems.

Keywords: Video analysis, Emotion recognition, Vision Transformer
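
The abstract contrasts explicit temporal modeling with frame-aggregation strategies and reports accuracy and precision, but it does not specify how frame predictions were combined or how precision was averaged. The sketch below is therefore only an illustration under stated assumptions: per-frame class probabilities are averaged to obtain one label per video, and precision is macro-averaged over classes. The function names are hypothetical and are not taken from the paper.

```python
def aggregate_mean(frame_probs):
    """Average per-frame class probabilities and return the winning class index.

    Assumption: mean pooling is one possible frame-aggregation strategy;
    the paper's abstract does not state which strategy was used.
    """
    n_frames = len(frame_probs)
    n_classes = len(frame_probs[0])
    means = [sum(p[c] for p in frame_probs) / n_frames for c in range(n_classes)]
    return max(range(n_classes), key=means.__getitem__)


def accuracy(y_true, y_pred):
    """Fraction of videos whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_precision(y_true, y_pred, n_classes):
    """Unweighted mean of per-class precision (assumed averaging scheme).

    A class that is never predicted contributes a precision of 0.0.
    """
    per_class = []
    for c in range(n_classes):
        predicted = sum(p == c for p in y_pred)
        correct = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        per_class.append(correct / predicted if predicted else 0.0)
    return sum(per_class) / n_classes


# Toy example: one video with three frames and two emotion classes.
video_label = aggregate_mean([[0.6, 0.4], [0.2, 0.8], [0.3, 0.7]])

# Toy evaluation over four videos.
acc = accuracy([0, 1, 1, 0], [0, 1, 0, 0])
prec = macro_precision([0, 1, 1, 0], [0, 1, 0, 0], n_classes=2)
```

Mean pooling ignores frame order entirely, which is what distinguishes this kind of aggregation from a recurrent model such as an LSTM that consumes the frames as a sequence.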
