CV-C3D: Action Recognition on Compressed Videos with Convolutional 3D Networks

Samuel Felipe dos Santos; Nicu Sebe; Jurandy Almeida

doi:10.5753/sibgrapi.2019.9782

Samuel Felipe dos Santos (UNIFESP)
Nicu Sebe (University of Trento)
Jurandy Almeida (UNIFESP)

Abstract

Action recognition in videos has gained substantial attention from the computer vision community due to its wide range of possible applications. Recent works have addressed this problem with deep learning methods. The main limitation of existing approaches is their difficulty in learning temporal dynamics, owing to the high computational load of processing the huge amounts of data required to train a model. To overcome this problem, we propose a Compressed Video Convolutional 3D network (CV-C3D). It exploits information from the compressed representation of a video in order to avoid the high computational cost of fully decoding the video stream. The resulting speed-up enables our network to use 3D convolutions to capture temporal context efficiently. Our network has the lowest computational complexity among all the compared approaches. On two public action recognition benchmarks, UCF-101 and HMDB-51, our approach achieved results comparable to the baselines while running at a faster inference speed.
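To make the idea of a 3D convolution concrete, the following is a minimal pure-Python sketch of how a kernel that spans time as well as space responds to change across frames. It is an illustration only and not the CV-C3D network: the function name `conv3d_valid`, the toy clip, and the temporal-difference kernel are all assumptions introduced here, and real models use many channels, learned kernels, and an optimized library.

```python
def conv3d_valid(clip, kernel):
    """Naive single-channel 3D convolution (cross-correlation, no padding, stride 1).

    clip:   nested list indexed [t][y][x] (time, height, width)
    kernel: nested list indexed [dt][dy][dx]
    Hypothetical helper for illustration; not taken from the paper.
    """
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                # The kernel slides over time (dt) as well as space (dy, dx).
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += clip[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Toy clip (assumed data): 4 frames of 3x3 pixels; every pixel of frame t has value t.
clip = [[[float(t)] * 3 for _ in range(3)] for t in range(4)]

# Temporal-difference kernel of size 2x1x1: next frame minus current frame.
kernel = [[[-1.0]], [[1.0]]]

response = conv3d_valid(clip, kernel)
```

Because the kernel extends over two consecutive frames, each output value measures how a pixel changes from one frame to the next, which a purely 2D (per-frame) convolution cannot express.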

Keywords: Computer Vision, Action Recognition, Deep Learning, Compressed Domain, Efficiency
