Visual Rhythm-based Convolutional Neural Networks and Adaptive Fusion for a Multi-stream Architecture Applied to Human Action Recognition

Helena de Almeida Maia; Marcelo Bernardes Vieira; Helio Pedrini

doi:10.5753/sibgrapi.est.2021.20016

Helena de Almeida Maia UNICAMP
Marcelo Bernardes Vieira UFJF
Helio Pedrini UNICAMP


Abstract

In this work, we address the problem of human action recognition in videos. We propose and analyze a multi-stream architecture composed of image-based networks pre-trained on the large ImageNet dataset. Different image representations are extracted from the videos to feed the streams and provide complementary information to the system. We propose new streams based on the visual rhythm, which encodes longer-term information than still frames and optical flow. Our main contribution is a stream based on a new variant of the visual rhythm, called Learnable Visual Rhythm (LVR), formed by the outputs of a deep network. The features are collected at multiple depths to enable the analysis of different abstraction levels. This strategy significantly outperforms the handcrafted version on the UCF101 and HMDB51 datasets. We also investigate many combinations of the streams to identify the modalities that best complement each other. Experiments conducted on the two datasets show that our multi-stream network achieves competitive results compared to state-of-the-art approaches.
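To make the two central ideas more concrete, the following is a minimal sketch, not the implementation evaluated in this work. It assumes the commonly used handcrafted definition of a visual rhythm, in which one 1D slice (here, a single pixel row) is taken from every frame and the slices are stacked along time into a 2D image. It also assumes a plain weighted average of per-stream class scores as a stand-in for fusion; the adaptive fusion of the architecture is not described in the abstract and is not reproduced here. The function names, the choice of the middle row, and the toy weights are all hypothetical.

```python
from typing import Optional, Sequence

import numpy as np


def horizontal_visual_rhythm(video: np.ndarray, row: Optional[int] = None) -> np.ndarray:
    """Stack one pixel row from every frame into a single 2D image.

    video: grayscale frames with shape (T, H, W).
    Returns an array of shape (W, T); column t holds the chosen row of frame t,
    so the horizontal axis of the result is time.
    """
    if video.ndim != 3:
        raise ValueError("expected a grayscale video with shape (T, H, W)")
    _, height, _ = video.shape
    if row is None:
        row = height // 2  # assumption: sample the middle row of each frame
    return video[:, row, :].T


def weighted_late_fusion(stream_scores: Sequence[np.ndarray],
                         weights: Sequence[float]) -> np.ndarray:
    """Combine per-stream class scores with a normalized weighted average.

    stream_scores: one score vector of shape (C,) per stream.
    weights: one non-negative weight per stream (hypothetical, fixed values).
    """
    scores = np.stack(stream_scores)           # shape (S, C)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                            # normalize so the weights sum to 1
    return np.tensordot(w, scores, axes=1)     # shape (C,)


if __name__ == "__main__":
    # Toy video: 4 frames of 3x5 pixels.
    video = np.arange(4 * 3 * 5).reshape(4, 3, 5)
    rhythm = horizontal_visual_rhythm(video, row=1)
    print(rhythm.shape)

    # Toy scores from two streams over two classes.
    fused = weighted_late_fusion(
        [np.array([0.2, 0.8]), np.array([0.6, 0.4])], weights=[1.0, 1.0]
    )
    print(int(np.argmax(fused)))
```

Because the resulting image spans every frame of the clip, a single 2D network input can summarize the whole video, which is the sense in which the visual rhythm carries longer-term information than a still frame or a short optical flow stack. The LVR variant replaces the handcrafted slices with outputs of a deep network and is not shown in this sketch.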

