Motion-Based Representations For Activity Recognition
Abstract: This work addresses the activity recognition problem. We propose two motion-based representations for activity recognition. The first is a novel temporal stream for two-stream Convolutional Neural Networks (CNNs) that receives as input images computed from the optical flow magnitude and orientation, allowing the network to learn motion in a richer manner. The method applies simple non-linear transformations to the vertical and horizontal components of the optical flow to generate the input images for the temporal stream. The second is a novel skeleton image representation to be used as input to CNNs. This approach encodes temporal dynamics by explicitly computing the magnitude and orientation values of the skeleton joints. Experiments carried out on challenging, well-known activity recognition datasets (UCF101, NTU RGB+D 60, and NTU RGB+D 120) demonstrate that the proposed representations achieve state-of-the-art results, indicating the suitability of our approaches as video representations.
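As an illustration of the general idea behind the temporal-stream input, the sketch below converts the horizontal and vertical optical flow components into magnitude and orientation images suitable for feeding a CNN. This is a minimal, hypothetical encoding using NumPy; the specific non-linear transformations proposed in this work differ and are detailed in the paper.

```python
import numpy as np

def flow_to_mag_ori_images(flow_x, flow_y):
    """Encode a dense optical flow field as two 8-bit images:
    one for magnitude and one for orientation.
    NOTE: illustrative sketch only; not the paper's exact transformation."""
    # Per-pixel magnitude and orientation of the flow vectors.
    magnitude = np.sqrt(flow_x ** 2 + flow_y ** 2)
    orientation = np.arctan2(flow_y, flow_x)  # radians in [-pi, pi]

    # Rescale both maps to the [0, 255] range expected as CNN input.
    mag_img = np.uint8(255 * magnitude / max(magnitude.max(), 1e-8))
    ori_img = np.uint8(255 * (orientation + np.pi) / (2 * np.pi))
    return mag_img, ori_img

# Toy flow field: uniform motion to the right (dx = 1, dy = 0).
fx = np.ones((4, 4), dtype=np.float32)
fy = np.zeros((4, 4), dtype=np.float32)
mag, ori = flow_to_mag_ori_images(fx, fy)
```

In a two-stream setup, images like these (stacked over several consecutive frames) would replace the raw stacked flow components as the temporal-stream input.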