Multi-Loss Recurrent Residual Networks for Gesture Detection and Recognition
Communication through gestures plays an important role in human life, providing a non-verbal language to convey information among individuals. To recognize gestures, computers must mathematically represent and interpret human appearance and motion, involving hands, arms, face, head and/or body. Despite their high applicability in different contexts, most gesture recognition approaches in the literature are not designed to deal with unsegmented videos. That is, most approaches do not temporally detect when a gesture occurs, which prevents them from exploiting correlations between the detection and recognition tasks and hinders their application to real-world scenarios. In this sense, we propose the Multi-Loss Recurrent Residual Network (MLRRN), a multi-task approach that performs both the recognition and the temporal detection of gestures at once. It employs a dual loss function that takes into account the assignment of each video frame to a gesture class and also determines the frame interval associated with each gesture. Our model has a dual input, gathering information from appearance and human pose on frames, in addition to bidirectional recurrent layers and residual modules. According to experiments conducted on the ChaLearn Montalbano and ChaLearn ConGD datasets, our approach achieves results comparable to state-of-the-art methods in terms of the average temporal Jaccard metric.
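The average temporal Jaccard metric mentioned above measures the frame-level overlap between predicted and ground-truth gesture intervals. A minimal sketch of that computation is shown below; the function names are illustrative, and the full ChaLearn evaluation protocol additionally checks class labels and penalizes unmatched predictions.

```python
def temporal_jaccard(pred_interval, gt_interval):
    """Jaccard index between two frame intervals given as (start, end), inclusive.

    Computed as |intersection| / |union| over the sets of frame indices.
    """
    pred = set(range(pred_interval[0], pred_interval[1] + 1))
    gt = set(range(gt_interval[0], gt_interval[1] + 1))
    union = pred | gt
    if not union:
        return 0.0
    return len(pred & gt) / len(union)


def mean_jaccard(pred_intervals, gt_intervals):
    """Average temporal Jaccard over matched gesture instances."""
    scores = [temporal_jaccard(p, g) for p, g in zip(pred_intervals, gt_intervals)]
    return sum(scores) / len(scores) if scores else 0.0
```

For example, a prediction spanning frames 0-9 against a ground truth spanning frames 5-14 overlaps on 5 of 15 frames, yielding a Jaccard index of 1/3; a perfectly localized gesture scores 1.0.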
V. Pavlovic, R. Sharma, T. Huang, "Visual interpretation of hand gestures for human-computer interaction: A review", IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 7, pp. 677-695, 1997.
S. Xu, Y. Xue, "A long term memory recognition framework on multi-complexity motion gestures", ICDAR, pp. 201-205, 2017.
H. Zhou, Q. Ruan, "A real-time gesture recognition algorithm on video surveillance", 8th International Conference on Signal Processing, vol. 3, no. 02, 2006.
P. Molchanov, X. Yang, S. Gupta, K. Kim, S. Tyree, J. Kautz, "Online detection and classification of dynamic hand gestures with recurrent 3d convolutional neural networks", 2016 IEEE CVPR, pp. 4207-4215, 2016.
N. Nishida, H. Nakayama, "Multimodal gesture recognition using multi-stream recurrent neural network", 7th Pacific-Rim Symposium on Image and Video Technology, LNCS vol. 9431, pp. 682-694, 2016.
C. Cao, Y. Zhang, Y. Wu, H. Lu, J. Cheng, "Egocentric gesture recognition using recurrent 3d convolutional neural networks with spatiotemporal transformer modules", 2017 IEEE ICCV, pp. 3783-3791, 2017.
D. Wu, L. Pigou, P. Kindermans, N. D. Le, L. Shao, J. Dambre, J. Odobez, "Deep dynamic neural networks for multimodal gesture segmentation and recognition", IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 8, pp. 1583-1597, 2016.
L. Zhang, G. Zhu, P. Shen, J. Song, S. A. Shah, M. Bennamoun, "Learning spatiotemporal features using 3dcnn and convolutional lstm for gesture recognition", IEEE ICCV, 2017.
S. Escalera, X. Baró, J. Gonzalez, M. Bautista, M. Madadi, M. Reyes, V. Ponce-López, H. Escalante, J. Shotton, I. Guyon, "ChaLearn Looking at People Challenge 2014: Dataset and Results", ECCV Workshops, 2014.
J. Wan, S. Z. Li, Y. Zhao, S. Zhou, I. Guyon, S. Escalera, "Chalearn looking at people rgb-d isolated and continuous datasets for gesture recognition", 2016 IEEE CVPR Workshops, pp. 761-769, 2016.
H. Wang, P. Wang, Z. Song, W. Li, "Large-scale multimodal gesture recognition using heterogeneous networks", ICCV Workshops, pp. 3129-3137, 2017.
L. Pigou, M. V. Herreweghe, J. Dambre, "Gesture and sign language recognition with temporal residual networks", ICCV Workshops, pp. 3086-3093, 2017.
J. Y. Chang, "Nonparametric gesture labeling from multi-modal data", Computer Vision - ECCV 2014 Workshops, pp. 503-517, 2015.
C. Monnier, S. German, A. Ost, "A multi-scale boosted detector for efficient and robust gesture recognition", ECCV Workshops, 2015.
I. L. O. Bastos, M. F. Angelo, A. Loula, "Recognition of static gestures applied to brazilian sign language (Libras)", SIBGRAPI, 2015.
I. L. O. Bastos, V. H. C. Melo, G. R. Goncalves, W. R. Schwartz, "MORA: A generative approach to extract spatiotemporal information applied to gesture recognition", 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2018.
J. Duan, S. Zhou, J. Wan, X. Guo, S. Z. Li, "Multi-modality fusion based on consensus-voting and 3d convolution for isolated gesture recognition", CoRR, 2016.
L. Liu, L. Shao, "Learning discriminative representations from rgb-d video data", IJCAI, pp. 1493-1500, 2013.
G. Zhu, L. Zhang, P. Shen, J. Song, S. Shah, M. Bennamoun, "Continuous gesture segmentation and recognition using 3dcnn and convolutional lstm", IEEE Transactions on Multimedia, vol. 9, 2018.
D. Tran, J. Ray, Z. Shou, S.-F. Chang, M. Paluri, "Convnet architecture search for spatiotemporal feature learning", CoRR, vol. abs/1708.05038, 2017.
Y. Song, D. Demirdjian, R. Davis, "Continuous body and hand gesture recognition for natural human-computer interaction", ACM Transactions on Interactive Intelligent Systems, vol. 2, pp. 5, 2012.
D. M. Gavrila, "The visual analysis of human movement: A survey", Computer Vision and Image Understanding, vol. 73, no. 1, pp. 82-98, 1999.
Z. Cao, T. Simon, S. Wei, Y. Sheikh, "Realtime multi-person 2d pose estimation using part affinity fields", 2017 IEEE CVPR, 2017.
N. Neverova, C. Wolf, G. W. Taylor, F. Nebout, "Moddrop: Adaptive multi-modal gesture recognition", IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 8, pp. 1692-1706, 2016.
H. Wang, P. Wang, Z. Song, W. Li, "Large-scale multimodal gesture segmentation and recognition based on convolutional neural networks", ICCV Workshops, 2017.
Z. Liu, X. Chai, Z. Liu, X. Chen, "Continuous gesture recognition with hand-oriented spatiotemporal feature", ICCV Workshops, 2017.