Handcrafted vs. Learned Features for Automatically Detecting Violence in Surveillance Footage

Arnaldo V. Barros da Silva; Luis F. Alves Pereira

doi:10.5753/semish.2022.222887

Arnaldo V. Barros da Silva UFAPE
Luis F. Alves Pereira UFAPE

Abstract

For many years, methods for detecting violence in video relied on handcrafted features: descriptors designed by humans extracted visual information from input frames to compose feature vectors, and machine learning techniques then assigned labels to those vectors. More recently, deep learning methods have gained prominence for this task because they learn image features automatically, and they usually achieve higher accuracy than classical methods based on handcrafted features. This work evaluates learned and handcrafted features for classifying video frames as 'violence' or 'non-violence'. Our results show that learned features cannot always be claimed superior, since some violent scenes are detected only by handcrafted features.
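The classical pipeline summarized in the abstract (a human-designed descriptor turns each frame into a feature vector, and a classifier labels that vector) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' method: the descriptor is a simple magnitude-weighted histogram of gradient orientations, the classifier is a nearest-centroid rule, the function names are hypothetical, and the two synthetic 8x8 frames carry arbitrary labels.

```python
import math


def orientation_histogram(frame, bins=8):
    """Hypothetical handcrafted descriptor: magnitude-weighted histogram
    of unsigned gradient orientations over a grayscale frame (list of rows)."""
    hist = [0.0] * bins
    height, width = len(frame), len(frame[0])
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = frame[y][x + 1] - frame[y][x - 1]  # central difference in x
            gy = frame[y + 1][x] - frame[y - 1][x]  # central difference in y
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue
            angle = math.atan2(gy, gx) % math.pi  # fold orientation into [0, pi)
            index = min(int(angle / math.pi * bins), bins - 1)
            hist[index] += magnitude
    total = sum(hist)
    return [v / total for v in hist] if total else hist


def fit_centroids(vectors, labels):
    """Average the feature vectors of each label (nearest-centroid training)."""
    sums, counts = {}, {}
    for vec, label in zip(vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, vec):
    """Assign the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vec))


# Synthetic toy frames (assumption): one vertical and one horizontal edge.
# The labels attached to them are arbitrary and carry no real meaning.
vertical = [[255 if x >= 4 else 0 for x in range(8)] for _ in range(8)]
horizontal = [[255 if y >= 4 else 0 for _ in range(8)] for y in range(8)]

train_vectors = [orientation_histogram(vertical), orientation_histogram(horizontal)]
centroids = fit_centroids(train_vectors, ["violence", "non-violence"])

# A new frame with a vertical edge at a different position.
query = [[255 if x >= 2 else 0 for x in range(8)] for _ in range(8)]
predicted = predict(centroids, orientation_histogram(query))
```

A learned-feature pipeline keeps the second stage but replaces the fixed descriptor with features produced by a trained network, which is the contrast the paper evaluates.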

Keywords: video understanding, violence detection, computer vision, handcrafted features, learned features
