Deep Learning and Mel-spectrograms for Physical Violence Detection in Audio
Abstract
There is growing interest in systems that automatically detect violence from ambient audio. In this work, we build and evaluate four classifiers for this purpose. Rather than processing the audio signals directly, we convert them into images known as mel-spectrograms and then use Convolutional Neural Networks (CNNs) to treat the task as an image classification problem, relying on networks pre-trained in that context. We tested the Inception v3, VGG-16, MobileNet v2, and ResNet152 v2 architectures. The classifier derived from the MobileNet architecture achieved the best classification results when evaluated on the HEAR Dataset, which was created for this research.
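The audio-to-image step described above can be illustrated with a minimal sketch of a log-mel-spectrogram computation. It uses only the Python standard library so that each stage (framing, windowing, power spectrum, triangular mel filters, log scaling) is visible. The function names, the HTK-style mel formula, and the parameter values (`n_fft=128`, `hop=64`, `n_mels=16`) are assumptions chosen for illustration and are not taken from the paper. In practice a library routine such as `librosa.feature.melspectrogram` would likely be used instead.

```python
import cmath
import math


def hz_to_mel(hz):
    """Convert Hz to mel (HTK-style formula, an assumption of this sketch)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters evenly spaced on the mel scale.

    Returns a list of n_mels rows, each with n_fft // 2 + 1 weights.
    """
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(sample_rate / 2.0)
    # n_mels + 2 edge points: filter m uses edges (m, m + 1, m + 2).
    edges_hz = [mel_to_hz(mel_max * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    bank = []
    for m in range(n_mels):
        left, centre, right = edges_hz[m], edges_hz[m + 1], edges_hz[m + 2]
        row = []
        for f in bin_hz:
            rising = (f - left) / (centre - left)
            falling = (right - f) / (right - centre)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank


def power_spectrum(frame):
    """Power spectrum of one Hann-windowed frame via a naive DFT."""
    n = len(frame)
    windowed = [
        x * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / n))
        for i, x in enumerate(frame)
    ]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(windowed)
        )
        spectrum.append(abs(acc) ** 2)
    return spectrum


def mel_spectrogram(signal, sample_rate, n_fft=128, hop=64, n_mels=16):
    """Log-mel-spectrogram in dB, as rows of mel bands over time frames."""
    bank = mel_filterbank(n_mels, n_fft, sample_rate)
    columns = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        power = power_spectrum(signal[start:start + n_fft])
        mel_energies = [sum(w * p for w, p in zip(row, power)) for row in bank]
        # Small offset avoids log(0) for silent bands.
        columns.append([10.0 * math.log10(e + 1e-10) for e in mel_energies])
    # Transpose so each row is one mel band and each column one time frame.
    return [list(band) for band in zip(*columns)]


if __name__ == "__main__":
    sample_rate = 8000
    # Hypothetical test signal: a 1 kHz sine tone, 512 samples long.
    tone = [math.sin(2.0 * math.pi * 1000.0 * t / sample_rate) for t in range(512)]
    spec = mel_spectrogram(tone, sample_rate)
    print(len(spec), "mel bands x", len(spec[0]), "frames")
```

The resulting two-dimensional array can be rendered as an image (for example, resized and replicated across three channels) before being passed to an image classifier. How the paper performs that final conversion is not stated in the abstract, so that step is left out here.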