Federated Learning and Mel-Spectrograms for Physical Violence Detection in Audio

Victor E. de S. Silva; Tiago B. Lacerda; Péricles Miranda; André Câmara; Amerson Riley Cabral Chagas; Ana Paula C. Furtado

Victor E. de S. Silva, CESAR, https://orcid.org/0000-0002-3756-2766
Tiago B. Lacerda, CESAR, https://orcid.org/0000-0003-1524-6604
Péricles Miranda, UFRPE, https://orcid.org/0000-0002-5767-7544
André Câmara, UFRPE, https://orcid.org/0000-0002-9333-3212
Amerson Riley Cabral Chagas, CESAR, https://orcid.org/0009-0007-3218-2960
Ana Paula C. Furtado, UFRPE, https://orcid.org/0000-0002-5439-5314

Abstract

Domestic violence has increased globally as the COVID-19 pandemic combines with economic and social stresses. Some works have used traditional feature extractors to identify features in sound signals in order to detect physical violence. However, these extractors have not performed well at recognizing physical violence in audio. In addition, the use of Machine Learning is limited by the trade-off between collecting more data and preserving user privacy. Federated Learning (FL) is a technique that allows the creation of client-server networks in which anonymized training results are uploaded to a central server, which aggregates them, keeps the global model up to date, and then distributes the updated model to the client nodes. In this paper, we propose an FL approach to the problem of violence detection in audio signals. The framework was evaluated on a newly proposed synthetic dataset in which audio signals, augmented with violence extracts, are represented as mel-spectrogram images. The approach thus treats violence detection as an image classification problem, solved with pre-trained Convolutional Neural Networks (CNNs). The Inception v3, MobileNet v2, ResNet152 v2, and VGG-16 architectures were evaluated, with MobileNet v2 achieving the best accuracy (71.9%), a loss of 3.6% compared with the non-FL setting.
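The server-side aggregation step described above can be illustrated with a minimal sketch. The abstract does not state which aggregation rule is used, so the code below assumes a FedAvg-style weighted average, in which each client's parameters are weighted by its number of local training samples. The function name `federated_average`, the toy clients, and the sample counts are hypothetical and serve only as illustration.

```python
import numpy as np


def federated_average(client_weights, client_sizes):
    """Combine per-client parameter lists into one global list.

    Assumption: FedAvg-style aggregation, i.e. a weighted mean in which
    each client contributes in proportion to its local sample count.
    """
    total = float(sum(client_sizes))
    n_layers = len(client_weights[0])
    averaged = []
    for layer in range(n_layers):
        # Accumulate the weighted sum of this layer across all clients.
        acc = np.zeros_like(client_weights[0][layer], dtype=np.float64)
        for weights, size in zip(client_weights, client_sizes):
            acc += (size / total) * weights[layer]
        averaged.append(acc)
    return averaged


# Two hypothetical clients, each holding one weight matrix and one bias vector.
client_a = [np.ones((2, 2)), np.zeros(2)]
client_b = [3.0 * np.ones((2, 2)), 2.0 * np.ones(2)]

# Client B holds three times as much data, so it dominates the average.
global_weights = federated_average([client_a, client_b], client_sizes=[100, 300])
```

In a full FL round, the server would send `global_weights` back to every client, each client would continue training on its own mel-spectrogram images, and only the updated parameters (never the raw audio) would return to the server for the next aggregation.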