Using Temporal and Semantic Features to Detect Events in Urban Violence Videos

  • Saul Sousa da Rocha (UFPI)
  • Carlos Henrique Vale e Silva (UFPI)
  • Mateus José da Silva (UFPI)
  • Jose Rodrigues Torres Neto (UFPI)
  • Carlos Henrique G. Ferreira (UFOP)
  • Glauber Dias Gonçalves (UFPI)

Abstract


Videos published on platforms such as YouTube play a central role in covering urban violence events, but the high lexical similarity between different occurrences and the temporal proximity of postings make it hard to identify automatically which videos refer to the same event. Previous approaches predominantly explore semantic characteristics extracted from video, audio, and metadata, often relying on supervised techniques that incur high computational costs and require extensive annotations, making them impractical for continuous large-scale monitoring. Addressing this problem matters for applications such as integrating multimedia records in investigations, fact-checking, assessing the impact of events, and building reliable historical archives. This work proposes two unsupervised and complementary heuristics: one based on temporal characteristics and anchor entities extracted via named entity recognition, and another based on multiple semantic attributes integrated into a similarity graph. We evaluate these approaches on a novel dataset of more than 1,400 manually annotated videos, collected between 2019 and 2024, covering both low- and high-impact events. Results show that the temporal heuristic consistently outperforms both the semantic heuristic and a GPT-4 baseline, achieving accuracy up to 0.90 and normalized mutual information (NMI) up to 0.98, while the semantic heuristic performs better in sparse-event scenarios. We also find that including full transcripts brings no substantial gains, indicating that titles and descriptions already contain the most relevant information for the task. These findings reinforce the potential of simple, low-cost, domain-adapted solutions to outperform generic approaches in challenging video clustering scenarios.
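
To make the idea of a temporal heuristic with anchor entities more concrete, the following is a minimal sketch and not the authors' algorithm, whose details are not given in this abstract. It assumes that anchor entities have already been extracted by some named entity recognizer, and it links two videos when they were published within a few days of each other and share at least one entity. The `Video` class, the `max_days` and `min_shared` thresholds, and the example entity strings are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Video:
    # Hypothetical record: the field names are illustrative only.
    video_id: str
    published: date
    entities: frozenset  # anchor entities, assumed pre-extracted by NER


def cluster_by_time_and_entities(videos, max_days=3, min_shared=1):
    """Group videos that are close in time and share anchor entities.

    Assumption: max_days and min_shared are illustrative thresholds,
    not values taken from the paper.
    Returns one cluster label per video, in input order.
    """
    parent = list(range(len(videos)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(videos)):
        for j in range(i + 1, len(videos)):
            gap = abs((videos[i].published - videos[j].published).days)
            shared = len(videos[i].entities & videos[j].entities)
            if gap <= max_days and shared >= min_shared:
                # Merge the two groups.
                parent[find(i)] = find(j)

    return [find(i) for i in range(len(videos))]
```

Because linked pairs are merged transitively, a chain of reposts spread over more than `max_days` can still end up in one group. Whether that behavior is desirable depends on how long coverage of an event typically lasts, which is a design choice this sketch leaves open.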

Keywords: video clustering, urban violence, event detection, unsupervised learning, YouTube analysis

Published
10 November 2025
ROCHA, Saul Sousa da; VALE E SILVA, Carlos Henrique; SILVA, Mateus José da; TORRES NETO, Jose Rodrigues; FERREIRA, Carlos Henrique G.; GONÇALVES, Glauber Dias. Uso de Características Temporais e Semânticas para Detectar Eventos em Vídeos de Violência Urbana. In: BRAZILIAN SYMPOSIUM ON MULTIMEDIA AND THE WEB (WEBMEDIA), 31., 2025, Rio de Janeiro/RJ. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 464-472. DOI: https://doi.org/10.5753/webmedia.2025.16150.