DataCrimeBR: Construction of a Dataset of Crimes Reported in Tweets in Brazil
Abstract
Social networks such as Twitter/X are widely used to share experiences and events, including crime reports. Automatically identifying these reports is challenging, mainly because there are few Portuguese-language datasets that capture the linguistic ambiguity and informal language that make factual descriptions hard to distinguish from figurative expressions. The DataCrimeBR dataset comprises 61,715 tweets in Portuguese, obtained through a rigorous curation process that included crime-type categorization and filters to ensure linguistic and thematic quality. The dataset was also enriched with sentiment analysis, toxicity detection, and recognition of geographic location entities. It offers a robust resource for research in Natural Language Processing and public safety, and is useful for developing and evaluating systems that detect crime reports in digital environments.
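As a rough illustration of why keyword-based crime-type categorization needs further filtering, the sketch below tags tweets by whole-word keyword matching. It is not the authors' pipeline: the category names, the keyword lists, and the functions `normalize` and `categorize` are hypothetical and chosen only for this example.

```python
import re
import unicodedata

# Hypothetical categories and keywords, for illustration only;
# the dataset's actual categories and search terms may differ.
CRIME_KEYWORDS = {
    "roubo": ["roubo", "roubaram", "assalto", "assaltaram"],
    "furto": ["furto", "furtaram"],
    "homicidio": ["homicidio", "assassinato", "mataram"],
}


def normalize(text):
    """Lowercase and strip accents so 'Homicídio' matches 'homicidio'."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def categorize(tweet):
    """Return the sorted categories whose keywords occur as whole words."""
    words = set(re.findall(r"\w+", normalize(tweet)))
    return sorted(cat for cat, kws in CRIME_KEYWORDS.items() if words & set(kws))


if __name__ == "__main__":
    # A factual report and a figurative complaint receive the same label.
    print(categorize("Roubaram meu celular no ônibus hoje"))  # ['roubo']
    print(categorize("Esse preço é um assalto"))              # ['roubo']
    print(categorize("Bom dia a todos"))                      # []
```

Both the factual report and the figurative complaint about a price are labeled `roubo`, which is the kind of ambiguity that keyword matching alone cannot resolve.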
Keywords:
Natural Language Processing, Crime Detection, Twitter, Text Mining, Public Safety
References
Abbass, Z., Ali, Z., Ali, M., Akbar, B., and Saleem, A. (2020). A framework to predict social crime through twitter tweets by using machine learning. In 2020 IEEE 14th International Conference on Semantic Computing (ICSC), pages 363–368.
Abdala, V. (2022). Pesquisa do IBGE mostra subnotificação de roubos e furtos no Brasil. Online. Agência Brasil.
Almeida, T. L. M. d. (2018). Estudo sobre aplicação de aprendizado de máquina para identificação de assaltos através de informações do twitter.
Barnes, J. (2023). Twitter Ends Its Free API: Here’s Who Will Be Affected — forbes.com. [link].
Beserra, T. (2022). Quais são os tipos de crimes sexuais previstos no Brasil? [link]. [Accessed 09-01-2025].
Bokolo, B. G., Onyehanere, P., Ogegbene-Ise, E., Olufemi, I., and Tettey, J. N. A. (2024). Leveraging machine learning for crime intent detection in social media posts. In Zhao, F. and Miao, D., editors, AI-generated Content, pages 224–236, Singapore. Springer.
Ciardo, F. (2015). Do Homicídio - Artigo 121 do Código Penal. [link]. [Accessed 09-01-2025].
Clarindo, J., Coutinho, F., and Freitas, A. (2016). Detecção de casos de violência patrimonial a partir do twitter. In Anais do V Brazilian Workshop on Social Network Analysis and Mining, pages 211–216, Porto Alegre, RS, Brasil. SBC.
Coppersmith, G., Dredze, M., and Harman, C. (2014). Quantifying mental health signals in twitter. In Proceedings of the workshop on computational linguistics and clinical psychology: From linguistic signal to clinical reality, pages 51–60.
da Fonseca Miranda, G. V., Almeida, V. G. d. J., Silva, T. R. B., and Silva, F. A. (2023). Extraçao e avaliaçao de uma base de dados sobre criminalidade em português a partir do twitter. In Anais do XV Simpósio Brasileiro de Computação Ubíqua e Pervasiva, pages 61–70. SBC.
de Araujo, A. A. (2017). Qual a diferença entre furto e roubo? [link]. [Accessed 09-01-2025].
De Choudhury, M., Counts, S., Horvitz, E. J., and Hoff, A. (2014). Characterizing and predicting postpartum depression from shared facebook data. In Proceedings of the 17th ACM conference on Computer supported cooperative work & social computing, pages 626–638.
dos Santos, L. S. F. C. (2015). Estudo online da dinâmica espaço-temporal de crimes através de dados da rede social twitter.
Dunn, N. (2024). Top 26 X (Formerly Twitter) Statistics. Online. Charle Agency.
Earl, J., McKee Hurwitz, H., Mejia Mesinas, A., Tolan, M., and Arlotti, A. (2013). This protest will be tweeted: Twitter and protest policing during the pittsburgh g20. Information, communication & society, 16(4):459–478.
Giachanou, A. and Crestani, F. (2016). Like it or not: A survey of twitter sentiment analysis methods. ACM Comput. Surv., 49(2).
Go, A., Bhayani, R., and Huang, L. (2009). Twitter sentiment classification using distant supervision. CS224N project report, Stanford, 1(12):2009.
Guillou, P. (2021). Nlp: Modelos e web app para reconhecimento de entidade nomeada (ner) no domínio jurídico. Acesso em: 19 nov. 2024.
Hernandes, R. (2023). Mudança no Twitter cria dificuldade para pesquisadores com extração e análise de dados. Folha de São Paulo. [link]. [Accessed 08-08-2025].
Lombo, X., Oyelade, O. N., and Ezugwu, A. E. (2022). Crime detection and analysis from social media messages using machine learning and natural language processing technique. In Gervasi, O., Murgante, B., Misra, S., Rocha, A. M. A. C., and Garau, C., editors, Computational Science and Its Applications 2022 Workshops, pages 502–517, Cham. Springer.
Patricio, G. S. (2023). Criação de um aplicativo para mapeamento da criminalidade da cidade de belo horizonte por meio de atividade crowdsourcing no twitter.
Pérez, J. M., Rajngewerc, M., Giudici, J. C., Furman, D. A., Luque, F., Alemany, L. A., and Martínez, M. V. (2021). pysentimiento: a python toolkit for opinion mining and social nlp tasks. arXiv preprint arXiv:2106.09462.
Sandagiri, S., Kumara, B., and Kuhaneswaran, B. (2020a). Ann based crime detection and prediction using twitter posts and weather data. In 2020 International Conference on Data Analytics for Business and Industry: Way Towards a Sustainable Economy (ICDABI), pages 1–5.
Sandagiri, S., Kumara, B., and Kuhaneswaran, B. (2020b). Detecting crime related twitter posts using artificial neural networks based approach. In 2020 20th International Conference on Advances in ICT for Emerging Regions (ICTer), pages 5–10.
Shoeibi, N., Shoeibi, N., Hernández, G., Chamoso, P., and Corchado, J. M. (2021). Ai-crime hunter: An ai mixture of experts for crime discovery on twitter. Electronics, 10(24).
Siddiqui, T., Hina, S., Asif, R., Ahmed, S., and Ahmed, M. (2023). An ensemble approach for the identification and classification of crime tweets in the english language. Computer Science and Information Technologies, 4:149–159.
Tedeschi, S., Maiorca, V., Campolungo, N., Cecconi, F., and Navigli, R. (2021). Wikineural: Combined neural and knowledge-based silver data creation for multi-lingual ner. In Findings of the association for computational linguistics: EMNLP 2021, pages 2521–2533.
Vedova, D. (2018). O que é segurança publica. [link].
Vieweg, S., Hughes, A. L., Starbird, K., and Palen, L. (2010). Microblogging during two natural hazards events: what twitter may contribute to situational awareness. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’10, pages 1079–1088, New York, NY, USA. Association for Computing Machinery.
Vo, T., Sharma, R., Kumar, R., Son, L. H., Pham, B. T., Tien Bui, D., Priyadarshini, I., Sarkar, M., and Le, T. (2020). Crime rate detection using social media of different crime locations and twitter part-of-speech tagger with brown clustering. J. Intell. Fuzzy Syst., 38(4):4287–4299.
WPR (2025). Twitter/X Users by Country 2025. Online. World Population Review. [link].
Zheng, X., Han, J., and Sun, A. (2018). A survey of location prediction on twitter. IEEE Transactions on Knowledge and Data Engineering, 30(9):1652–1671.
Published
2025-09-29
How to Cite
SILVA, Miguel A. R. e; MELO, Philipe de Freitas; SILVA, Thais R. M. Braga. DataCrimeBR: Construction of a Dataset of Crimes Reported in Tweets in Brazil. In: DATASET SHOWCASE WORKSHOP (DSW), 7., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 83-94. DOI: https://doi.org/10.5753/dsw.2025.247819.
