Classification of Unstructured Textual Data: A Case Study in the Public Security Area

  • Brenda Cardoso UFPA
  • Fantiny Santos UFPA
  • Angela Amador UFPA
  • Marisa de Andrade UFPA
  • Renato Torres UFPA
  • Nelson Neto UFPA

Abstract


Processing and classifying unstructured data remain challenging in the information age. In public security, the narratives in police reports (boletins de ocorrência, BOs) are free text with no fixed structure, which makes it harder to categorize crimes precisely and to identify the group they target. This paper therefore proposes a machine learning method to speed up context classification in BOs. The initial goal is to classify insult crimes according to whether or not they are directed at the LGBTQIA+ community, using reports from the Pará State Police. The results highlight the potential applicability of the proposed approach in real, contextualized scenarios, contributing to the work of police authorities.

Published
2024-07-21
CARDOSO, Brenda; SANTOS, Fantiny; AMADOR, Angela; ANDRADE, Marisa de; TORRES, Renato; NETO, Nelson. Classification of Unstructured Textual Data: A Case Study in the Public Security Area. In: INTEGRATED SOFTWARE AND HARDWARE SEMINAR (SEMISH), 51., 2024, Brasília/DF. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 61-72. ISSN 2595-6205. DOI: https://doi.org/10.5753/semish.2024.1989.