Police Report Similarity Search: A Case Study

  • José Alan Firmiano Araújo (UFC)
  • Ticiana L. Coelho da Silva (UFC / Insight Data Science Lab)
  • Atslands Rego da Rocha (UFC)
  • Vinicius Cezar Monteiro de Lira (Insight Data Science Lab)


Crimes occur daily, and each investigation begins with a police report. In cities with high crime rates, it is impractical to expect the police to read and analyze every crime narrative. A single crime may involve multiple victims or be reported more than once, and reports of distinct crimes may resemble one another because they share a modus operandi. This study addresses the following problem: given a police report, find the most similar report in the database. A similar police report can be either one with overlapping words (lexical similarity) or one that describes a similar modus operandi (semantic similarity). One potential solution is to represent each police report as a feature vector and compare these vectors using a similarity function. The narrative can be represented in different ways, including embedding vectors and count-based approaches such as TF-IDF. This research explores pre-trained embedding representations at both the word and sentence levels, including Universal Sentence Encoder, Word2Vec, RoBERTa, and Doc2Vec. By comparing different embedding models, we determine which representation most effectively captures semantic and lexical similarities between police reports. Furthermore, we compare the effectiveness of available pre-trained embedding models with that of a model trained specifically on a corpus of police reports. A further contribution of this work is a set of trained embedding models tailored to the domain of police reports.
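The vector-and-similarity idea described above can be illustrated with a minimal sketch of the count-based (TF-IDF) variant, using cosine similarity as the similarity function. This is not the authors' pipeline: the whitespace tokenizer, the smoothed IDF formula, the choice of cosine similarity, the function names, and the three example reports are all assumptions made for illustration only.

```python
import math
from collections import Counter

PUNCT = ".,;:!?()\"'"

def tokenize(text):
    # Assumed tokenizer: split on whitespace, strip edge punctuation, lower-case.
    tokens = (t.strip(PUNCT).lower() for t in text.split())
    return [t for t in tokens if t]

def tfidf_vectors(docs):
    # Build one sparse TF-IDF vector (dict: term -> weight) per document.
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF (an assumption; other variants are equally valid).
    idf = {term: math.log((1 + n) / (1 + k)) + 1.0 for term, k in df.items()}
    return [
        {term: count * idf[term] for term, count in Counter(toks).items()}
        for toks in tokenized
    ]

def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def most_similar(query, reports):
    # Vectorize the database and the query together so they share IDF weights,
    # then return the index and score of the closest report.
    vectors = tfidf_vectors(reports + [query])
    query_vec = vectors[-1]
    scores = [cosine(query_vec, v) for v in vectors[:-1]]
    best = max(range(len(reports)), key=lambda i: scores[i])
    return best, scores[best]

# Hypothetical narratives, invented for this example.
reports = [
    "Two men on a motorcycle stole the victim's phone at a bus stop.",
    "The victim's car was broken into overnight and the radio was taken.",
    "A store was robbed by an armed man wearing a helmet.",
]
query = "A man on a motorcycle took a phone from a woman waiting at the bus stop."

best_index, best_score = most_similar(query, reports)
print(best_index, round(best_score, 3))
```

A count-based representation like this one rewards overlapping words only. Two reports that describe the same modus operandi in different vocabulary would score low, which is the gap that embedding representations are meant to address.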
ARAÚJO, José Alan Firmiano; SILVA, Ticiana L. Coelho da; ROCHA, Atslands Rego da; LIRA, Vinicius Cezar Monteiro de. Police Report Similarity Search: A Case Study. In: BRAZILIAN CONFERENCE ON INTELLIGENT SYSTEMS (BRACIS), 12., 2023, Belo Horizonte/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 394-409. ISSN 2643-6264.