Understanding the effects of removing common blocks on Approximate Matching scores under different scenarios for digital forensic investigations

  • Vitor Hugo Moia (UNICAMP)
  • Frank Breitinger (University of New Haven)
  • Marco Aurélio Henriques (UNICAMP)


Approximate Matching (AM) functions can help investigators find similar objects in digital forensic investigations. These algorithms create small, compact representations of objects (similar to hashes) that can be compared to identify similarity. However, results are often biased by common blocks: data structures found in many different files regardless of their content. In this paper, we evaluate the precision and recall of AM functions when common blocks are removed. Specifically, we analyze how the similarity score changes and how that change affects different investigation scenarios. The results show that many irrelevant matches can be filtered out and that a new interpretation of the score allows better similarity detection.
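To make the idea of common blocks concrete, the following is a minimal Python sketch, not the authors' method and not any real AM tool. It assumes fixed-size 8-byte blocks, SHA-256 block hashes, a Jaccard score over block sets, and a hypothetical rule that a block is "common" when it appears in at least `min_files` files. The names `block_hashes`, `common_blocks`, `similarity`, `matches`, and `precision_recall`, as well as the toy corpus, are illustrative assumptions.

```python
import hashlib
from collections import Counter
from itertools import combinations

BLOCK_SIZE = 8  # toy value (assumption); real tools use larger or variable-size blocks


def block_hashes(data, block_size=BLOCK_SIZE):
    """Split data into fixed-size blocks and return the set of their SHA-256 digests."""
    return {
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    }


def common_blocks(corpus, min_files=3):
    """Return block hashes that occur in at least min_files distinct files (hypothetical rule)."""
    counts = Counter()
    for data in corpus.values():
        counts.update(block_hashes(data))  # a set, so each file counts a block once
    return {h for h, n in counts.items() if n >= min_files}


def similarity(a, b, ignore=frozenset()):
    """Jaccard similarity of the two block sets, skipping blocks listed in ignore."""
    sa = block_hashes(a) - ignore
    sb = block_hashes(b) - ignore
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def matches(corpus, ignore=frozenset(), threshold=0.0):
    """Return all file pairs whose score exceeds the threshold."""
    return {
        (x, y)
        for x, y in combinations(sorted(corpus), 2)
        if similarity(corpus[x], corpus[y], ignore) > threshold
    }


def precision_recall(reported, relevant):
    """Precision and recall of the reported pairs against the truly related pairs."""
    tp = len(reported & relevant)
    precision = tp / len(reported) if reported else 1.0
    recall = tp / len(relevant) if relevant else 1.0
    return precision, recall


# Toy corpus: every file starts with the same 8-byte header (the "common block").
HEADER = b"HDR-0001"
corpus = {
    "a": HEADER + b"AAAAAAAA" + b"BBBBBBBB",
    "b": HEADER + b"AAAAAAAA" + b"CCCCCCCC",  # shares real content with "a"
    "c": HEADER + b"DDDDDDDD" + b"EEEEEEEE",  # shares only the header
}
relevant = {("a", "b")}  # assumed ground truth for this toy example

common = common_blocks(corpus, min_files=3)
print("raw:     ", precision_recall(matches(corpus), relevant))
print("filtered:", precision_recall(matches(corpus, common), relevant))
```

On this toy corpus, the shared header alone makes every pair match, so precision is 1/3. Once the header block is ignored, only the pair that shares real content remains, and precision rises to 1 while recall stays at 1. The score of that remaining pair also drops (from 0.5 to about 0.33 here), which is one simple way to see why scores may need to be read differently after common blocks are removed.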


How to Cite

MOIA, Vitor Hugo; BREITINGER, Frank; HENRIQUES, Marco Aurélio. Understanding the effects of removing common blocks on Approximate Matching scores under different scenarios for digital forensic investigations. In: SIMPÓSIO BRASILEIRO DE SEGURANÇA DA INFORMAÇÃO E DE SISTEMAS COMPUTACIONAIS (SBSEG), 19., 2019, São Paulo. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2019. p. 113-126. DOI: https://doi.org/10.5753/sbseg.2019.13966.