A Study on Approximate Matching for Similarity Search: Techniques, Limitations and Improvements for Digital Forensic Investigations
Resumo
With the rapid increase in data storage capacity, the use of automated procedures to handle the massive volume of data available nowadays is required by digital forensic practitioners. The Known File Filtering is one of the techniques employed to reduce/separate data (based on their hash values) for analysis, using a list of interest objects. However, due to the limitation of hashes in this scenario (inability to detect similar objects), new methods have been designed. These functions, called Approximate Matching (AM), are possible candidates to perform such a process because they can identify similarity in a very efficient way by creating and comparing compact representations of objects. In this work, we show how to perform the similarity search task (using AM) more efficiently using the Similarity Digest Search Strategies and perform a detailed analysis. Given the limitations found, we also propose a new strategy. Furthermore, we address some limitations of AM tools regarding the similarity detection, where many matches pointed out as similar, were indeed false positives. Another improvement made was in the comparison function of one of the most known AM tools after a theoretical assessment of it. We identified and mitigated some limitations, proposing a new and more precise similarity measurement. New applications of AM are also presented and analyzed: One for fast file identification (using sampling) and another for efficient fingerprint identification. This text is a summary of a Doctor of Science thesis presented and approved in Feb./2020 at the School of Electrical and Computer Engineering, University of Campinas, S˜ao Paulo, Brazil, and submitted to the Thesis and Dissertations Competition of SBSeg 2020.
Referências
Breitinger, F., Baier, H., and White, D. (2014a). On the database lookup problem of approximate matching. Digital Investigation, 11:S1–S9.
Breitinger, F., Guttman, B., McCarrin, M., Roussev, V., and White, D. (2014b). Approximate matching: definition and terminology. NIST Special Publication, 800:168.
Breitinger, F., Rathgeb, C., and Baier, H. (2014c). An efficient similarity digests database lookup-a logarithmic divide & conquer approach. The Journal of Digital Forensics, Security and Law: JDFSL, 9(2):155.
Harichandran, V. S., Breitinger, F., and Baggili, I. (2016). Bytewise approximate matching: The good, the bad, and the unknown. The Journal of Digital Forensics, Security and Law: JDFSL, 11(2):59.
Kornblum, J. (2006). Identifying almost identical files using context triggered piecewise hashing. Digital investigation, 3:91–97.
Moia, V. H. G., Breitinger, F., and Henriques, M. A. A. (2019). Understanding the effects of removing common blocks on approximate matching scores under different scenarios for digital forensic investigations. In XIX Brazilian Symposium on information and computational systems security, pages 1–14, São Paulo-SP, Brazil. Brazilian Computer Society (SBC).
Moia, V. H. G., Breitinger, F., and Henriques, M. A. A. (2020). The impact of excluding common blocks for approximate matching. Computers & Security, v.89:1–11. Available at: https://doi.org/10.1016/j.cose.2019.101676.
Moia, V. H. G. and Henriques, M. A. A. (2016). Sampling and similarity hashes in digital forensics: An efficient approach to find needles in a haystack. In XVI Brazilian Symposium on information and computational systems security, pages 693–702, Niteroi-RJ, Brazil. Brazilian Computer Society (SBC).
Moia, V. H. G. and Henriques, M. A. A. (2017a). A comparative analysis about similarity search strategies for digital forensics investigations. In XXXV XXXVII Brazilian Symposium on Telecommunications and Signal Processing (SBrT 2017), pages 462–466, Sao Pedro, Brazil.
Moia, V. H. G. and Henriques, M. A. A. (2017b). Fast similarity digest search: A new strategy for performing queries efficiently with approximate matching. In XVII Brazilian Symposium on information and computational systems security, Brasilia-DF, Brazil. Brazilian Computer Society (SBC).
Moia, V. H. G. and Henriques, M. A. A. (2017c). An operational costs analysis of similarity digest search strategies using approximate matching tools. In XVII Brazilian Symposium on information and computational systems security, Brasilia-DF, Brazil. Brazilian Computer Society (SBC).
Moia, V. H. G. and Henriques, M. A. A. (2017d). Similarity digest search: A survey and comparative analysis of strategies to perform known file filtering using approximate matching. Security and Communication Networks, 2017.
Moia, V. H. G. and Henriques, M. A. A. (2018). A new similarity digest search strategy applied to minutia cylinder-codes for fingerprint identification. In XVIII Brazilian Symposium on Information and Computational Systems Security, pages 99–112, Natal-RN, Brazil. SBC.
Oliver, J., Cheng, C., and Chen, Y. (2013). TLSH–a locality sensitive hash. In Cybercrime and Trustworthy Computing Workshop, pages 7–13. IEEE.
Roussev, V. (2010). Data fingerprinting with similarity digests. In IFIP International Conf. on Digital Forensics, pages 207–226. Springer.
Roussev, V. (2011). An evaluation of forensic similarity hashes. Digital investigation, 8:34–41.
Winter, C., Schneider, M., and Yannikos, Y. (2013). F2s2: Fast forensic similarity search through indexing piecewise hash signatures. Digital Investigation, 10(4):361–371.