Petter Christian Bjelland
André Årnes
Katrin Franke

Abstract

Approximate Hash Based Matching (AHBM), also known as Fuzzy Hashing, is used to identify complex and unstructured data that has a certain amount of byte-level similarity. Common use cases include identifying updated versions of documents and fragments recovered from memory or deleted files. Though several algorithms exist, there has not yet been an extensive focus on their practical use in digital investigations. This paper addresses the research question: How can AHBM be applied in digital investigations? It focuses on common scenarios in which AHBM can be applied, as well as the potential significance of its results. First, AHBM is assessed for use in digital investigations with respect to existing algorithms and requirements for efficiency and precision. Then follows a description of scenarios in which it can be applied. The paper presents three modes of operation for Approximate Matching, namely searching, streaming and clustering. Each of the modes is tested in practical experiments. The results show that AHBM has great potential for helping investigators discover information based on data similarity. Three open source tools were implemented during the research leading up to this paper: Autopsy AHBM enables AHBM in an existing digital investigation framework, sddiff helps in understanding AHBM results through visualization, and makecluster improves the analysis of graphs generated from large datasets by storing each disjoint cluster separately.