Neil Rowe (Naval Postgraduate School)

Abstract

The National Software Reference Library (NSRL) is an essential data source for forensic investigators, providing in its Reference Data Set (RDS) a set of hash values of known software. However, the NSRL RDS has not previously been tested against a broad spectrum of real-world data. The current work did this using a corpus of 36 million files on 2337 drives from 21 countries. These experiments answered a number of important questions about the NSRL RDS, including what fraction of files it recognizes of different types. NSRL coverage by vendor/product was also tested, finding 51% of the vendor/product names in our corpus had no hash values at all in NSRL. It is shown that coverage or “recall” of the NSRL can be improved with additions from our corpus such as frequently-occurring files and files whose paths were found previously in NSRL with a different hash value. This provided 937,570 new hash values which should be uncontroversial additions to NSRL. Several additional tests investigated the accuracy of the NSRL data. Experiments testing the hash values saw no evidence of errors. Tests of file sizes showed them to be consistent except for a few cases. On the other hand, the product types assigned by NSRL can be disputed, and it failed to recognize any of a sample of virus-infected files. The file names provided by NSRL had numerous discrepancies with the file names found in the corpus, so the discrepancies were categorized; among other things, there were apparent spelling and punctuation errors. Some file names suggest that NSRL hash values were computed on deleted files, not a safe practice. The tests had the secondary benefit of helping identify occasional errors in the metadata obtained from drive imaging on deleted files in our corpus. This research has provided much data useful in improving NSRL and the forensic tools that depend upon it. It also provides a general methodology and software for testing hash sets against corpora.