Bit-errors as a source of forensic information in NAND-flash memory

Jan Peter van Zandwijk *
Netherlands Forensic Institute, Laan van Ypenburg 6, 2497 GB The Hague, The Netherlands

ARTICLE INFO
Article history:
Received 26 January 2017
Accepted 26 January 2017

Keywords:
NAND-flash reliability
P/E cycles
Retention bit-errors
Error-correcting codes
Hardware forensics

ABSTRACT
The value of bit-errors as a source of forensic information is investigated by experiments on isolated NAND-flash chips and USB thumb-drives. Experiments on isolated NAND-flash chips, programmed directly using specialized equipment, show detectable differences in retention bit-errors over forensically relevant time periods with the device used within manufacturer specifications. In experiments with USB thumb-drives, the controller is used to load files at different times onto the drives, some of which have been subjected to stress-cycling. Retention bit-error statistics of memory pages obtained by offline analysis of NAND-flash chips from the thumb-drives are to some extent linked to the time files are loaded onto the drives. Considerable variation between USB thumb-drives makes interpretation of bit-error statistics in an absolute sense difficult, although in a relative sense bit-error statistics seem to have some potential as an independent side-channel of forensic information.
© 2017 The Author(s). Published by Elsevier Ltd on behalf of DFRWS. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Introduction

Due to high storage density, low access latency and low costs, NAND-flash has become the medium of choice for non-volatile data storage. This holds both for low-end consumer electronics and for high-end server based storage environments. The latter increasingly use state-of-the-art solid state disks (SSDs) containing NAND-flash memory chips for persistent data storage. With increased proliferation of SSDs in server based storage environments, more attention has been devoted to reliability and endurance of the underlying NAND-flash memory chips (Meza et al., 2015). This is all the more important, since the increase of NAND-flash storage capacity goes hand in hand with a decrease in both NAND-flash reliability and endurance. Decrease in reliability means that bit-errors are more likely to occur in NAND-flash using smaller technology and in NAND-flash which stores multiple bits in a single cell (i.e. multi level cells, MLC, or triple level cells, TLC) than in NAND-flash which stores a single bit per cell (single level cells, SLC). Also, endurance of NAND-flash decreases with the size of the technology and the use of MLCs and TLCs. The number of times a cell can reliably be programmed and erased (usually called P/E cycles) typically decreases from 100,000 for SLCs to 5000 or less for MLCs and TLCs. At the chip level, much research has been devoted to the characterization of mechanisms causing bit-errors in NAND-flash (see e.g. Cai et al., 2012). These mechanisms include retention time, read- and write-disturb and erase errors. Retention time bit-errors occur when information stored in NAND-flash memory cells changes over time. The mechanism responsible for this is the leakage of charge stored in the cells. Read- and write-disturb errors occur when information in a cell is changed because neighboring cells are being read or programmed. Erase errors occur when an erase operation on a NAND-flash memory page fails to reset the page to the erased state. In that case, the page is marked as bad and discarded for further data storage. These error mechanisms all depend on the number of P/E cycles a cell has undergone. Due to repeated programming and erasing, structures in the NAND-flash memory cells tend to deteriorate, thereby increasing the susceptibility of that cell to one of the error mechanisms and increasing the rate at which bit-errors develop. Besides P/E cycles, it is known that susceptibility of a cell to bit-errors is affected by the actual value of the data stored in the memory cell. As an example, fully programmed cells will tend to lose charge more quickly than cells which are in the erased state. Therefore, fully programmed cells are more prone to retention bit-errors than fully erased ones. In the literature, retention time is generally reported to be a larger source of bit-errors than the other error mechanisms mentioned above (e.g. Cai et al., 2012).

In order to circumvent problems associated with NAND-flash reliability, manufacturers use randomization techniques and error-correcting codes (ECCs). The former is aimed at reducing the number of errors occurring due to cell-to-cell interference (i.e. read- and write-disturb errors), while the latter provides a mechanism for correcting any errors that might have occurred in the data. To increase NAND-flash endurance, it is important to ensure that all memory pages have undergone about the same number of P/E cycles. Manufacturers employ sophisticated wear leveling techniques to accomplish this and to ensure an even spread of wear throughout the NAND-flash.

The extent to which a digital forensic practitioner is confronted with issues of NAND-flash reliability and endurance depends on the type of acquisition being performed. In many investigations, data is acquired from NAND-flash through the controller, a piece of hardware providing an interface between the NAND-flash and the host operating system. In that case, the controller takes care of error-correction by application of ECCs. It also takes care of data randomization and wear leveling, which therefore occur completely transparently to the user. In cases where acquisition through the controller is impossible, however, one must resort to low-level acquisition, i.e. reading data directly from NAND-flash. Unfortunately, no factory documentation on this type of NAND-flash was found and therefore information specifying technology size, retention time and number of P/E cycles is lacking.

In order to determine parameter values for data randomization and ECC added by the controller when storing data on the NAND-flash of the Emtec Colormix device, the test device was loaded with known data through the controller. Next, the NAND-flash was de-soldered from the printed circuit board and read using equipment developed at the Netherlands Forensic Institute (NFI). Using methods described in Van Zandwijk (2015), parameter values for data randomization and error-correcting code applied by the controller were reverse engineered from a raw dump of the content of the NAND-flash. For this controller, no LFSR-based randomization was found to be in use. However, on the basis of known data, it was possible to explicitly reconstruct the complete byte-sequence used for data randomization. This sequence was used in subsequent experiments for de-randomization of data. Table 1 summarizes parameters found for the device. The meaning of the symbols is briefly explained in the table. For a more extensive description, the reader is referred to Van Zandwijk (2015).
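The known-data approach to de-randomization can be illustrated with a minimal sketch. It assumes that the controller randomizes data by a bytewise XOR with a fixed sequence, which is a common scheme but is not confirmed here for this device, and it ignores bit-errors and the placement of ECC bytes within a page. The function names `recover_sequence` and `derandomize` are hypothetical.

```python
def recover_sequence(known_plain: bytes, raw_page: bytes) -> bytes:
    """Recover the randomization sequence from known data and its raw dump.

    Assumption: raw = plain XOR sequence, bytewise, with no bit-errors.
    """
    if len(known_plain) != len(raw_page):
        raise ValueError("known data and raw page must have equal length")
    return bytes(p ^ r for p, r in zip(known_plain, raw_page))


def derandomize(raw_page: bytes, sequence: bytes) -> bytes:
    """Undo the (assumed) XOR randomization of a raw page."""
    if len(sequence) < len(raw_page):
        raise ValueError("sequence is shorter than the page")
    return bytes(r ^ s for r, s in zip(raw_page, sequence))
```

In practice, bit-errors in the raw dump would first have to be corrected, or the sequence derived from several pages, before such a sequence could be trusted.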

Experimental protocols

Raw NAND chips

Raw NAND-flash chips were programmed and read using equipment developed at the NFI. In order to investigate the magnitude of retention bit-errors on forensically relevant timescales, the following two protocols are applied to Hynix H27UAG8T2ATR NAND-flash chips.

Raw NAND data retention protocol. Two fresh NAND-flash chips were completely programmed with random data a single time and kept in Petri dishes, either at room temperature or at 70 °C on a heating stove. Baking chips at higher temperatures is a technique often adopted to simulate longer retention times (see e.g. Cai et al., 2012; Niset and Kuhn, 2005). Using the Arrhenius model, one can compute an acceleration factor (AF) to extrapolate retention times at the higher temperature back to retention times at e.g. room temperature. Using parameters from Niset and Kuhn (2005), the AF of 70 °C with respect to room temperature is calculated to be about 60.
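For illustration, the Arrhenius acceleration factor can be written as AF = exp[(E_a / k) (1/T_use - 1/T_stress)], with temperatures in kelvin. The sketch below evaluates this expression. The activation energy of 0.8 eV and the room temperature of 25 °C are assumptions chosen for the example; they are not the parameters taken from Niset and Kuhn (2005), which are not reproduced in this text.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def acceleration_factor(ea_ev: float, t_use_c: float, t_stress_c: float) -> float:
    """Arrhenius acceleration factor of t_stress_c relative to t_use_c.

    ea_ev is the activation energy in eV; temperatures are in degrees Celsius.
    """
    t_use_k = t_use_c + 273.15
    t_stress_k = t_stress_c + 273.15
    exponent = ea_ev / BOLTZMANN_EV_PER_K * (1.0 / t_use_k - 1.0 / t_stress_k)
    return math.exp(exponent)


# Assumed values (not from the paper): Ea = 0.8 eV, room temperature = 25 °C.
af = acceleration_factor(0.8, 25.0, 70.0)
print(f"AF = {af:.1f}")
```

Under these assumed values the result lies close to the factor of about 60 quoted above, so one hour at 70 °C would correspond to roughly two and a half days at room temperature.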

Periodically, the two chips were read and the number of bit-errors was computed by comparing the read data to the data originally written to the chip. The number of bit-errors is expressed as bit-error rate (BER) = number of bit-errors / number of bits.
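The BER definition can be sketched in a few lines. This is an illustrative implementation, not the NFI tooling; the function names are hypothetical. Bit-errors are counted by XOR-ing the written and read bytes and counting the set bits.

```python
def count_bit_errors(written: bytes, read: bytes) -> int:
    """Number of bit positions in which the read data differs from the written data."""
    if len(written) != len(read):
        raise ValueError("buffers must have equal length")
    return sum(bin(w ^ r).count("1") for w, r in zip(written, read))


def bit_error_rate(written: bytes, read: bytes) -> float:
    """BER = number of bit-errors / number of bits."""
    if not written:
        raise ValueError("buffers must not be empty")
    return count_bit_errors(written, read) / (8 * len(written))
```

For example, if one bit has flipped in each byte of a two-byte buffer, `bit_error_rate` returns 2/16.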

Raw NAND stress-cycle protocol. Two fresh NAND-flash chips were subjected to a stress-cycle protocol. This protocol is aimed at...
Note that these numbers of P/E cycles were repeatedly programmed with random data and erased directly after copying.

### USB thumb-drive stress-cycle protocol
As in the USB thumb-drive data retention protocol, initially three files of 1 GB containing random data are copied onto two unused thumb-drives at times specified in Table 2. Again, one drive is stored at room temperature between addition of files, while the other is kept at 70 °C on a heating stove. After addition of the first three files, a file of 628 MB is repeatedly copied onto the drive and deleted directly after copying (200 times, see Table 2). Because 3 GB of each drive is occupied by the first three files throughout the experiments, there is no space left for wear leveling and the controller is most likely forced to repeatedly use the same remaining space on the drive for storage of the 628 MB file, thereby inducing wear in that area of the drive. After stress-cycling, two files containing 314 MB random data are copied to the drives at times described in Table 2.

### USB thumb-drive NAND-flash processing and data analysis
Directly after addition of the last file in both the data retention and the stress-cycle protocol, NAND-flash chips were desoldered from the printed circuit board of the thumb-drives. Subsequently, a raw dump of the content of the NAND-flash was produced using NFI equipment. Raw dumps of NAND-flash chips were further processed offline. This entails de-randomizing the content of memory pages and applying ECC information to correct bit-errors in the data. As a by-product, the latter produces statistical data on the number of bit-errors per memory page. In order to attribute the content of individual memory pages to files copied onto the thumb-drive, SHA1 hashes of 1 k datachunks in memory pages are computed after de-randomization and decoding. Subsequently, these hashes are compared to piecewise 1 k SHA1 hashes of the original files, similar to the approach described in Breeuwsma et al. (2007). A memory page is attributed to a specific file when the SHA1 of at least 14 of the 16 datachunks matches the SHA1 of a datachunk from that file.
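The attribution step can be sketched as follows. This is a minimal illustration, not the tooling used in the study: it assumes that a 1 k datachunk is 1024 bytes, that a decoded page holds 16 such chunks, and that chunks in a page are aligned with 1024-byte offsets in the original files. The function names are hypothetical.

```python
import hashlib

CHUNK_SIZE = 1024        # assumed size of a "1 k" datachunk in bytes
MATCH_THRESHOLD = 14     # at least 14 of the 16 chunks in a page must match


def chunk_hashes(data: bytes) -> list:
    """SHA1 digests of consecutive CHUNK_SIZE-byte pieces of data."""
    return [hashlib.sha1(data[i:i + CHUNK_SIZE]).digest()
            for i in range(0, len(data), CHUNK_SIZE)]


def build_index(files: dict) -> dict:
    """Map chunk digest -> file name for all original files (name -> content)."""
    index = {}
    for name, content in files.items():
        for digest in chunk_hashes(content):
            index.setdefault(digest, name)
    return index


def attribute_page(page: bytes, index: dict):
    """Return the file a decoded page belongs to, or None if no file qualifies."""
    counts = {}
    for digest in chunk_hashes(page):
        name = index.get(digest)
        if name is not None:
            counts[name] = counts.get(name, 0) + 1
    for name, matches in counts.items():
        if matches >= MATCH_THRESHOLD:
            return name
    return None
```

Allowing two non-matching chunks per page tolerates chunks that could not be fully corrected or that contain other data, while still requiring strong evidence before a page is assigned to a file.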

By attributing the content of a memory page to a file copied onto the thumb-drive, one at the same time assigns statistical information on the number of bit-errors in that page to the file. Bit-error information from all memory pages assigned to a file can be aggregated to form the distribution of the number of bit-errors for memory pages of that file. Comparing the distributions of bit-errors in memory pages of each of the five files copied onto the thumb-drives makes it possible to investigate to what extent the number of bit-errors in a given memory page can be used as a (noisy) source of information about which file the page belongs to. Indeed, if distributions of bit-errors of memory pages were more or less non-overlapping for different files, it would be very well possible to use bit-error statistics to assign memory pages to a specific file. Conversely, if distributions of bit-errors of two different files are strongly overlapping, it would be impossible to distinguish between the two on the basis of bit-error information alone.
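One way to quantify how strongly two such distributions overlap is sketched below. The overlap coefficient used here (the summed minimum of two normalized histograms) is an illustrative choice and is not a measure used in this study; the function names are hypothetical. A value of 0 means the distributions do not overlap, a value of 1 means they are identical.

```python
from collections import Counter


def error_distribution(page_errors: list) -> dict:
    """Normalized histogram: number of bit-errors -> fraction of pages."""
    if not page_errors:
        raise ValueError("no pages given")
    counts = Counter(page_errors)
    total = len(page_errors)
    return {errors: n / total for errors, n in counts.items()}


def overlap_coefficient(dist_a: dict, dist_b: dict) -> float:
    """Sum over all bins of the smaller of the two fractions."""
    bins = set(dist_a) | set(dist_b)
    return sum(min(dist_a.get(b, 0.0), dist_b.get(b, 0.0)) for b in bins)
```

Here `page_errors` would be the list of per-page bit-error counts of all pages attributed to one file.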

### Table 2
Details of experimental protocols used for Emtec Colormix USB thumb-drives.

<table>
<thead>
<tr>
<th>Number</th>
<th>Time (hours)</th>
<th>Data retention protocol</th>
<th>Stress-cycle protocol</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>0</td>
<td>Added file 1 (1 GB)</td>
<td>Added file 1 (1 GB)</td>
</tr>
<tr>
<td>2</td>
<td>22</td>
<td>Added file 2 (1 GB)</td>
<td>Added file 2 (1 GB)</td>
</tr>
<tr>
<td>3</td>
<td>70</td>
<td>Added file 3 (1 GB)</td>
<td>Added file 3 (1 GB)</td>
</tr>
<tr>
<td>4</td>
<td>77</td>
<td>Start stress-cycle protocol.</td>
<td>File 6 (628 MB) is copied and deleted 200 times.</td>
</tr>
<tr>
<td>5</td>
<td>101</td>
<td>Added file 4 (314 MB)</td>
<td>Added file 4 (314 MB)</td>
</tr>
<tr>
<td>6</td>
<td>101</td>
<td>Added file 5 (314 MB)</td>
<td>Added file 5 (314 MB)</td>
</tr>
<tr>
<td></td>
<td></td>
<td>End experiment, chip desoldered.</td>
<td>End experiment, chip desoldered.</td>
</tr>
</tbody>
</table>

Between additions of files, the thumb-drives are stored in glass Petri dishes at either room temperature or 70 °C.

Results

Raw NAND-flash chips

Fig. 1 shows results of applying the data retention protocol to raw Hynix H27UAG8T2ATR NAND-flash chips. This figure shows BER as a function of time for chips kept at room temperature and at 70 °C. Detectable changes in BER are found to occur over periods of days to weeks, which seem to be relevant timescales in a forensic investigation. Note the differences in the rate at which BER increases over time between NAND-flash chips kept at room temperature and those kept at 70 °C.

Results of the raw NAND-flash stress-cycle protocol are shown in Fig. 2 for chips kept at room temperature and kept at 70 °C. The figure shows the block-averaged BER as a function of block number for a number of specified retention times. See figure legend for details. Note that for both temperatures, differences in P/E cycles of specific regions in the NAND-flash chip lead to clearly detectable differences in the rate at which bit-errors develop over time after reprogramming with fresh random data.

USB thumb-drives

Table 3 presents statistical data on the number of memory pages which could be attributed to files copied to the USB thumb-drives after offline data processing. For both protocols, at least 82% of the data of files copied onto the thumb-drive could be attributed to memory pages on the basis of SHA1 hashes using offline derandomization and decoding of raw dumps produced from desoldered NAND-flash chips. Somewhat surprisingly, the batch of Emtec Colormix USB thumb-drives used in this study is found to contain 8 GB NAND-flash chips, while the capacity of the thumb-drive is specified to be 4 GB. From analysis of raw dumps, it was found that a large portion of the memory pages in the NAND-flash remains empty throughout both protocols (see Table 3). The reason for this is not clear, but possibly some hardware constraint might be used to limit the size of the NAND-flash available to the controller for storage.

Note that in Table 3 the number of empty pages is roughly the same for all four thumb-drives investigated. Besides this, note that less than 20% of the file used in stress-cycling (file 6 in Table 3) is recovered from the NAND-flash after processing. Both can be taken as an indirect indication that during the stress-cycling protocol pages have been repeatedly erased and reprogrammed. More research would be necessary to gain a more thorough understanding of the actual number of P/E cycles pages have been subjected to (see also Section “Discussion and conclusions”).

Figs. 3 and 4 show bit-error statistics of memory pages for thumb-drives kept at 70 °C and room temperature, respectively, for the two protocols used in this study. In each figure, colour coding is used to designate files loaded onto the thumb-drive. Note that memory pages belonging to the same file can get spread out throughout the NAND-flash memory due to wear leveling techniques applied by the USB-controller when storing data onto NAND-flash. Comparing the top panel from Fig. 4 with other panels from Figs. 3 and 4, it can be seen that there is considerable variability between the areas in the NAND-flash chips used by the controller for data storage. For the drives shown in Fig. 3, the first three files added onto the drives appear to be stored in two separated areas of the NAND-flash, while only one area appears to be used for storage of these files in the top panel of Fig. 4. Possibly, this indicates the use of two separate planes or dies in the NAND-flash. Similarly, the averaged bit-error level per page also varies between the four USB thumb-drives studied (compare e.g. the two panels of Fig. 3). To some extent, the base-line bit-error level appears to be related to the temperature at which the drives are stressed. Nevertheless, it can be seen from Figs. 3 and 4 that there is a definite relationship between a file and the number of bit-errors in memory pages belonging to that file (e.g. file 5, which is added last, colour coded as purple), which holds for pages located in different areas of the chip. This relationship becomes more pronounced when bit-errors are averaged over blocks of 256 pages (i.e. the black dots in Figs. 3 and 4). On the basis of earlier experiments (see Section “Experimental protocols”) we believe that 256 pages is the actual size of an erase block for this type of NAND-flash chip, but independent confirmation of this is lacking.
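Block averaging of page-wise bit-error counts can be sketched as follows. This is an illustrative implementation under stated assumptions: pages are indexed in physical order, a block consists of 256 consecutive pages (the presumed erase-block size), and pages without usable bit-error information, such as empty pages, are represented by `None` and left out of the average. The function name is hypothetical.

```python
PAGES_PER_BLOCK = 256  # presumed erase-block size for this type of NAND-flash


def block_averages(page_errors: list, pages_per_block: int = PAGES_PER_BLOCK) -> list:
    """Average number of bit-errors per page for each block of consecutive pages.

    page_errors[i] is the bit-error count of page i, or None if unavailable.
    A block without any usable page yields None.
    """
    averages = []
    for start in range(0, len(page_errors), pages_per_block):
        block = [e for e in page_errors[start:start + pages_per_block]
                 if e is not None]
        averages.append(sum(block) / len(block) if block else None)
    return averages
```

Averaging over a block reduces page-to-page noise, which is consistent with the narrower distributions seen when bit-errors are averaged over blocks.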

The effect of stress-cycling on bit-error rates can be judged by comparing bit-error statistics of file 4 (light blue) and file 5 (purple), which were both added onto the drives after stress-cycling. As can be seen from Figs. 3 and 4, memory pages of these two files appear to be stored in a more fragmented manner on the two drives which have been subjected to stress-cycling than on the other drives. On the latter, these two files are stored more sequentially, as is the case for the three files which have been added first. Due to significant variation between the behaviour of the NAND-flash chips studied, it is difficult to judge the effect of stress-cycling on bit-error rate in absolute terms. Qualitatively, it seems to be somewhat higher in the stress-cycled drives.

Figs. 5 and 6 show the distribution of page-wise bit-errors for all five files copied onto the thumb-drives for the two drives which have been subjected to stress cycling, i.e. the top panels of Figs. 3 and 4. Top panels of Figs. 5 and 6 contain page-wise distributions of bit-errors for different files copied onto the drives, while bottom panels show bit-errors averaged over blocks of 256 pages. Note that the former distributions are broader and more overlapping than the latter ones, which have smaller widths and are more separated.

As explained earlier, if distributions are non-overlapping, the number of bit-errors can be used as a means to discriminate between different files. Note that separation between distributions for individual files is larger when bit-errors are averaged over blocks of 256 pages, the presumed size of an erase block. The data shown in the bottom panels of Figs. 5 and 6 suggest that it would be possible to select memory pages belonging to file 5 (last added, colour coded as purple) on the basis of block averaged bit-error statistics. For this purpose, the page-wise bit-error distributions, shown in the top panels, appear to be less suited because page-wise bit-errors are more noisy and, hence, lead to broader distributions.

Discussion and conclusions

In this study we set out to investigate the value of page-wise bit-error statistics as a source of forensic information in NAND-flash. For this purpose, retention bit-errors were induced in both isolated raw NAND-flash chips and in USB thumb-drives. As explained earlier, besides retention time, reading and writing of data can also introduce bit-errors. More research and additional experimentation is required to shed light on the relative contribution of different sources of bit-errors. This is all the more important because contributions of different bit-error mechanisms might depend on the specific use of the NAND-flash. For instance, USB thumb-drives and SSDs are primarily used for data storage, which means that typically only user-assisted writes are performed on the NAND-flash. Then, data is kept for some time and subsequently read a limited number of times. Therefore, it seems plausible that data retention is the primary source of bit-errors in these devices.

On the other hand, the use of NAND-flash in tablet computers or smartphones might be different. Containing data belonging to the operating system of the device, the NAND-flash in these devices might be exposed to many automatic reads and writes. This different usage might have an effect on the relative contribution of different sources of bit-errors in NAND-flash.

In experiments on raw NAND-flash chips, detectable changes in bit-error statistics were observed over forensically relevant time periods while using devices within manufacturer specifications. This is encouraging, since in research specifically aimed at investigating NAND-flash reliability, chips are often tested at or beyond manufacturer specifications. The data from Fig. 2 show that BER depends both on retention time and on the number of P/E cycles a specific area of the chip has been subjected to. These results indicate that stress-cycling a NAND-flash by a limited number of P/E cycles can already lead to detectable differences in the rate at which bit-errors develop. From the viewpoint of forensic data extraction, this can be taken as a hint that fresh NAND-flash might be less susceptible to perturbations than NAND-flash that has been used for some time and therefore has presumably been exposed to some stress-cycling. This in turn could suggest that the response of a NAND-flash chip from a newly bought reference device to potentially disturbing events such as X-raying or desoldering might be different from that of a NAND-flash chip from an exhibit that has been used for some time.

Experiments on four USB thumb-drives containing the same controller and an unmarked NAND-flash chip show considerable inter-chip variability (Figs. 3–6). This variability makes it, of course, very difficult to interpret the (block averaged) number of bit-errors in an absolute sense as a measure of the time data has been present in a certain memory page. In a relative sense, however, bit-error statistics seem to offer some perspective for use in a forensic context.

Data from Figs. 3–6 show the possibility to link, at least qualitatively, (block averaged) number of bit-errors to files copied onto the drives by attributing the corresponding data to that file. By using files containing random data, it is easy to unequivocally attribute data from memory pages to specific files. In real life situations, this might be more difficult, since pages might contain the same data or fragments of data of deleted files. At any rate, in a forensic context, the connection between bit-errors and files can potentially be used to help grouping memory pages belonging to the same file together, which have been spread across the NAND-flash due to wear leveling. This can help filesystem reconstruction and smart carving of data.

Current experiments can be taken as an indication that bit-error statistics contain some independent time information, since files that have been present on the NAND-flash for different amounts of time show different numbers of retention bit-errors (cf. Figs. 3 and 4). Therefore, it would be tempting to use the (block averaged) number of bit-errors in memory pages of a file as a means of relative dating of files present on a NAND-flash device. With relative dating, however, there is the problem that it is not possible to separate the contributions of retention time and P/E cycles to the occurrence of bit-errors from one another. This means that without additional information, no distinction can be made between a file that has been briefly on part of the NAND-flash which has undergone many P/E cycles and a file which has been on part of the NAND-flash with few P/E cycles for a long time. One idea to try and compensate for the effect of P/E cycles on the rate of bit-error development would be to erase a NAND-flash chip entirely after analysis and reprogram it completely with random data. After that, the chip can be stored at a high temperature for some time to let retention bit-errors develop. If the chip is read out completely after that, one might observe that areas which have undergone the most P/E cycles will contain the most bit-errors, similar to the data shown in Fig. 2. This additional information could then be used to correct the original data for differences in P/E cycles.

Fig. 5. Histograms of distribution of number of bit-errors per page for thumb-drive kept at 70 °C, stress-cycle protocol. Same colour coding of files as used in Figs. 3 and 4. Top panel: distributions of number of bit-errors per page. Bottom panel: Distribution of number of bit-errors per page, averaged over blocks of 256 pages.

Fig. 6. Histograms of distribution of number of bit-errors per page for thumb-drive kept at room temperature, stress-cycle protocol. Same colour coding of files as used in Figs. 3 and 4. Top panel: distributions of number of bit-errors per page. Bottom panel: Distribution of number of bit-errors per page, averaged over blocks of 256 pages.

Acknowledgements

I am indebted to Marcel Breeuwsma and Ronald van der Knijff for useful comments on an early draft of this paper.

References


