Thursday, 21 March 2019

Recovering data from a corrupted SD card with Linux

1. Introduction


So the other day, someone asked me to have a look at their SD card. It held thousands of pictures from a holiday trip, but the card had somehow become corrupted: it did not even show up in Windows anymore when plugged in. I said I'd quickly try plugging it into a Linux machine to see what would happen.

Eventually I managed to get 30GB out of the 64GB back with the help of photorec and a kernel driver patch!

I thought I'd share some of my findings on how SD cards work and how I managed to do this recovery.

2. Some background reading on SD cards and corruption


(Feel free to skip this section and go directly to the Linux part. I couldn't resist giving some background :)


2.1 SD card background


SD cards, just like SSDs, consist of some form of NAND flash plus a "controller" chip that speaks the SD protocol to the PC, interfaces with the flash, and "maintains" it. NAND flash is incredibly cheap to make, but (especially MLC/TLC/..) it has serious limitations: a single spot can only be written on the order of 1000 times, and sections have to be reread regularly to keep them from going bad. The controller hence tries to be clever and use the NAND flash as optimally as possible.

How well it succeeds typically depends on two things:
 - The quality of the controller (and its firmware)
 - How the card is being used (how much data is written to it)

[Image: SD card controller, SanDisk]
 (An example of someone who opened up the card. The NAND flash below and controller above are visible.)


SSDs support features like TRIM, which tell the controller which parts of the "virtual drive" are no longer in use and can therefore go into the controller's wear-leveling pool. SD cards typically don't have this feature, so once you have filled your card completely once, wear-leveling has to move a lot of data around on every write, which quickly reduces the lifetime of the card.

Anyway, lots more to be said about this topic, but for this post, I'll focus a bit on what happens when stuff goes bad.

(If you are interested in tearing these things down further and getting to the bottom of them, I really recommend checking out the post from bunnie, https://www.bunniestudios.com/blog/?p=3554 , and this video: https://youtu.be/r3GDPwIuRKI )



2.1.1 Corruption in SD cards and recovery scenarios


If the NAND flash starts failing because it was written to too often (either because the controller was stupid, or because the usage pattern left it no choice), the best-case scenario is that the controller detects this. In fact, it will typically detect "bad blocks", which become more and more common over time, and try to avoid those sections in the future. If enough error correction was used, the data in a newly bad block might be recoverable; otherwise that data is lost forever.

Controllers will typically try to limit the "damage" or data loss on a card by forcibly putting the card in permanent read-only mode. Whatever data was lost stays lost, but at least further writes (and wear-leveling) are halted to prevent further harm.

I have personally had 2 SD cards and 1 USB stick go permanently read-only like this.


If the card does not prevent corruption soon enough, and corruption occurs, there are a few places where it can happen:

- The "controller data" (which keeps track of where each piece of data lives in NAND, since data is moved around constantly for wear-leveling purposes)
- Critical filesystem data (e.g. for FAT, the file allocation table)
- Actual data (if some JPEGs are stored, parts of them might be bad)

The last two cases are pretty much the same as what you could have on a regular magnetic drive. The classic recovery tools can help you out. You can still try to take a byte-for-byte backup first, and analyze the data from there.

The first case, however, is by far the worst. If the controller no longer knows which part of the NAND maps to what, it might just give up altogether and not even show up as an SD card anymore (e.g. not respond to SD card commands). Perhaps it will still respond for specific regions, or it might just report those regions as corrupt. All of this is guesswork, and fully specific to the SD card in question.

The typical "recovery" scenario in this case is to physically open up the card and extract or directly probe the NAND chip to get the raw data out of it. What is corrupt will still be corrupt, but at least you will get all the bytes out of the chip that were still there. Reconstructing a "linear" disk out of that data is not trivial either, since you need the controller data for that, which may be corrupt in the first place. Companies offer this kind of service, but it is not cheap, and it obviously still does not guarantee full data recovery.

 
An example of how the NAND flash in an SD card can be interfaced to directly (a commercial service offered by companies).



3. A Linux recovery attempt


So, given how bad things might be, plugging the card into a Linux machine could go a number of ways:

- Either nothing would happen (meaning the controller would fully have given up)
- The controller would still communicate a bit, but fail to respond to certain commands  <====
- The controller would show the full drive, but its data might be a bit corrupt.

It turned out to be the second case: the SD card would not reply to commands to retrieve data from certain regions, but would reply for others.

Since the MBR was still intact, Linux correctly recognized the data partition. However, the FAT was corrupt, and when trying to read raw data from the drive, the kernel would stall for long periods of time before even responding to Ctrl+C or SIGKILL.




Experimentally trying out different regions to start with, I managed to recover gigabytes of data at over 10MB/s, but eventually the transfer would always bump into a "bad" region and slow down to mere bytes per second.
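Probing one region at a time like this can be sketched with dd. This is a minimal sketch, not the exact commands used for this recovery: the helper name dump_region, the 1 MiB block size, and the device path in the usage note are all assumptions.

```shell
# dump_region SRC DST SKIP_MIB COUNT_MIB
# Hypothetical helper: copy COUNT_MIB mebibytes from SRC to DST,
# starting SKIP_MIB mebibytes into SRC.
dump_region() {
    # conv=noerror keeps going after a read error; sync pads a failed or
    # short block with zeros so offsets in DST stay aligned with SRC.
    dd if="$1" of="$2" bs=1M skip="$3" count="$4" \
       conv=noerror,sync status=none
}
```

For example, dump_region /dev/mmcblk0 chunk.img 2048 4096 would copy 4 GiB starting at the 2 GiB mark (the device name depends on your reader). Note that dd cannot shorten the stalls themselves: a read that hangs inside the kernel still hangs.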

Aside from the fact that it was super slow, there -was- data coming out of it. However, without a FAT, there was no way of knowing what was actually part of a file and what was not. Photorec to the rescue!


3.1 Photorec to the rescue


Photorec (https://www.cgsecurity.org/wiki/PhotoRec - also available as an official Debian package, for example) is an awesome piece of software. It simply ignores the filesystem, and tries to recover files based on their header signatures. As you could have guessed from the name, it can certainly recover pictures.
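To give a feel for what "header signature" means: every JPEG file starts with the bytes FF D8 FF. The sketch below only lists the offsets where that marker occurs in a raw dump; it is an illustration of the idea, not how photorec itself is implemented (photorec does far more validation). The helper name jpeg_offsets is made up, and the sketch assumes GNU grep built with -P (PCRE) support.

```shell
# jpeg_offsets FILE
# Hypothetical helper: print the byte offset of every JPEG
# start-of-image marker (FF D8 FF) found in FILE.
jpeg_offsets() {
    # -a: treat binary data as text, -o: print only the match,
    # -b: prefix each match with its byte offset, -P: allow \xNN escapes.
    # LC_ALL=C makes grep work on raw bytes instead of UTF-8 characters.
    LC_ALL=C grep -obaP '\xff\xd8\xff' "$1" | LC_ALL=C cut -d: -f1
}
```

Each offset printed is a candidate start of a picture; a real carver then has to work out where that picture ends and whether it is intact.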





My very first run immediately turned up some pictures already! The few gigabytes of data that were quickly retrievable could also be parsed quickly using a loopback device: I first made a fast dd copy of a region that the SD card still dumped quickly into a file, and then attached that file as a virtual (loop) device for photorec to use.
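The loopback step could look roughly like the sketch below. The original commands are not preserved in this post, so this is a reconstruction under assumptions: the helper name carve_image and both arguments are hypothetical, and losetup needs root.

```shell
# carve_image IMAGE OUTDIR
# Hypothetical helper: attach a dd image as a loop device and run
# photorec on it. Must be run as root because of losetup.
carve_image() {
    # Attach the image read-only to the first free loop device and
    # remember which device was picked (e.g. /dev/loop0).
    loopdev=$(losetup --find --show --read-only "$1") || return 1
    # /log writes a photorec.log file, /d sets the directory for the
    # recovered files. photorec then continues in its interactive menu.
    photorec /log /d "$2" "$loopdev"
    # Detach the loop device again once photorec exits.
    losetup --detach "$loopdev"
}
```

photorec can also be pointed at the image file directly, which avoids the need for a loop device altogether.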




This was already quite a good success (I managed to recover some 4GB this way), but by far the biggest part of the drive (the other 60GB) was littered with "slow sections". This was too much data to just ignore. However, the estimated time to go through the entire drive was more than 100 hours. 

I then turned my attention to the Linux kernel, trying to figure out -why- the kernel slowed down so badly, and whether I could tweak some sort of timeout in the kernel to speed up the process.


3.2 The Linux kernel SD card reader driver


lsusb and lsmod quickly showed which driver I was using for the SD card reader: rtsx_usb_sdmmc

A quick inspection of the code showed that the driver supports dyndbg (dynamic debug). Hence, we can enable more verbose debugging output in dmesg with a single command:




# echo "module rtsx_usb_sdmmc +p" > /sys/kernel/debug/dynamic_debug/control

This showed output like:


 rtsx_usb_sdmmc rtsx_usb_sdmmc.0.auto: sdmmc_request
 rtsx_usb_sdmmc rtsx_usb_sdmmc.0.auto: sd_send_cmd_get_rsp: SD/MMC CMD 13, arg = 0xe6240000
 rtsx_usb_sdmmc rtsx_usb_sdmmc.0.auto: cmd->resp[0] = 0x00000900
 rtsx_usb_sdmmc rtsx_usb_sdmmc.0.auto: sdmmc_request
 rtsx_usb_sdmmc rtsx_usb_sdmmc.0.auto: sd_send_cmd_get_rsp: SD/MMC CMD 13, arg = 0xe6240000
 rtsx_usb_sdmmc rtsx_usb_sdmmc.0.auto: cmd->resp[0] = 0x00000900
 rtsx_usb_sdmmc rtsx_usb_sdmmc.0.auto: sdmmc_request
 rtsx_usb_sdmmc rtsx_usb_sdmmc.0.auto: sd_send_cmd_get_rsp: SD/MMC CMD 13, arg = 0xe6240000
 rtsx_usb_sdmmc rtsx_usb_sdmmc.0.auto: cmd->resp[0] = 0x00000900
 rtsx_usb_sdmmc rtsx_usb_sdmmc.0.auto: sdmmc_request
 rtsx_usb_sdmmc rtsx_usb_sdmmc.0.auto: sd_send_cmd_get_rsp: SD/MMC CMD 13, arg = 0xe6240000



By looking at the log timestamps, it could be seen that read commands would sometimes take longer than 10 seconds before finally timing out! Surely this could be done better.

After a bit of looking, I found the cause, a 10000 ms timeout on the USB bulk transfer in the driver, which I lowered to 100 ms:
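Spotting such stalls by eye gets tedious in a long log. A small awk filter can flag them instead. This is a sketch that assumes dmesg's default "[seconds.microseconds]" timestamp prefix (which is not shown in the excerpt above); the helper name slow_gaps is made up.

```shell
# slow_gaps LIMIT_SECONDS
# Hypothetical helper: read dmesg output on stdin and print every line
# that arrived more than LIMIT_SECONDS after the previous line.
slow_gaps() {
    awk -v limit="$1" '
        match($0, /^\[ *[0-9]+\.[0-9]+\]/) {
            # Strip the surrounding brackets and convert to a number.
            ts = substr($0, 2, RLENGTH - 2) + 0
            if (seen && ts - prev > limit)
                printf "%.3f s gap before: %s\n", ts - prev, $0
            prev = ts
            seen = 1
        }'
}
```

For example, dmesg | slow_gaps 5 lists every message that was preceded by more than 5 seconds of silence.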


--- a/drivers/mmc/host/rtsx_usb_sdmmc.c
+++ b/drivers/mmc/host/rtsx_usb_sdmmc.c
@@ -530,7 +530,7 @@ static int sd_rw_multi(struct rtsx_usb_sdmmc *host, struct mmc_request *mrq)
                pipe = usb_sndbulkpipe(ucr->pusb_dev, EP_BULK_OUT);

        err = rtsx_usb_transfer_data(ucr, pipe, data->sg, data_len,
-                       data->sg_len,  NULL, 10000);
+                       data->sg_len,  NULL, 100);
        if (err) {
                dev_dbg(sdmmc_dev(host), "rtsx_usb_transfer_data error %d\n"
                                , err);





Apparently, the SD card protocol is -so- tolerant of slow controller replies (because of the whole wear-leveling load they must bear) that the driver must take these huge delays into account. (The card -might- just still reply.)

As a result, this patch is not "correct" according to the SD card spec. But in my case, it managed to increase the read speed of the card dramatically! 

I quickly patched the module, and rebuilt it for the current kernel. A simple module reload (without even a reboot) fixed the slowdown issue.
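For reference, rebuilding and reloading a single driver can be sketched as below. This is an assumption-laden sketch rather than the exact steps I ran: it assumes the headers for the running kernel are installed, that the patched source tree matches the running kernel version, and that the card is unmounted. The helper name rebuild_and_reload and its argument are hypothetical.

```shell
# rebuild_and_reload KSRC
# Hypothetical helper: KSRC is the absolute path to the patched
# kernel source tree.
rebuild_and_reload() {
    moddir="$1/drivers/mmc/host"
    # Build only the modules in drivers/mmc/host, using the build tree
    # of the currently running kernel. M= must be an absolute path.
    make -C "/lib/modules/$(uname -r)/build" M="$moddir" modules || return 1
    # Swap the loaded driver for the freshly built one.
    sudo rmmod rtsx_usb_sdmmc
    sudo insmod "$moddir/rtsx_usb_sdmmc.ko"
}
```

If the versions do not match, insmod will refuse to load the module, which is a safe failure: the stock driver can simply be loaded again with modprobe.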

After a few hours, I managed to let photorec parse the full 64GB of the card!

4. The end result


Thanks to photorec and a small kernel driver patch, I managed to recover about 30GB of pictures out of the 64GB on the card!

Given the fact that the card did not show up in Windows at all anymore (probably because of the same timeout issue), this is a pretty good result for a software-only recovery!

As mentioned, doing a physical teardown of the card and reading the NAND directly might have recovered just a bit more. But for a non-invasive method, I don't think this is a bad result at all :)


Thanks again to Linux and the open source community :)