File rescue with dd and gawk

I recently had to recover some accidentally deleted pictures from an SD card. Its owner had already tried several tools and had even taken the card to a computer store (which tried more tools), but only half of the pictures came back. It was clear that the files had only been deleted, not overwritten, as he noticed his mistake immediately and stopped using the card afterwards. Given the way deletion usually works, pretty much everything should still have been recoverable.

Turns out it was, and even without any rescue tool. When I started looking into it, the first tool I saw in the FreeBSD ports was magicrescue, but no matter what I tried, it always exited with the same error. Looking at the man page, I noticed right at the beginning:

It looks at "magic bytes" in file contents, [...] It works on any file system,
but on very fragmented file systems it can only recover the first chunk of each
file.  These chunks are sometimes as big as 50MB, however.

So the tool is file-system agnostic: it seems to only look for a certain sequence of bytes and then recover a run of consecutive bytes following it. This also means that complex file systems, or features like compression and deduplication, are not suited for recovery with magicrescue.

Makes sense. And that also applies to my case: the card had a FAT32 file system on it (as most cameras probably use), meaning there won't be any fancy file system features. Also, given that a camera stores one picture after the other (and when people delete one, it's usually the last one, right after taking it), there is probably little fragmentation.

So, basically, all I need to do is read all bytes off the card and split on certain patterns. Unfortunately, split(1) doesn't help: although it can split on a pattern, it only matches entire lines.
Inspecting the first few megabytes of the SD card revealed that the images on it are stored as Exif-JPEG files (starting with the magic numbers 0xff 0xd8, followed by 0xff 0xe1 for this subtype, details here). This is not a general-purpose signature, of course, and even for this one type of JPEG file it's nothing to rely on. But I didn't want to split on 0xff 0xd8 alone (to keep false positives low), and assumed that the camera wrote all images in the same format.
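To make the inspection step concrete, here is a minimal sketch of how one might list where that 4-byte sequence occurs, using only od and a plain awk. The function name find_markers is made up for this example, and it is only meant for the first few megabytes, since dumping a whole card through od would be very slow.

```shell
#!/bin/sh
# find_markers [FILE]: print the decimal byte offset of each ff d8 ff e1
# sequence in FILE (or in standard input when no file is given).
find_markers() {
    # od prints every byte as two hex digits; awk slides a 4-byte window over them.
    od -An -v -tx1 "$@" | awk '
        {
            for (i = 1; i <= NF; i++) {
                w = w $i                             # append the current byte
                if (length(w) > 8) w = substr(w, 3)  # keep only the last 4 bytes
                if (w == "ffd8ffe1") print n - 3     # offset where the marker starts
                n++                                  # number of bytes consumed so far
            }
        }'
}

# Self-check on a synthetic file with markers at offsets 2 and 8.
tmp=$(mktemp)
printf 'AB\377\330\377\341CD\377\330\377\341' > "$tmp"
find_markers "$tmp"   # prints 2 and 8, one per line
rm -f "$tmp"
```

On a real card, something like `dd if="$SRC" bs=1M count=8 2>/dev/null | find_markers` would show whether the markers show up at plausible distances from each other.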

Completely ignoring the end markers of JPEG files, and accepting that the recovered images might have some garbage data appended, I started splitting the data on the SD card on that 4-byte pattern. That works quite nicely with dd and gawk (note that POSIX awk won't work, as its record separator can only be a single character):

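# LC_ALL=C makes gawk treat the data as raw bytes instead of locale characters.
# Note that 0001.jpg receives whatever precedes the first marker on the card.
dd if="$SRC" bs=1M | \
  LC_ALL=C gawk 'BEGIN { FS="fs is not important"; RS="\xff\xd8\xff\xe1" } { f = sprintf("%04d.jpg", NR); print RS $0 > f; close(f) }'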

Of course, set $SRC to the device you want to recover your files from.

That's it - I was able to recover every single image off of that card, with a shell one-liner! Of course, it took a specific case to make this possible: a simple file system, no fragmentation, only JPEG files to recover, only one JPEG type to look for, etc., but it can easily be extended to suit other purposes.

Here's a slightly more convenient version as a shell script. It lets you seek into the device, limit the amount of data to read, and set an optional prefix for the recovered images (it still only looks for the same 4 bytes to separate on, though):

#!/bin/sh
# SIZE_MB and SEEK_MB are counted in 1 MiB blocks; OUT_PREFIX is optional.
if [ $# -lt 3 ]; then echo "Usage: $0 DEV SIZE_MB SEEK_MB [OUT_PREFIX]" >&2; exit 1; fi
dd if="$1" bs=1M count="$2" skip="$3" | LC_ALL=C gawk -v prefix="$4" 'BEGIN { FS="fs is not important"; RS="\xff\xd8\xff\xe1" } { f = sprintf("%s%04d.jpg", prefix, NR); print RS $0 > f; close(f) }'
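Since the script counts both the size and the seek offset in 1 MiB blocks, a known byte offset has to be converted first. Here is a small sketch of that arithmetic. The helper name to_mib, the script name rescue.sh, and the device /dev/da0 are placeholders for this example, not something from the setup above.

```shell
#!/bin/sh
# to_mib BYTES: round a byte offset down to whole MiB, the unit of bs=1M blocks.
to_mib() {
    echo $(( $1 / 1048576 ))
}

to_mib 5000000   # prints 4

# Hypothetical invocation (rescue.sh and /dev/da0 are placeholder names):
#   ./rescue.sh /dev/da0 1024 "$(to_mib 5000000)" card_
```

Rounding down is the safe direction here: starting a little too early only adds some leading junk to the first output file, whereas starting too late could cut into an image.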