Backup Data Mining
What if you have a misbehaving electronic device that might corrupt the files you store on it? I'm interested in tools that inspect incremental backups, look at file changes and alert me if they are suspicious. For example, if an MP3 changes, did the MPEG data change, or the ID3 tag, or something else? An ID3 change is likely to be on purpose, an MPEG data change less so. The same goes for JPEGs and EXIF metadata. I decided to poke around in the backups of my music player, and I discovered some pretty weird stuff.
Firstly, quite a few MP3s seem to have changed on my player over time. I found 788
instances of MP3s changing by a very small amount: an 11-byte rdiff patch.
This seemed particularly strange, so I investigated a little:
    ...
    01 Fool's Day.mp3.2012-05-31T22:40:09+01:00.diff
    00000000  72 73 02 36 47 00 00 59  01 ad 00  |rs.6G..Y...|
    0000000b
    dark_entries.mp3.2012-05-31T22:40:09+01:00.diff
    00000000  72 73 02 36 47 00 00 54  0f 58 00  |rs.6G..T.X.|
    0000000b
    ...
In a nutshell, the patches appear to do nothing, or at least copy the input
file verbatim into the output. (My rdifffs
source is a useful
document for the
rdiff file format, or failing that, the
itself.) This is probably just a behavioural nuance of my backup software.
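The two dumps above can be decoded by hand. After the four magic bytes, 0x47 reads as a copy command with a one-byte offset (0x00) and a four-byte length (0x005901ad, roughly 5.8 MB), followed by a 0x00 end marker. A minimal Python sketch of that check follows. The opcode ranges are an assumption based on librsync's command table rather than something the dumps themselves confirm, and the function names are hypothetical.

```python
DELTA_MAGIC = b"rs\x026"  # 72 73 02 36, the first four bytes of each dump


def parse_delta(data):
    """Decode an rdiff delta into ('copy', offset, length) and
    ('literal', length) tuples. Opcode layout is assumed, see above."""
    if data[:4] != DELTA_MAGIC:
        raise ValueError("not an rdiff delta")
    pos = 4
    commands = []

    def read_int(width):
        nonlocal pos
        value = int.from_bytes(data[pos:pos + width], "big")
        pos += width
        return value

    while True:
        if pos >= len(data):
            raise ValueError("truncated delta")
        op = data[pos]
        pos += 1
        if op == 0x00:                    # END
            return commands
        if 0x01 <= op <= 0x40:            # literal, length stored in the opcode
            length = op
        elif 0x41 <= op <= 0x44:          # literal, 1/2/4/8-byte length field
            length = read_int(1 << (op - 0x41))
        elif 0x45 <= op <= 0x54:          # copy, 1/2/4/8-byte offset and length
            offset = read_int(1 << ((op - 0x45) // 4))
            length = read_int(1 << ((op - 0x45) % 4))
            commands.append(("copy", offset, length))
            continue
        else:
            raise ValueError(f"unhandled opcode 0x{op:02x}")
        pos += length                     # skip the literal payload
        commands.append(("literal", length))


def copied_prefix_length(data):
    """Return N if the delta only copies bytes 0..N of the old file in
    order, or None if it contains literals or out-of-order copies."""
    expected = 0
    for cmd in parse_delta(data):
        if cmd[0] != "copy" or cmd[1] != expected:
            return None
        expected += cmd[2]
    return expected


sample = bytes.fromhex("72 73 02 36 47 00 00 59 01 ad 00")
print(copied_prefix_length(sample))  # 5833133
```

A delta is only a true no-op if the returned length also equals the size of the old file, so a real tool would need to compare the two.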
That leaves 156 changed MP3s with
rdiff patches ranging in size from 21 bytes
to 26k. The smallest are almost certainly no-ops too, just stored less
efficiently. The largest look like embedded album art being added to the ID3
tag, and I'm guessing the mid-size ones are ID3 textual changes (spelling
corrections etc.), but ID3 changes are very hard to inspect by eye in the raw.
For this reason I think a tool that could sort through file changes and pick out the ones that need human investigation would be useful. Such tools could be run automatically after backups complete, or at scheduled times. I'll probably start writing some over the next few weeks, but if you know of any that already exist or might form part of a solution, please let me know!
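As a starting point for the MP3 case, here is a minimal Python sketch that splits a file into its leading ID3v2 tag, its audio data and its trailing ID3v1 tag, then reports which part differs between two versions. It assumes both versions of the file are available in full (for example, restored from the backup), and it ignores APE tags, Lyrics3 blocks and anything else that is not ID3. The names `split_mp3` and `classify_change` are hypothetical.

```python
def split_mp3(data):
    """Split raw MP3 bytes into (id3v2_tag, audio, id3v1_tag)."""
    start = 0
    if len(data) >= 10 and data[:3] == b"ID3":
        # The ID3v2 size field is "synchsafe": four bytes, 7 bits each.
        size = 0
        for byte in data[6:10]:
            size = (size << 7) | (byte & 0x7F)
        start = 10 + size
        if data[5] & 0x10:          # footer-present flag adds ten bytes
            start += 10
        start = min(start, len(data))
    end = len(data)
    # An ID3v1 tag is the final 128 bytes and begins with "TAG".
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128
    return data[:start], data[start:end], data[end:]


def classify_change(old, new):
    """Describe how two versions of an MP3 differ."""
    if old == new:
        return "unchanged"
    if split_mp3(old)[1] == split_mp3(new)[1]:
        return "tags only"          # probably a deliberate edit
    return "audio changed"          # worth a human look


# Tiny synthetic example: same audio, different ID3v2 tag text.
audio = b"\xff\xfb" + bytes(100)
old = b"ID3\x03\x00\x00" + bytes([0, 0, 0, 5]) + b"hello" + audio
new = b"ID3\x03\x00\x00" + bytes([0, 0, 0, 7]) + b"goodbye" + audio
print(classify_change(old, new))  # tags only
```

Anything classified as "audio changed" would then be a candidate for the alert, while "tags only" changes could be logged and ignored.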