jmtd → log → Backup Data Mining
What if you have a misbehaving electronic device which might corrupt the files you store on it? I'm interested in tools to inspect incremental backups, look at file changes and alert me if they are suspicious. For example, if an MP3 changes, did the MPEG data change, or the ID3 tag, or something else? An ID3 change is likely to be on purpose, an MPEG data change less so. The same goes for JPEGs and EXIF metadata. I decided to poke around in the backups of my music player, and I discovered some pretty weird stuff.
Firstly, quite a few MP3s seem to have changed on my player over time. I found 788 instances of MP3s changing by a very small amount: an 11-byte rdiff-format patch. This seemed particularly strange, so I investigated a little:
...
01 Fool's Day.mp3.2012-05-31T22:40:09+01:00.diff
00000000 72 73 02 36 47 00 00 59 01 ad 00 |rs.6G..Y...|
0000000b
dark_entries.mp3.2012-05-31T22:40:09+01:00.diff
00000000 72 73 02 36 47 00 00 54 0f 58 00 |rs.6G..T.X.|
0000000b
...
In a nutshell, the patches appear to do nothing, or at least copy the input file verbatim into the output. (My rdifffs source is a useful document for the rdiff file format, or failing that, the rdiff source itself.) This is probably just a behavioural nuance of my backup software.
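To make the "copy the input verbatim" reading concrete, here is a minimal sketch of a decoder for these patches. The command byte ranges are an assumption based on my reading of the librsync delta format (0x00 ends the patch, 0x01 to 0x44 are literals, 0x45 to 0x54 are copies with 1, 2, 4 or 8 byte offset and length fields) and should be checked against the rdiff source. The names parse_delta and is_noop are made up for this example.

```python
DELTA_MAGIC = b"rs\x026"  # bytes 72 73 02 36, as seen at the start of each dump


def parse_delta(data):
    """Decode an rdiff delta into a list of (op, offset, length) tuples."""
    if data[:4] != DELTA_MAGIC:
        raise ValueError("not an rdiff delta (bad magic)")
    pos = 4
    ops = []

    def read_int(width):
        # Read a big-endian unsigned integer and advance the cursor.
        nonlocal pos
        if pos + width > len(data):
            raise ValueError("truncated delta")
        value = int.from_bytes(data[pos:pos + width], "big")
        pos += width
        return value

    while True:
        cmd = read_int(1)
        if cmd == 0x00:                      # END of patch (assumed value)
            ops.append(("end", None, None))
            return ops
        if cmd <= 0x40:                      # LITERAL, length is the command byte
            length = cmd
        elif cmd <= 0x44:                    # LITERAL, length in 1/2/4/8 bytes
            length = read_int(1 << (cmd - 0x41))
        elif cmd <= 0x54:                    # COPY (offset, length) from the old file
            off_width = 1 << ((cmd - 0x45) // 4)
            len_width = 1 << ((cmd - 0x45) % 4)
            offset = read_int(off_width)
            ops.append(("copy", offset, read_int(len_width)))
            continue
        else:
            raise ValueError(f"unsupported command byte 0x{cmd:02x}")
        ops.append(("literal", None, length))
        pos += length                        # skip over the literal payload


def is_noop(ops, old_size=None):
    """True if the patch only copies the old file, in order, from offset 0."""
    expected = 0
    for op, offset, length in ops:
        if op == "end":
            break
        if op != "copy" or offset != expected:
            return False
        expected += length
    # Without the old file's size we can only say "copies a prefix verbatim".
    return old_size is None or expected == old_size


patch = bytes.fromhex("72 73 02 36 47 00 00 59 01 ad 00")
print(parse_delta(patch), is_noop(patch_ops := parse_delta(patch)))
```

Under those assumptions, the first dump above decodes to a single copy of 0x5901ad (5833133) bytes starting at offset 0, followed by the end marker. Passing the old file's size as old_size would confirm that the copy really covers the whole file rather than a prefix of it.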
That leaves 156 changed MP3s with rdiff patches ranging in size from 21 bytes to 26k. The smallest are almost certainly no-ops, just less efficiently stored ones. The largest looks like embedded album art in the ID3 tag being added, and I'm guessing the mid-size ones are ID3 textual changes (spelling corrections etc.), but ID3 changes are very hard to inspect by eye in the raw bytestream.
For this reason I think a tool that could sort through file changes and pick out the ones which might need human investigation would be useful. Such tools could be run automatically after backups complete, or at scheduled times. I'll probably start writing some over the next few weeks, but if you know of any that already exist or might form part of a solution, please let me know!
Comments
bup integrates with par2, which can create parity data. This enables you to both verify and repair your backup sets in one operation, and the more space you allocate to the parity data, the more redundancy you have. Maybe par2 is not useful directly in this case, but the existing solution might be interesting to you.
Thanks for the suggestions! I might consider bup again for this. My main concern about it is that you cannot throw away old increment data: that is, the backup repository can only grow.
In the specific case of MP3 files, you could try http://snipplr.com/view/4025.5422/
It calculates an MD5 hash over only the music portion of the MP3, so if you (or a media player) change some id3 info (or whatever meta info), you will still be able to check if your music stayed intact. Also great for finding duplicates in your MP3 collection, if MP3 files with the same music stream came from different sources and thus have different ID3 tags.
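The idea in this comment can be sketched roughly as follows. This is not the code from the linked snippet, which I have not reproduced here; it is a minimal illustration that assumes the only metadata present is an ID3v2 tag at the start of the file (a 10-byte header with a 4-byte "synchsafe" size) and/or a 128-byte ID3v1 tag at the end. Other tag formats, such as APE tags, are not handled, and the function names are made up for the example.

```python
import hashlib


def audio_span(data):
    """Return (start, end) offsets of the bytes that are not ID3 tags."""
    start, end = 0, len(data)

    # ID3v2: "ID3", 2 version bytes, 1 flags byte, 4 synchsafe size bytes.
    if len(data) >= 10 and data[:3] == b"ID3":
        size = 0
        for byte in data[6:10]:
            size = (size << 7) | (byte & 0x7F)   # 7 useful bits per byte
        start = 10 + size
        if data[5] & 0x10:                       # footer flag: 10 more bytes
            start += 10
        start = min(start, end)                  # guard against truncated files

    # ID3v1: fixed 128-byte block at the end, beginning with "TAG".
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128

    return start, end


def audio_md5(data):
    """MD5 of the file contents with leading/trailing ID3 tags stripped."""
    start, end = audio_span(data)
    return hashlib.md5(data[start:end]).hexdigest()


def audio_md5_of_file(path):
    # Reads the whole file into memory, which is fine for typical MP3 sizes.
    with open(path, "rb") as handle:
        return audio_md5(handle.read())
```

Two backup generations of the same MP3 that differ only in their ID3 tags should then give the same audio_md5, while a different hash points at a change in the audio data itself (or in a kind of tag this sketch does not recognise).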
I haven't actually looked at or tested it yet, but a quick Google search came up with http://code.google.com/p/diffmp3/ - granted, it dates back to 2009, but this doesn't really mean that it's not at least partly functional or maybe salvageable.
Sorry if you've already looked at it and found it wanting.