I wrote a quick proof-of-concept script that can traverse my email (in local Maildir format) and detach all attachments: saving them to another location and replacing them with a text attachment which documents where they've gone.

This reduces my work mailbox size substantially and allows for more space-efficient storage of the attachments (utilising all 8 bits of the byte, at the very least, but with the potential for compression to be added to the mix). It maintains the link between email and attachment, however, which is crucial if the method of discovery of the attachment in the future is to find the email.
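For illustration, here is a minimal sketch of the detaching step for a single message, using Python's standard `email` package. It is not the proof-of-concept script itself. The function name, the `save` callback and the wording of the placeholder note are all assumptions made for the example, and the message is assumed to have been parsed with `email.policy.default` so that its parts are `EmailMessage` objects.

```python
from email.message import EmailMessage
from typing import Callable


def detach_attachments(msg: EmailMessage, save: Callable[[str, bytes], str]) -> int:
    """Replace each attachment in msg with a text note; return the number detached.

    `save` is a hypothetical callback: it receives the filename and the decoded
    bytes, stores them wherever it likes, and returns a location string.
    """
    detached = 0
    # Take a snapshot of the parts first, since parts are modified in the loop.
    for part in list(msg.walk()):
        if part.get_content_disposition() != "attachment":
            continue
        # decode=True undoes base64 / quoted-printable and yields raw bytes.
        data = part.get_payload(decode=True)
        if data is None:  # container parts have no payload of their own
            continue
        name = part.get_filename() or "unnamed"
        location = save(name, data)
        note = (
            f"The attachment {name!r} ({len(data)} bytes) was detached.\n"
            f"It is now stored at: {location}\n"
        )
        # set_content() clears the old Content-* headers and payload first.
        part.set_content(note, filename=name + ".detached.txt",
                         disposition="attachment")
        detached += 1
    return detached
```

Note that this sketch would detach its own placeholder notes if run twice over the same message; a real tool would need some marker to recognise them.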

For this to be future-proof, the path where the attachment is saved needs to be future-proof too. The best scheme I can think of involves dedicating a sub-domain to the problem and using a URI scheme which is guaranteed[1] to be unique for the attachment, such as a hash-sum:

https://file.example.com/sha1/509c2fe2eba509e93987c3024a74d74583c274bd
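Deriving such a URL from an attachment's bytes is straightforward. A small sketch in Python, where `BASE_URL` is the placeholder host from the example above and the function name is made up for illustration:

```python
import hashlib

BASE_URL = "https://file.example.com"  # placeholder host, as in the example URL


def attachment_url(data: bytes, algorithm: str = "sha1") -> str:
    """Build a content-addressed URL of the form BASE_URL/<algorithm>/<hex digest>."""
    digest = hashlib.new(algorithm, data).hexdigest()
    return f"{BASE_URL}/{algorithm}/{digest}"
```

Putting the algorithm name in the path leaves room to move to a stronger hash later without invalidating the old links.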

I could finish up my script, set up my own personal infrastructure for this and convert my mailbox with a few hours more work. Before I do, I have two questions:

  • Is this already available in an open source tool: am I re-inventing the wheel needlessly?
  • Would anyone else find this useful?

  1. within the practical limitations of hashing algorithms, at least.

Comments

comment 1
I wonder if there is need for a global, unique standard for URLs referring to files by their hash. hash:///sha1/509c2fe2eba509e93987c3024a74d74583c274bd ? http://sha1.hash.arpa.in/509c2fe2eba509e93987c3024a74d74583c274bd? sha1://509c2fe2eba509e93987c3024a74d74583c274bd? I’m sure there are more applications that could benefit from that. A quick search does not turn up anything existing, though.
nomeata,
comment 2

Actually, magnet links seem to be most suitable for this: http://en.wikipedia.org/wiki/Magnet_URI_scheme

nomeata,
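As a rough illustration of the magnet-link suggestion in comment 2: the sketch below builds a link with an `xt=urn:sha1:` parameter, which conventionally carries the SHA-1 digest in Base32 rather than hex, plus an optional `dn` display name. The function name is hypothetical, and whether a given client would accept such a link for mail attachments is untested here.

```python
import base64
import hashlib
from urllib.parse import quote


def magnet_link(data: bytes, filename: str = "") -> str:
    """Build a magnet URI identifying `data` by its Base32-encoded SHA-1 digest."""
    # A SHA-1 digest is 20 bytes, which Base32-encodes to 32 characters, no padding.
    digest = base64.b32encode(hashlib.sha1(data).digest()).decode("ascii")
    link = f"magnet:?xt=urn:sha1:{digest}"
    if filename:
        link += "&dn=" + quote(filename)  # percent-encode the display name
    return link
```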
comment 3

I think it would be useful to myself, and others.

Although it isn't directly analogous I suspect there are things already that do that - the most obvious thing that springs to mind is those systems which parse and insert mails into databases, for example:

http://www.dbmail.org/

Steve Kemp,
comment 4
There is an RFC that lets you put email attachments on an external server and simply put a URL to the attachment into the email. I completely forget which one it is but I strongly suggest that you do that.
Anon4EtE4ype,
comment 5
My preference would be hash:///sha1/xxxxxxxxxxxxxxxx, and this tool sounds extremely useful. Please release!
Barak A. Pearlmutter,
comment 6

For more human-readable filenames, you could use something like:

https://file.example.com/$MESSAGE_ID/$ATTACHMENT_NUMBER-$FILENAME

The Message-ID is already "nearly unique", and in the unlikely event of a collision just check for the existence of the $MESSAGE_ID subdirectory and add a suffix "-01" (or "-02", "-03", ...) to $MESSAGE_ID.

You could also make $ATTACHMENT_NUMBER zero-padded to 2 digits for sorting.

cas,
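The naming scheme from comment 6 could be sketched roughly as follows. Both function names are invented for the example, and so are the handling of an empty Message-ID (falling back to the literal name `no-message-id`) and the limit of 99 suffixes. The directory is claimed once per message, so a second message with the same Message-ID gets the suffixed name.

```python
from pathlib import Path


def claim_message_dir(root: Path, message_id: str) -> Path:
    """Create and return a fresh directory for one message under root."""
    base = message_id.strip().strip("<>").replace("/", "_") or "no-message-id"
    candidates = [base] + [f"{base}-{n:02d}" for n in range(1, 100)]
    for name in candidates:
        path = root / name
        try:
            path.mkdir(parents=True)  # raises FileExistsError on a collision
            return path
        except FileExistsError:
            continue
    raise RuntimeError(f"too many collisions for {base!r}")


def attachment_filename(number: int, filename: str) -> str:
    """Return e.g. '03-report.pdf'; any directory components are dropped."""
    return f"{number:02d}-{Path(filename).name}"
```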
comment 7
Something else just occurred to me: using the hash saves on space for duplicates when multiple messages have the same attachment, which is a good thing. So you could store them with a hashing scheme, but also have a symlink farm using the Message-ID/AttachmentNo-Filename naming scheme above for convenient access.
cas,
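A sketch of the combination described in comment 7: the bytes are written once under their SHA-1 digest, and a human-readable symlink points at them. The function name and the `sha1/` subdirectory layout are assumptions, and it presumes a filesystem that supports symlinks.

```python
import hashlib
import os
from pathlib import Path


def store_with_alias(data: bytes, store: Path, alias: Path) -> Path:
    """Store data under store/sha1/<digest> and make `alias` a symlink to it."""
    target = store / "sha1" / hashlib.sha1(data).hexdigest()
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():  # identical content is only written once
        target.write_bytes(data)
    alias.parent.mkdir(parents=True, exist_ok=True)
    if alias.is_symlink() or alias.exists():
        alias.unlink()  # replace a stale alias
    # A relative link keeps the whole tree relocatable.
    alias.symlink_to(os.path.relpath(target, alias.parent))
    return target
```

Two messages carrying the same file then cost one stored copy plus two symlinks.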
comment 8
I used to handle a large volume of messages and I speak from experience when I say that a lot of mail servers out there will give you non-unique or even empty Message-ID headers.
Steve Kemp,
comment 9
A neat addition would be a fuse file-system which takes a Maildir with "offloaded" attachments and the directory that contains the attachment files as input, and provides a file-system containing a Maildir where the attachments appear inline as they were originally.
Comment by Anonymous,
comment 10
@Anonymous: Crikey that's ambitious! A good idea, perhaps a fun programming challenge.
jon,