De-duplication means each block of storage is hashed, and any given block is stored only once: if two (or more) files contain identical blocks, the common block is saved a single time, saving you disk space.
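To make that concrete, here is a toy sketch of block-level de-duplication: a content-addressed store keyed by block hash. This is purely illustrative and nothing like the real ZFS implementation; the block size and function names are my own invention.

```python
import hashlib

def dedup_store(files, block_size=4096):
    """Toy block-level de-duplication: each distinct block is stored
    once, keyed by its SHA-256 digest; files become lists of digests."""
    store = {}    # digest -> block data (each unique block saved once)
    layouts = {}  # filename -> ordered list of block digests
    for name, data in files.items():
        digests = []
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            digest = hashlib.sha256(block).hexdigest()
            store.setdefault(digest, block)  # only saved if not seen before
            digests.append(digest)
        layouts[name] = digests
    return store, layouts

# Two files sharing an identical 4 KiB block occupy three blocks, not four.
shared = b"A" * 4096
files = {"f1": shared + b"B" * 4096, "f2": shared + b"C" * 4096}
store, layouts = dedup_store(files)
print(len(store))  # 3
```

The `layouts` map plays the role of file metadata: reading a file back is just looking up each digest in `store` in order.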
This is very topical for me because I've been thinking a lot about backup solutions lately and de-duplication is a feature I'd rather like. backuppc does it; dar and rdiff-snapshot do not. git does it and I've been looking at backup systems built around that.
The default settings in ZFS are to use the SHA256 algorithm for hashing and not to check for collisions. A collision would mean the newest write skips over a block it wrongly believes is already stored, silently corrupting every file that references it.
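The failure mode is easy to demonstrate with a deliberately terrible hash function (a genuine SHA-256 collision being rather harder to arrange). This is my own illustration, not ZFS code: when two different blocks hash alike and no verification is done, the second write is skipped and later reads return the wrong data.

```python
def weak_hash(block):
    # Deliberately awful hash (first byte only) to force a collision;
    # stands in for the astronomically unlikely SHA-256 collision.
    return block[0:1]

store = {}  # hash -> block data

def write_block(block):
    digest = weak_hash(block)
    if digest in store:   # assumed duplicate: skip the write, no verification
        return digest
    store[digest] = block
    return digest

h1 = write_block(b"Xfirst")
h2 = write_block(b"Xsecond")  # different data, same (weak) hash
assert h1 == h2
print(store[h2])  # b'Xfirst' -- the second file silently reads the wrong block
```

With the default (strong hash, no verify), ZFS is betting that this branch is never wrongly taken; the 'verify' option below removes the bet entirely.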
If this is a problem for people, the ZFS folks have implemented a (costly) 'verify' feature:
if this makes you uneasy, that's OK: ZFS provides a 'verify' option that performs a full comparison of every incoming block with any alleged duplicate to ensure that they really are the same
This will detect - and crucially handle - hash collisions. Jeff Bonwick's advice is to use a low-complexity hashing algorithm in conjunction with 'verify', to offset the additional computational workload.
There's no mention of how they resolve hash collisions. Presumably they use something like hash chains. The trade-off there needs to be weighed against the risk of collisions and the cost of the chosen hash algorithm. I suspect most people will stick to the defaults, so this code won't get much use.
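A sketch of that speculation: a cheap checksum paired with 'verify', keeping any genuinely colliding blocks in a per-hash chain. Here crc32 stands in for a low-cost checksum (Bonwick's suggested pairing is a cheap hash plus verify); the chain structure is my guess, not the actual ZFS implementation.

```python
import zlib

# checksum -> list of distinct blocks that share that checksum (the "chain")
store = {}

def write_block(block):
    chain = store.setdefault(zlib.crc32(block), [])
    for index, existing in enumerate(chain):
        if existing == block:  # 'verify': full byte-for-byte comparison
            return (zlib.crc32(block), index)  # true duplicate: skip the write
    chain.append(block)        # new block, or a real collision: store it anyway
    return (zlib.crc32(block), len(chain) - 1)

assert write_block(b"same") == write_block(b"same")  # duplicate stored once
```

The appeal of the scheme is that the expensive byte comparison only runs on checksum matches, which are rare, while every other write pays just the cheap checksum.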
It will be interesting to see some of the discussion and performance measurements that will inevitably come out once this becomes more widely available. For now, I'm quite tempted to take a peek at their source code.