I think the most important part is separating the de-dup metadata from the data. I assume the de-dup is done via either digest or HMAC -- if you keep the digests (and the blocks of digests, et al.) plus pointers to the data's location in Glacier stored in S3, couldn't you do the de-dup using S3 but the actual data storage using Glacier?
This is exactly what I've been wondering - obviously cperciva is 10x (100x?) smarter in this area than I ever will be, but I kept thinking, "Just keep the digest in online storage, and the actual data in (offline?) Glacier. You don't need to access Glacier during backups to calculate digests, because you already have the digests online."
I read through his post carefully to see if he covered that possibility, but didn't really see it there.
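Something like this is what I have in mind -- a toy sketch in Python, where plain dicts stand in for S3 (the digest index) and Glacier (the block data). All the names are made up, and this is not how tarsnap actually works:

    import hashlib, hmac

    s3_index = {}    # digest -> Glacier archive id; this index lives in S3
    glacier  = {}    # archive id -> block bytes; pretend this is cold storage

    def store_in_glacier(block):
        archive_id = "archive-%d" % len(glacier)
        glacier[archive_id] = block
        return archive_id

    def backup_block(block, key=b"secret"):
        # The de-dup decision only touches the S3-resident index;
        # Glacier is written to, never read, during a backup.
        digest = hmac.new(key, block, hashlib.sha256).hexdigest()
        if digest not in s3_index:
            s3_index[digest] = store_in_glacier(block)
        return digest    # archive metadata (also in S3) references digests

    backup_block(b"same block")
    backup_block(b"same block")    # de-duped: still only one Glacier archive
    assert len(glacier) == 1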
Think that through a bit :-). So what counts as a 'retrieval'? In the tarsnap case it appears to be a block, which, as far as Glacier is concerned, is probably stored as its own archive (i.e. a file).
[ NB: huge huge guesses here about how tarsnap works ]
Now consider de-dupe in blocks (vs de-dupe in files) with an object store, combined with a backup. Let's say you have a file that is 5 'blocks' long. To recreate the file as it existed at the time you want, you fetch the original 5 blocks (5 fetches) plus each block where the deltas reside.
Compare that to the 'stupid' way of doing it, which is to store a full image of the file system for each time period, where a single fetch gets back the file from the copy of the file system you were interested in.
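To make the fetch counting concrete, here is a toy reconstruction of that 5-block file with one delta block (pure guesswork about the layout; none of this is tarsnap's real format):

    archives = {                                  # pretend this dict is Glacier
        "b0": b"AAAA", "b1": b"BBBB", "b2": b"CCCC", "b3": b"DDDD", "b4": b"EEEE",
        "d2": (1, b"xy"),                         # delta for block 2: (offset, new bytes)
    }
    fetch_count = 0

    def fetch(archive_id):                        # each call = one Glacier retrieval
        global fetch_count
        fetch_count += 1
        return archives[archive_id]

    def apply_patch(base, patch):                 # trivial stand-in for a binary diff
        off, new = patch
        return base[:off] + new + base[off + len(new):]

    base_blocks = ["b0", "b1", "b2", "b3", "b4"]  # the original file, 5 blocks long
    deltas      = {2: "d2"}                       # block index -> delta archive

    parts = [fetch(b) for b in base_blocks]       # 5 fetches for the base blocks
    for idx, d in deltas.items():                 # plus 1 fetch per delta block
        parts[idx] = apply_patch(parts[idx], fetch(d))

    print(b"".join(parts), fetch_count)           # 6 retrievals, vs 1 for a full image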
But this is where the assumptions that make that approach 'stupid' need to be re-evaluated. It is traditionally a poor choice because storage is the expensive thing, so you trade compute cycles for storage. That is normally a winning trade because you only 'spend' the compute cycles when you reconstruct the file you want, whereas you pay for storage month after month.
Except that the pricing model of Glacier makes bulk storage cheaper and algorithmic reconstruction expensive.
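Back-of-the-envelope, with made-up prices (these are assumptions for illustration, not Amazon's actual rate card; as I understand it Glacier's real retrieval pricing is based on your peak retrieval rate, which punishes lots of small fetches even harder than this flat model does):

    # Illustrative assumptions only -- not Amazon's actual prices.
    storage_per_gb_month = 0.01     # bulk storage: cheap
    retrieval_per_gb     = 0.12     # getting data back out: expensive
    per_request_fee      = 0.0001   # assume a small charge per retrieval too

    def monthly_cost(gb_stored, gb_restored, retrievals):
        return (gb_stored * storage_per_gb_month
                + gb_restored * retrieval_per_gb
                + retrievals * per_request_fee)

    # De-duped blocks: store less, but a restore pulls base blocks plus deltas.
    dedup = monthly_cost(gb_stored=60, gb_restored=8, retrievals=8)
    # Full image per period: store more, but a restore is one fetch of one archive.
    full  = monthly_cost(gb_stored=100, gb_restored=5, retrievals=1)
    print(dedup, full)   # which one wins depends entirely on these ratios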
Presumably the folks at Amazon are de-duping, after all, they get to charge per GB, and if they can sell that exact same GB to two people, well, that is a win!
So back to the question at hand.
Let's say your middleware layer is just like it is today: it figures out just the deltas in your file system since the last backup and then pushes those. You've got a file-system image plus a delta image, and by applying the delta to the full image you can get back to the current image. However, since storage is now less expensive than compute, on the Amazon side you take the delta, apply it to the latest full backup, and create a new full backup which you then store in Glacier.
So can this possibly make sense? That is the question: if you keep full copies of the latest version of every file, plus pointers to the reconstructed previous versions in your S3 metadata, can you get a lower net cost for the service implementation?
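In code, the flow I'm describing would be something like this (toy dicts standing in for Glacier and S3, all names made up):

    # Toy sketch of the proposal, not how tarsnap or Glacier actually work.
    # A "full image" is just a dict of path -> contents, and a "delta" is a
    # dict of the paths that changed; the two stores are plain dicts too.
    glacier = {}
    s3_meta = {"current_full": None, "history": []}

    def store_in_glacier(image):
        archive_id = "archive-%d" % len(glacier)
        glacier[archive_id] = dict(image)
        return archive_id

    def apply_delta(full_image, delta):
        new_image = dict(full_image)
        new_image.update(delta)              # changed/added files overwrite old ones
        return new_image

    def server_side_backup(delta):
        # The client only ships the delta; the Amazon-side worker rebuilds a
        # fresh full image and stores it, so restoring "latest" is one fetch.
        current_id = s3_meta["current_full"]
        current = glacier[current_id] if current_id else {}
        new_id = store_in_glacier(apply_delta(current, delta))
        # S3 keeps only metadata: the current full image's id, plus enough
        # history to reconstruct older versions on demand.
        s3_meta["history"].append({"delta": delta, "previous_full": current_id})
        s3_meta["current_full"] = new_id
        return new_id

    server_side_backup({"a.txt": "v1", "b.txt": "v1"})   # first backup: full image
    server_side_backup({"a.txt": "v2"})                  # later: only the delta goes up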