Why Tarsnap doesn't use Glacier (daemonology.net)
210 points by cperciva on Sept 4, 2012 | 55 comments


While I think this is a completely reasonable thing (not storing tarsnap data in Glacier), I'm not sure if the reasoning holds beyond, "For the way we designed tarsnap it doesn't make sense."

When you look at how GFS, Bigtable, or Blekko's NoSQL data store is implemented, there are layers, with the metadata forming a layer that is quite separable from the data. In all of these large stores, that separation was put in so the metadata could live in lower-latency storage than the 'bulk' data, which makes for fast access.

As cperciva relates, tarsnap spends a lot of time de-duplicating data (i.e. writing only one copy of a block with identical contents), which is great for reducing your overall storage footprint and thus your costs. But if the cost of storage is much much smaller than the retrieval cost, then your design methodology would be different.

So I would not be surprised if there were a way to build a functionally equivalent product to tarsnap with lower storage costs, if the bulk data were in Glacier and the index data in S3, or if the index data were designed such that a document recovery took exactly two retrievals (a catalog, and then the data). And of course such a system would not de-duplicate, as that would result in potentially more retrievals.

It seems that if the price ratio between S3 and Glacier were higher than the de-dupe ratio, then Glacier with no de-dupe would 'win'; otherwise de-dupe would still win. Thoughts?
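
To make that concrete, here is a rough back-of-the-envelope sketch in Python (illustrative assumptions only: roughly $0.10/GB-month for S3 and $0.01/GB-month for Glacier, circa 2012, and made-up de-dupe ratios):

    # Rough break-even sketch: de-duplicated data in S3 vs. raw data in Glacier.
    # Prices are illustrative assumptions, not quotes.
    S3_PRICE = 0.10       # $/GB-month (ballpark 2012 S3 standard tier)
    GLACIER_PRICE = 0.01  # $/GB-month

    def monthly_storage_cost(raw_gb, dedupe_ratio):
        """Return (cost of de-duped data in S3, cost of raw data in Glacier)."""
        s3_deduped = (raw_gb / dedupe_ratio) * S3_PRICE
        glacier_raw = raw_gb * GLACIER_PRICE
        return s3_deduped, glacier_raw

    # 1 TB of logical backups:
    print(monthly_storage_cost(1000, dedupe_ratio=5))   # roughly (20.0, 10.0): Glacier with no de-dupe wins
    print(monthly_storage_cost(1000, dedupe_ratio=20))  # roughly (5.0, 10.0): de-dupe on S3 still wins

The break-even point is where the de-dupe ratio equals the S3/Glacier price ratio (10:1 with these numbers).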


I'm not sure if the reasoning holds beyond, "For the way we designed tarsnap it doesn't make sense."

I agree -- that's exactly what I was saying. The title of the blog post was "Why Tarsnap doesn't use Glacier", not "Why you shouldn't use Glacier".

I think Glacier is a great service, just not a particularly good fit to Tarsnap.


I think the most important part is separating the de-dup metadata from the data. I assume the de-dup is done via either a digest or an HMAC -- if you keep the digests (and the blocks of digests, etc.), along with pointers to the data's location in Glacier, in S3, couldn't you do the de-dup using S3 but the actual data storage using Glacier?
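
Roughly, something like this (a hypothetical sketch; the bucket, vault, and layout are made up, and this is not how Tarsnap actually stores things): the digest index lives in S3 so de-dupe decisions never touch Glacier, and only block bodies go into the vault.

    # Hypothetical layout: digest -> location index in S3 (fast, queryable),
    # block bodies in Glacier (cheap, slow). Not Tarsnap's actual design.
    import json
    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")
    glacier = boto3.client("glacier")

    INDEX_BUCKET = "my-backup-index"   # hypothetical bucket
    VAULT = "my-backup-vault"          # hypothetical vault

    def store_block(digest_hex, block_bytes):
        # The de-dupe check only hits S3.
        try:
            s3.head_object(Bucket=INDEX_BUCKET, Key=digest_hex)
            return  # already stored; nothing to upload
        except ClientError:
            pass
        resp = glacier.upload_archive(vaultName=VAULT, body=block_bytes)
        s3.put_object(Bucket=INDEX_BUCKET, Key=digest_hex,
                      Body=json.dumps({"archiveId": resp["archiveId"]}).encode())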


This is exactly what I've been wondering - obviously cperciva is 10x (100x?) smarter in this area than I ever will be, but I kept thinking, "Just keep the digest in online storage, and the actual data in (offline?) Glacier. You don't need to access Glacier during backups to calculate digests, because you already have the digests online."

I read through his posting carefully, to see if he would capture that possibility - but didn't really see it there.


Think that through a bit :-). So what counts as a 'retrieval'? Well, in the tarsnap case it appears to be a block, which, as far as Glacier is concerned, probably looks like a file.

[ NB: huge huge guesses here about how tarsnap works ]

Now consider de-dupe on blocks (vs. de-dupe on files) with an object store, combined with a backup. Let's say you have a file that is 5 'blocks' long. You have to fetch the original 5 blocks (5 fetches), then each block where deltas reside, to recreate the file you want, as of the point in time you want it.

Compare that to the 'stupid' way of doing it, which is to store a full image of the file system for each time period, where you need only one fetch to get the file back from the copy of the file system you were interested in.

But this is where the assumptions that make that approach 'stupid' need to be re-evaluated. It is a poor choice when storage is the expensive thing, so instead you trade compute cycles for storage. That is normally a winning trade, because you only 'spend' the compute cycles when you reconstruct the file you want, while you pay for storage month after month.

Except that the pricing model of Glacier makes bulk storage cheaper and algorithmic reconstruction expensive.

Presumably the folks at Amazon are de-duping; after all, they get to charge per GB, and if they can sell that exact same GB to two people, well, that is a win!

So back to the question at hand.

Let's say your middleware layer works just like it does today: it figures out just the deltas in your file system since the last backup and then pushes those. You've got a file system image plus a delta image, and by applying the delta to the full image you can get back to the current image. However, since storage is now less expensive than compute, on the Amazon side you take the delta, apply it to the latest full backup, and create a new full backup, which you then store in Glacier.

So can this possibly make sense? That is the question: if you keep full copies of the latest version of every file, plus pointers to the reconstructed previous versions in your S3 metadata, can you get a lower net cost for the service implementation?


The deduplication is done on the client side; and yes, it is done by comparing HMACs of blocks.
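
For illustration, client-side block de-duplication keyed by HMAC looks roughly like this - a simplified sketch only (fixed-size blocks and a single key; Tarsnap's actual blocking, key handling, and encryption are more involved):

    # Sketch of client-side de-duplication by HMAC of blocks.
    import hmac, hashlib

    BLOCK_SIZE = 64 * 1024
    seen = {}            # HMAC hex -> block; stands in for the server-side block store

    def dedupe_file(path, key):
        refs = []        # the "archive" is just an ordered list of block references
        with open(path, "rb") as f:
            while True:
                block = f.read(BLOCK_SIZE)
                if not block:
                    break
                tag = hmac.new(key, block, hashlib.sha256).hexdigest()
                if tag not in seen:      # only previously unseen blocks would be uploaded
                    seen[tag] = block
                refs.append(tag)
        return refs

    # e.g. refs = dedupe_file("mailbox.mbox", b"some-secret-key")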


"But if the cost of storage is much much smaller than the retrieval cost, then your design methodology would be different."

If you read between the lines a bit on how the second-to-last paragraph would manifest in code, I think that's what he's already getting at. A Glaciated machine would probably not get global de-duping, but merely machine-level de-duping, if even that. (The retrieval cost structure is quite bizarre.)


This would indeed be true for a sufficiently large price advantage of Glacier over S3, but note that tarsnap typically saves ~99.99% by de-duplicating my data (daily backups; usually only a few e-mail and code files have changed). I imagine that the typical usage pattern may be a bit less extreme, but the price differential would have to be huge to be worth it.


I don't know if only sending the changes is considered "De-Duping" (though it is related) - That's more equivalent to doing differential (or block level differential) backups.

De-Duping is if you add the same file (or portions of a file) in 10 different directories - the data is only stored once in backups.

Dropbox takes it a step further - if 1,000 people store the xcode DMG on their dropbox store, only one copy is stored in Dropbox.


I don't know if only sending the changes is considered "De-Duping" (though it is related) - That's more equivalent to doing differential (or block level differential) backups.

The way Tarsnap does things, it's the same thing; Tarsnap splits data into blocks and deduplicates those blocks without caring whether each block is a whole file or part of a file or several small files stuck together.


When calculating dedupe ratios you should be careful what baseline you use. Performing full backups every day is generally not realistic. I would compare dedupe-based systems against classical incremental backups.


Based on the article it seems like the motivation for the deduplication is as much bandwidth conservation as reduction of storage costs. What you're proposing would make sense if storage costs are your only concern, but it shuts out bandwidth limited users.


An excellent point. So capturing the bandwidth savings and the storage cost savings suggests an interesting mid-tier between longer term storage and shorter term.


There's no bandwidth cost associated with transferring data TO Tarsnap (that is, data transfer into EC2/S3 is free of charge), and if you have to get a file out the cost is the same whether you store files or "blocks" (I'm assuming the block disassembly is done on EC2).

So our hypothetical Tarsnap alternative could store file metadata in S3, recently modified files in S3 also (quicker retrieval) and push older files into Glacier (can be retrieved when needed, cheaper storage if (as is likely) never accessed).

It will take marginally more time and bandwidth to transfer a file every time there is a change than to only transfer changed blocks (i.e. parts of files that have changed) but for 90% of users I would bet the (significant?) decrease in long-term storage cost would make that worthwhile.
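
A small sketch of that tiering rule (the 30-day threshold and the names are arbitrary assumptions): recently modified files stay in S3 for quick retrieval, older ones get pushed to Glacier.

    # Hypothetical age-based tier selection for the Tarsnap alternative described above.
    import time

    RECENT_SECONDS = 30 * 24 * 3600   # assumption: "recent" = modified in the last 30 days

    def choose_tier(mtime_epoch, now=None):
        now = now if now is not None else time.time()
        return "s3" if (now - mtime_epoch) < RECENT_SECONDS else "glacier"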


If you read the article, you don't have to assume (incorrectly) that the client sends the whole file for each backup. Indeed, a significant motivation is to avoid sending very large files across bandwidth that is constrained (or costly) on the client side.

> I currently have about 1500 such archives stored. Instead of uploading the entire 38 GB — which would require a 100 Mbps uplink, far beyond what Canadian residential ISPs provide — Tarsnap splits this 38 GB into somewhere around 700,000 blocks, and for each of these blocks, Tarsnap checks if the data was uploaded as part of an earlier archive.


That's not what I assumed at all, please re-read my comment.

I'm saying that if it meant using Glacier was possible, doing delta-backups on files, rather than blocks, might not be a bad tradeoff.


"There's no bandwidth cost associated with transferring data TO Tarsnap"

Except for those of us whose outbound bandwidth isn't free and/or unlimited. It makes no difference what Amazon charges Tarsnap for it - I can't upload all of my 128G SSD every hour - I just don't have the bandwidth to do it, and if I _did_ have the bandwidth, it'd probably send me broke pretty quickly (or have my ISP throttle me or cut me off).

It's not a Tarsnap cost, but it can be a Tarsnap user cost, and I suspect cperciva considers that just as much a "real cost" as actual monetary expenses to Tarsnap.


The point of a dedupe-based service like Tarsnap is that while logically you upload the whole disk, physically only the changed blocks are sent to Tarsnap.

It is extremely convenient.


Well, it's possible to make the deduplication work with Glacier, though. Colin's right that you want to phase versions of the whole tar into slower and cheaper storage, but the technical problem of whether blocks are used in newer versions doesn't actually seem to be too much of a problem. You can compute it in a batch for extant tars, and track it for new ones. (And you can keep blocks-of-references always in S3 rather than Glacier, say.) The problem he identifies as most troublesome is what if you see a case where you would want to deduplicate, but the similar block is in Glacier, and so would be a bottleneck for the whole deduplication process. In this case, you could always treat it as a non-match, right? And optionally have some way to track it and then when it gets moved to Glacier either determine whether it was actually a match all along, or just have some duplicate data.

Of course, I don't know how tarsnap names its blocks or stores them, so I don't know how feasible it is to have two blocks with the same name because they had the same hash but there was a byte-by-byte mismatch, or if that's even a problem.

I mean, blocks that have been moved to Glacier because there are no references to them from indices on S3 can be assumed to be less likely to show up in new archives. It's a trade-off, but my experience with deduplication is that it's often not much of a trade-off to get rid of old things even though the magical thinker in me is tempted to think "but what if that chunk happens to show up again somewhere else!?"
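
A sketch of that "treat it as a non-match" rule (hypothetical data structures; nothing here reflects Tarsnap's actual internals): the de-dupe lookup only trusts matches whose bodies are still in fast storage, and accepts a little duplication when the only candidate lives in Glacier.

    # Hypothetical: accept duplicate storage rather than wait hours for Glacier
    # to confirm a byte-for-byte match.
    def resolve_block(tag, block, hot_index, hot_store, cold_index):
        """Indices map HMAC tag -> location; hot_store holds the S3-resident block bodies."""
        if tag in hot_index and hot_store.get(tag) == block:
            return hot_index[tag]              # verified de-dupe hit against fast storage
        if tag in cold_index:
            pass                               # match is in Glacier: too slow to verify,
                                               # so fall through and store a duplicate
        hot_store[tag] = block
        hot_index[tag] = ("s3", tag)           # made-up location scheme
        return hot_index[tag]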


Glacier is simply meant to solve a different problem than S3.

I will be using Glacier for storage of litigation hold data -- potentially many terabytes of stuff that is being held because a plaintiff asked to hold "everything" and a judge agreed.

So I'll pay to store this stuff for a few years, and in the off-chance that I need to retrieve it, the other party will pay at least half. (A powerful incentive for them to not ask for the data to be retrieved in the first place!)

Basically, you want the data that you're sending there to be contiguous -- no fancy de-duping or incrementalism. If you are going to use it for backup, just tar up everything and send it up in blocks that make sense from a recovery-cost POV.


Another reason, which I decided deserved to be written up as a separate blog post, is the surprising nature of the "data retrieval" component of Glacier's pricing model: http://news.ycombinator.com/item?id=4475319


I like articles like this.

1. Explains what the new whiz-bang technology is.

2. Explains what Tarsnap is.

3. Provides technical explanation that lists both the upsides of the new technology along with the downsides.


Thanks! I like blog posts like this too -- they appeal to my didactic instincts.


Agreed. Too often we hear of new tech via "this is the best thing ever" commentary, followed a few months later by the inevitable "actually that thing sucks" commentary.

Much better is "here are the trade offs, based on actual usage" commentary like this. I wish there was more stuff like this during the first wave of NoSQL mania.


I'd bet that this use case is pretty common: I have only a small amount of data (a few hundred MB in my case) that I deem day-to-day critical, but hundreds of GB that I don't want to lose. My backup solution is effective but primitive: parchives on two offline hard drives, one of which is at my parents' house seven hours away. Backing up takes little effort but it's easy to forget, especially at the "remote site".

It's really annoying that there are lots of easy backup solutions for online data, but nothing for cheap backup of large amounts of data that can afford to be offline for a while. Glacier is the perfect solution, but I dread having to figure out whatever I have to do--divide things into a 180-part archive and download it over a month?--to get data back. The first glaciated backup solution on the market that I think I can trust will get my dollars almost immediately.


Some people might not be aware that CrashPlan supports backing up between your own PCs (eg. to your parents) for free:

http://www.crashplan.com/consumer/crashplan.html


Your remote site method is my method too. I'm not worried about city-wide catastrophes, so I keep the spare drives closer for easier updating.


I was thinking about one-time, single-shot backups, but encrypted with tarsnap. No dedup. After all, storage is cheap and this is supposed to be super worst-case-scenario stuff, so maybe I don't want to go to the trouble of reconstructing the backup. I want to know it's there. But my interest is only theoretical.


If you have a command line tool to access Glacier, I think what you can do is make an archive (with tar), encrypt it with gpg, and just use the command line tool to upload the encrypted archive. So you don't need tarsnap for that, but if you want fancy deduplicating incremental encrypted backup, that's where tarsnap shines.
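
A rough sketch of that pipeline in Python (assuming tar and gpg are on the PATH and boto3 credentials are configured; the vault name and recipient are placeholders, and a real tool would use multipart uploads for large archives):

    # Sketch: tar a directory, encrypt with gpg, upload the result to a Glacier vault.
    import subprocess
    import boto3

    def backup_to_glacier(directory, vault="my-vault", recipient="me@example.com"):
        tar = subprocess.Popen(["tar", "-cz", directory], stdout=subprocess.PIPE)
        gpg = subprocess.Popen(["gpg", "--encrypt", "--recipient", recipient],
                               stdin=tar.stdout, stdout=subprocess.PIPE)
        tar.stdout.close()                  # let tar get SIGPIPE if gpg exits early
        encrypted = gpg.communicate()[0]    # fine for modest archives; stream for big ones
        glacier = boto3.client("glacier")
        resp = glacier.upload_archive(vaultName=vault,
                                      archiveDescription=directory,
                                      body=encrypted)
        return resp["archiveId"]            # keep this ID: you need it to retrieve later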


Yes, but I'm already using tarsnap. Running it with "--backup-to-glacier" is easier than rolling my own. I don't want two passwords, two super duper secret keys, two online services I have to pay, ....


Sounds like setting up a Glaciated Tarsnap machine would do what you want, then? Or maybe I'm misunderstanding what you're saying.


More or less. But I may also want to use the same machine key to perform smaller incremental backups (smaller data set), and I definitely don't want to have to print out two different keys to store in my safe. Just thinking here.


Does anyone know of products that already incorporate Glacier? Basically, I am looking for a solution to back up my photo library. It's one big file with a "Write Once, Read hopefully never" access policy. I guess I could go and push it to Glacier by myself, but I'd rather have a library/product do it for me.


Glacier was rolled out on August 21st - I don't think anyone has rolled out code to support it in a consumer friendly fashion - but drop a line to http://www.haystacksoftware.com/arq/ and encourage them to give you the ability to target Glacier for your Archives, and I bet you'll see something fairly soon.


My startup Degoo (http://degoo.com) is working on it. We're building a P2P backup system where we will be using Glacier as a fall-back whenever we don't have enough reliable peers. We're planning on launching our private beta in the coming months.


I'm confused about the reason a block can't exist in both S3 and Glacier at the same time, if the deduplication code decides that block is needed in a new archive.

Why couldn't you simply have a rule that each file is either S3 or Glacier, and S3 lists-of-blocks can only reference other S3 blocks, while Glacier lists of blocks can only reference other Glacier blocks?

In the worst case, where every block was in both archives, this would only increase costs by 10% if Glacier costs a tenth what S3 costs.


What happens when you decide that you don't want that new archive to be in S3 any more and tell Tarsnap to migrate it over to Glacier?


To S3 it's like you just deleted the file; to Glacier it's like you just created the file.


Except that then we're trying to store two different blocks in Glacier with the same hash ID.


From that I assume that if a block's hash matches something that's already in the archives then you retrieve the archive block(s) with the same hash ID in order to verify that it is exactly the same (byte-for-byte)? And this wouldn't be possible with Glacier as you can't just retrieve the block from storage to check there and then.

Do you have any stats on the number of collisions you've seen?


I've got terabytes of data that need to be archived and possibly recalled later (trading logs and market data preserved in case of an audit). I'm not at the scale of Glacier. If Tarsnap had a read-only vault (files in said vault would never change), then it would be able to distinguish those files and split them between the Glacier and S3 offerings.


Drop a note to http://www.haystacksoftware.com/arq/ - I'll wager they will have an option to create "Glaciered Archives" of folders within a few months. The developer there has already expressed some interest, and suggested that Arq would lend itself to pointing at a folder and storing it off to a Glacier archive (with all the caveats around the 4-hour delay, 24 hours of availability, costs associated with restores, etc...).


[I'm the developer] I'm hoping storing the metadata in S3 and the actual file data in Glacier would work well. I'm still looking into it.

[edited for clarity]


That's awesome news. Just being able to take a folder on my Drive, and say, "Archive this 5 Gigabytes of photos for the next 50 years" - and know it will cost me around $30 (and, as time goes on, likely less as the storage drops) and I won't have to worry (much) about it, will be a big win - even if you don't come up with an easy way of integrating with the normal S3 backups - I bet a lot of people will love that feature.


I think the best use-case for Glacier is large files that you would absolutely hate to lose. The thing that people cry over when their laptops get stolen.

For home use, all the family photos and videos. An archive of your emails (outlook.pst) because a lot of important data is stored in there. All your taxes and accounting data from years past. Bulky stuff that takes up space on your home system, but isn't used daily.

In business, many companies use "Iron Mountain" to archive their paperwork. Old invoices, reports, things that were important in the past and may someday be needed. That's what Glacier is for.

Glacier is archiving, not backup. You might want to take advantage of cheap storage by keeping your backups there, but that's a different issue.


I don't get the impression Glacier is right for any home user. The pricing scheme is just too weird, and there are weird delays and time limits involved. I believe you have 24 hours to retrieve your data before incurring more fees when you do a retrieval from Glacier--but if I store a backup of 150 GB of photos and music and fetch the whole thing, I'm not going to be able to get that from Amazon to my home computer in one day. Avoiding that problem will necessitate other expenditures.

In order to avoid the retrieval costs you'll have to limit retrievals to a small fraction of what you have stored. In my example, if I lose all my photos and music I'm going to want to restore the whole thing. Ignoring the transfer problem above, you're looking at paying for transferring 95% of the archive, 142.5 GB. I find Glacier's pricing model so difficult to comprehend I couldn't even guess at what that would cost, but Colin's math shows that what looks like it should cost $0.02 (retrieving 2 GB once at $0.01 per GB) winds up costing $3.60 (peak rate, percentage of archive fetched, etc.), so I wouldn't hold out a lot of hope for our use case of fetching the entire 150 GB archive.

As you say, this is certainly something businesses that need to archive lots of stuff may be able to use (if they can navigate the pricing structure); I just don't see a home user getting anything out of it but frustration and a confusing bill.


Assume I store 150 gigabytes of family photos. I pay $1.50 a month ($18 a year) for storage. Assume I've uploaded the files in 150 * 1 gigabyte archives.

I decide to retrieve it all. The 5% (6 gigabyte) retrieval allowance is negligible. The data transfer out fee at $0.120 per gigabyte will cost $18.

If I retrieve 150 * 1 gigabyte chunks every hour, retrieval will take 1 hour; the peak hourly retrieval will be 150 gigabytes; the data rate will be 341 Mbps; and the retrieval fee will be 150 * 720 * $0.01 = $1,080

If I retrieve 7 * 1 gigabyte chunks every hour, retrieval will take 150/7 ~= 22 hours; the data rate will be 15 Mbps; the peak hourly retrieval will be 7 gigabytes; and the retrieval fee will be 7 * 720 * $0.01 = $50.40

If I retrieve 1 * 1 gigabyte chunks every hour, retrieval will take ~7 days; the data rate will be 2.2 Mbps; the peak hourly retrieval will be 1 gigabyte; and the retrieval fee will be 1 * 720 * $0.01 = $7.20

If I share an account with 20 other people with the same amount of data stored, the 5% allowance would be enough for my entire download; I could retrieve as quickly as I liked without incurring a retrieval fee. I would still pay the $18 data transfer out fee.

TLDR: Retrieval isn't as cheap as storage, but if you lost all your family photos, you'd probably be willing to pay it.
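
For anyone who wants to rerun the numbers, this is the formula the scenarios above use (ignoring the free allowance, with the 2012 prices; a sketch, not a billing calculator):

    # Retrieval fee = peak hourly retrieval (GB) * 720 hours * $0.01, plus transfer out.
    RETRIEVAL_RATE = 0.01    # $ per GB of peak hourly retrieval, per hour in the month
    HOURS_PER_MONTH = 720
    TRANSFER_OUT = 0.12      # $ per GB transferred out of AWS

    def restore_cost(total_gb, gb_per_hour):
        hours = total_gb / gb_per_hour
        retrieval_fee = gb_per_hour * HOURS_PER_MONTH * RETRIEVAL_RATE
        transfer_fee = total_gb * TRANSFER_OUT
        return hours, retrieval_fee, transfer_fee

    print(restore_cost(150, 150))   # about (1 hour, $1,080, $18)
    print(restore_cost(150, 7))     # about (21.4 hours, $50.40, $18)
    print(restore_cost(150, 1))     # about (150 hours, $7.20, $18)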


Thanks for doing the math on this!


Indeed the pricing is difficult to estimate. I followed along with [1] in a spreadsheet to get a feel for it for my usage.

In my case I am looking at making a full backup of a 1 TB USB disk. Doing a full restore on a 20 Mbit/s connection (= 9 GB/hour) would probably take about 5 days. In that case I'll be paying $65 in retrieval fees for the full restore. (I'm hoping that includes bandwidth costs; that isn't clear to me from the FAQ.)

$65 for a full restore seems reasonable for something I expect to never need to do.

[1] http://aws.amazon.com/glacier/faqs/#How_will_I_be_charged_wh...


You've got a problem right away in that Glacier only keeps your files on the staging disk for 24 hours. Unless you're planning on splitting your disk into a set of separate archives, you're going to run into issues. And if you do use separate archives, you have to worry about the $0.05/thousand-requests problem.

You do still have to worry about bandwidth costs, $0.12 per GB, which adds another $120 to the cost of your restore: http://aws.amazon.com/glacier/pricing/

I can't speak for you but restores that cost $185 and take 5 days sound like a losing backup strategy for me.


For home use, it seems easier to put one or more extra external drives (encrypted with, say, TrueCrypt) at some trusted locations, like the houses of friends.


Great write-up. I've been curious about it, but I'm not particularly bothered. I think what I'm going to end up doing is to basically tar ~/media, pipe it through GPG, and pipe it through some upload-to-glacier script. Maybe automate that once a month or something, and then have tarsnap for much smaller/rapidly changing stuff. As it stands now I have 100 GB in ~/media that's costing me about $1/day, so it'd be nice to reduce that.


Cperciva,

Could a similar system to Tarsnap exist for Windows? I'm not asking you to implement it, I'm just curious.


Tarsnap runs without any problems under Cygwin/Windows right now. It is very easy to set up: the Tarsnap downloads page lists all the dependencies - you just need to make sure to include them in the Cygwin setup and then run his makefile. Frankly, his setup instructions are far easier to follow than certain other GUI-based installs that look shiny but induce far more stress trying to guess how to opt out of the complimentary crapware (e.g., Skype, Adobe Reader, utorrent...).

Also, as he notes on his install page, it _definitely_ is worth checking out the source code: it's some of the cleanest C source I've ever seen, and very educational.


Thank you for the info!!



