As I've always said, "You can never protect a system from a stupid person with r...

akerl_ · on May 27, 2014

If you think that "hire great sysadmins" prevents somebody from fatfingering, you must be hiring from some more evolved species. Nobody is immune to mistakes; preventing this kind of issue is something the infrastructure and procedures should do.

llamataboot · on May 27, 2014

I don't think "just hiring great sysadmins" is possible. People have off-days or are tired or sick, new people get on-boarded, even great people make mistakes, etc.

protomyth · on May 27, 2014

...or accidentally switch which of the 25 term sessions they had open

I tend to color my production terms in a red background / yellow font scheme. It tends to inspire the tired brain to understand you are in production.
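
A rough sketch of the same idea in code, for anyone who wants to automate it: wrap the hostname in a red background / yellow foreground ANSI sequence when the host looks like production. The `prod-` prefix rule and the `is_production` and `banner` names are assumptions made up for illustration; substitute whatever naming convention your fleet actually uses.

```python
import socket

# ANSI SGR codes: 41 = red background, 33 = yellow foreground, 0 = reset.
RED_BG_YELLOW_FG = "\033[41;33m"
RESET = "\033[0m"


def is_production(hostname: str) -> bool:
    # Hypothetical convention: production hosts are named "prod-...".
    return hostname.startswith("prod-")


def banner(hostname: str) -> str:
    """Return the hostname label, loudly colored if the host is production."""
    label = f" {hostname} "
    if is_production(hostname):
        return f"{RED_BG_YELLOW_FG}{label}{RESET}"
    return label


if __name__ == "__main__":
    # Print the label for the machine this runs on, e.g. from a login script.
    print(banner(socket.gethostname()))
```

Coloring only a label is weaker than recoloring the whole terminal, but it works in any emulator that understands ANSI escapes.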

incision · on May 27, 2014

Not only do you consider mistakes the province of stupid people doing dumb things, but you're crediting yourself with a proverb about it and suggesting that you possess the ability to sniff these people out from the 'great' ones.

Get a grip, you're recursively full of yourself.

SEJeff · on May 28, 2014

Wow HN seriously!? I never once pretended that I'm able to hire people who don't make mistakes, only that you can't protect systems from administrators who mess up.

Get a grip, people.

wmf · on May 27, 2014

So don't give anyone root on an entire data center.

akerl_ · on May 27, 2014

Is this like Captain Planet? It's a bit exceptional to divide access to servers of similar type between administrators such that individuals have full access to a portion of the fleet. Do they meet up and put their rings together to roll out updates? What if one of them goes on vacation?

lmm · on May 27, 2014

There are keysharing protocols; you can do something like 5 sysadmins have a split of the master key such that any 3 of them can access the master account.
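
For the curious, the usual construction behind a "any 3 of 5" split is Shamir's secret sharing. Below is a minimal toy sketch of it, assuming the master key can be represented as an integer smaller than the prime `2**127 - 1`; the function names are made up for illustration, and for real use you would reach for an audited tool rather than this.

```python
import secrets

# Field modulus: the Mersenne prime 2**127 - 1. Secrets must be smaller than this.
PRIME = 2**127 - 1


def split_secret(secret: int, threshold: int, shares: int) -> list[tuple[int, int]]:
    """Split `secret` into `shares` points; any `threshold` of them recover it."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret must be in the range [0, PRIME)")
    if not 1 <= threshold <= shares:
        raise ValueError("need 1 <= threshold <= shares")
    # Random polynomial of degree threshold - 1 with the secret as constant term.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def evaluate(x: int) -> int:
        # Horner's rule, reducing modulo PRIME at every step.
        acc = 0
        for c in reversed(coeffs):
            acc = (acc * x + c) % PRIME
        return acc

    # Each sysadmin gets one point (x, f(x)); x = 0 is never handed out.
    return [(x, evaluate(x)) for x in range(1, shares + 1)]


def recover_secret(points: list[tuple[int, int]]) -> int:
    """Recover the secret by Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


if __name__ == "__main__":
    pieces = split_secret(123456789, threshold=3, shares=5)
    # Any three of the five pieces are enough.
    print(recover_secret([pieces[0], pieces[2], pieces[4]]) == 123456789)
```

With fewer than `threshold` pieces the interpolation still returns a number, but it reveals nothing about the real secret, which is the point of the scheme.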

akerl_ · on May 27, 2014

For day-to-day maintenance of systems, that's crippling. If I need 2 cosigns to run "date" across the fleet while I'm troubleshooting an NTP issue, and then 2 cosigns again to run "service ntpd status", and so forth, my coworkers will have lit my desk on fire long before I fix the clocks.

There are definitely use cases for keysharing systems like you describe: if we're talking about getting access to a database with sensitive information, or signing a new cert that all our systems are about to put their full faith in. But for the day-to-day administrative efforts, it's overkill and ends up being counterproductive: after a certain point, Alice and Bob write scripts that let them hotkey signing off on my requests.

rodgerd · on May 27, 2014

I'm not worried about how crippling that sort of scenario is on a day-to-day basis, because presumably the company doesn't mind paying a fortune for a bunch of people to sit around to hold one another's keys.

I worry about those policies when the shit hits the fan and you're trying to fix a production problem hobbled by an inability to do stuff without three fingers on every keystroke.

akerl_ · on May 27, 2014

Agreed. Ideally, whatever system is in use for managing infrastructure provides sanity checks while I'm working, but either gets out of my way or can be sidestepped if need be. I don't want to be crippled by technical red tape when things are on fire.

aiiane · on May 27, 2014

"date" and service status don't typically require root.

richardkmichael · on May 27, 2014

I've not needed this, but it's a nice idea. Do you do this with a combination of sudo/PAM|pubkey auth? I can google, but can you push me off in the right direction? Thanks!

lmm · on May 28, 2014

I've not been directly involved, so your googling may well be as good as mine; on a quick look you might have to do this manually using ssss (and then each person encrypts their piece with gpg --symmetric or the like).

0xbadcafebee · on May 28, 2014

Actually, capabilities make it trivial to lock down things like shutdown for admin accounts. A script can do the shutdown instead in a more controlled and less error-prone fashion. Same for network device updates. Abstraction.
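
As a sketch of what such a wrapper script might look like: the version below refuses to shut down unless the operator types the hostname of the machine they are on, which catches the "wrong terminal" mistake. The confirmation step and the names `confirmed` and `guarded_shutdown` are assumptions for illustration, not anything specific to capabilities; the `ask` and `run` parameters exist only so the logic can be exercised without actually halting a box.

```python
import socket
import subprocess


def confirmed(typed: str, hostname: str) -> bool:
    """True only if the operator typed this machine's hostname."""
    return typed.strip().lower() == hostname.strip().lower()


def guarded_shutdown(hostname=None, ask=input, run=subprocess.run) -> bool:
    """Shut down only after the operator confirms which host they are on."""
    hostname = hostname or socket.gethostname()
    typed = ask(f"Type the hostname to confirm shutdown of {hostname}: ")
    if not confirmed(typed, hostname):
        print("Hostname mismatch; not shutting down.")
        return False
    # Needs sufficient privileges; check=True raises if the command fails.
    run(["shutdown", "-h", "now"], check=True)
    return True
```

The admin account would then be allowed to run this script but not the raw `shutdown` binary.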

mikeash · on May 27, 2014

That has its own risks. There might be some catastrophe that needs root access on everything to fix, and you can't reach enough people to get it....

vertex-four · on May 27, 2014

Then put the keys to datacenter-wide root somewhere safe (with a manual-ish process to access and use them), but out of the way and with alarms on it (the same alarms that you'd use in the absolute worst situation possible). Make sure anyone using it will be shamed if they don't absolutely have to.

akerl_ · on May 27, 2014

Shame is a terrible tool for ensuring compliance. The people you want to keep will resent the fact that you're using shame as a motivating factor.

jsmthrowaway · on May 27, 2014

If you think keys in a safe is a good idea, ask a Googler about the legend of the Valentine safe. Short version: nobody was able to get into the safe and a locksmith had to come drill it to restore a critical service.

It's also a cautionary tale about testing your DR occasionally.

Gravityloss · on May 27, 2014

You're going to need a bigger crew.