It's operations. You fuck up, you suck it up, you fix it, then (and this is the important part) you prevent it from ever happening again. Feeling like shit for bringing something down is a good way to give yourself depression, given how often you will screw the pooch with root. In the same vein, anybody who says they'd fire the operator without any qualification on that remark should be given a wide berth.
People tend to forget that "fixing it" isn't just technical, it involves process, too. Every new hire that whines about change control and downtime windows would be the first to suggest them, were they troubleshooting the outage that demonstrated the need.
Nonsense. Someone has to be operating at the sharp end of the enable prompt, and sooner or later it'll be 0330 and that person will type Ethernet0 when they meant Ethernet1, no matter what change management you have in place.
When that happens, you do just what Joyent did here: you send out an embarrassed email to customers, everyone else in the ops team gets a few cheap laughs at the miscreant's expense, you have a meeting about it, discuss lessons learned, and you move on.
Everyone screws up. Everything goes down once in a while. This is why you build in redundancy at every level.
I've seen generally brilliant people get bitten by bad process. The worst example was an important hard drive being wiped thanks to a lack of labeling, which of course took a production server down with it.
Other things that have caused outages: lack of power capacity planning, unplugging an unrelated test server from the network (go go gadget BGP), cascading backup power failure, building maintenance taking down AC units, expensive equipment caching ARP replies indefinitely… the list goes on.
I had my own fun fuckup too. I learned SQL on PostgreSQL, and had to fix a problem with logged data in a MySQL database. Not trusting myself, I typed "BEGIN;" to enter a transaction, ran my update, and queried the table to check my results. I noticed my update did more than I expected, so I entered "ROLLBACK;" only to learn that MyISAM tables don't actually implement transactions.
Thankfully, in this case it turned out to be possible to undo the damage, but talk about a heart-stopping moment!
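The trap is easy to sketch. This is a hypothetical session (the database and table names are made up); the real lesson is to check the storage engine before you trust a transaction:

```sql
-- Check which engine backs your tables before relying on transactions:
SELECT TABLE_NAME, ENGINE
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'mydb';

-- On a MyISAM table, this "transaction" offers no protection:
BEGIN;
UPDATE log_entries SET level = 'WARN' WHERE level = 'WARNING';
ROLLBACK;  -- the UPDATE is already permanent; MySQL only emits a
           -- warning that non-transactional tables couldn't be rolled back
```

On InnoDB the same ROLLBACK would have undone the update, which is exactly why the behavior above is so easy to miss if you learned on a database where transactions always work.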
Shit happens. You deal with it, then do what you can to keep it from happening again. I've learned to respect early morning change windows as a way to limit damage caused by mistakes.