
Just imagine the amount of stress on these people; I hope the money is really worth it.


It shouldn't be too stressful. Well-managed companies blame processes rather than people, and have systems set up to communicate rapidly when large-scale events occur.

It can be sort of exciting, but it's not like there is one person typing at a keyboard with a hundred managers breathing down their neck. These resolutions are collaborative, shared efforts.


> It can be sort of exciting, but it's not like there is one person typing at a keyboard with a hundred managers breathing down their neck.

As someone who formerly did Ops for many, many years... this is not accurate. Even in a well-organized company there are usually stakeholders at every level on incident-management calls so that they don't need to play "telephone" for status. For an incident of this size, it wouldn't be unusual to have C-level executives on the call.

While those managers are mostly just quietly listening in on mute if they know what's good (e.g. don't distract the people doing the work to fix your problem), their mere presence can make the entire situation more tense and stressful for the person banging keyboards. If they decide to be chatty or belligerent, it makes everything 100x worse.

I don't envy the SREs at Facebook today. Godspeed fellow Ops homies.


I think it comes down to the comfort level of the worker. I remember when our production environment went down. The CTO was sitting with me just watching and I had no problem with it since he was completely supportive, wasn't trying to hurry me, just wanted to see how the process of fixing it worked. We knew it wasn't any specific person's fault, so no one had to feel the heat from the situation beyond just doing a decent job getting it back up.


C levels don't sit on the call with engineers. They aren't that dumb. Managers will communicate upward.


That greatly depends on the incident and the organization. I’ve personally been on numerous incident calls with C-level folks involved.


Yeah hell, I've ended up with one of the big names as my comms lead.

That in itself was stressful, and became an example case later.


"it's not like there is one person typing at a keyboard with a hundred managers breathing down their neck. These resolutions are collaborative, shared efforts"

Well, you'd be surprised about how one person can bring everything down and/or save the day at Facebook, Cloudflare, Google, Gitlab, etc. Most people are observers/cheerleaders when there is an incident.


> Most people are observers/cheerleaders when there is an incident.

Yeah, a typical fight/flight response.


Or most people simply don't have anything useful to add or do during an incident.


Taking all the available slots in the massive gvc warroom ain't much... but it's honest work.


Well, individuals will still stress, if anything, due to the feeling of being personally responsible for inflicting damage.

I know someone who accidentally added a rule 'reject access to * for all authenticated users' in some stupid system where the ACL ruleset itself was covered by this *, and this person nearly collapsed when she realized even admins were shut out of the system. It required getting low-level access to the underlying software to reverse engineer its ACLs and hack into the system. Major financial institution. Shit like this leaves people with actual trauma.

As much as I hate fb, I really feel for the net ops guys trying to figure it all out, with the whole world watching (most of it with schadenfreude).


As one of the major responders to an incident analogous to this at a different FAANG... you're high; it's still hella stressful.


> It shouldn't be too stressful. (...) it's not like there is one person typing at a keyboard with a hundred managers breathing down their neck

Earlier comment mentioned that there is a bottleneck, and that people who are physically able to solve the issue are few and that they need to be informed what to do; being one of these people sounds pretty stressful to me.

"but the people with physical access is separate (...) Part of this is also due to lower staffing in data centers due to pandemic measures", source: https://news.ycombinator.com/item?id=28749244


Sure, but that's what conference calls are for.

Most big tech companies automatically start a call for every large scale incident, and adjacent teams are expected to have a representative call in and contribute to identifying/remediating the issue.

None of the people with physical access are individually responsible, and they should have a deep bench of advice and context to draw from.


I'm not an IT Operations guy, but as a dev I always thought it was exciting when the IT guys had the destiny of the firm on their shoulders. It must be exciting.


You tend not to think about it…

Most teams that handle incidents have well documented incident plans and playbooks. When something major happens you are mostly executing the plan (which has been designed and tested). There are always gotchas that require additional attention / hands but the general direction is usually clear.


>Well-managed companies

To what extent does this include Facebook?


> Well-managed companies blame processes rather than people,

We're six hours without a route to their network, and counting. I think we can safely rule out well-managed.


> Well-managed companies blame processes rather than people

I feel like this just obfuscates the fact that individuals are ultimately responsible, and allows subpar employees to continue existing at an organization when their position could be filled by a more qualified employee. (Not talking about this Facebook incident in particular, but as a generalisation: not attributing individual fault allows faulty employees to thrive at the expense of more qualified ones).


> this just obfuscates the fact that individuals are ultimately responsible

in critical systems, you design for failure. if your organizational plan for personnel failure is that no one ever makes a mistake, that's a bad organization that will forever have problems.

this goes by many names, like the swiss cheese model[0]. it's not that workers get to be irresponsible, but that individuals are responsible only for themselves, and the organization is the one responsible for itself.

[0] https://en.wikipedia.org/wiki/Swiss_cheese_model


> is that no one ever makes a mistake

This isn't what I'm saying, though. The thought I'm trying to express is that if there is no individual accountability, it allows employees who are not as good at their job (read: sloppy) to continue to exist in positions which could be better occupied by employees who are better at their job (read: more diligent).

The difference between having someone who always triple-checks every parameter they input, versus someone who never double-checks and just wings it. Sure, the person who triple-checks will make mistakes, but fewer than the other person. This is the issue I'm trying to get at.


If someone is sloppy and not willing to change, they should be shown the door, but not because they caused an outage, because they are sloppy.

People who operate systems under fear tend to do stupid things like covering up innocent actions (deleting logs), hoarding information instead of sharing it, etc. Very few people can operate complex systems for a long time without making a mistake. An organization where the spirit is "oh, an outage, someone is going to pay for that" will never be attractive to good people, and will have a hard time adapting to changes and adopting new tech.


> The difference between having someone who always triple-checks every parameter they input, versus someone who never double-checks and just wings it. Sure, the person who triple-checks will make mistakes, but less than the other person. This is the issue I'm trying to get at.

If you rely on someone triple-checking, you should improve your processes. You need better automation/rollback/automated testing to catch things. Eventually only intentional failure should be the issue (or you'll discover interesting new patterns that should be protected against)


If there is an incident because an employee was sloppy, the fault lies with the hiring process, with the evaluation process for this employee, or with the process that puts four eyes on each implementation. The employee fucked up, and they should be removed if they are not up to standards, but putting the blame on them does not prevent the same thing from happening in the future.


If you think about it, it isn't very useful to find a person who is responsible. Suppose someone causes an outage or harm, through neglect or even bad intentions: either the system is set up in a way that one person couldn't cause the outage, or sooner or later it will go down. To build a truly resilient system, especially on a global scale, there should never be an option for a single person to bring down the whole system.


By focusing on the process, lessons are learned and systems are put in place which leads to a cycle of improvement.

When individuals are blamed instead, a culture of fear sets in and people hide / cover up their mistakes. Everybody loses as a result.


I don't think the comment you're replying to applies to your concern about subpar employees.

We blame processes instead of people because people are fallible. We've spent millennia trying to correct people, and it rarely works to a sufficient level. It's better to create a process that makes it harder for humans to screw up.


Yes, absolutely, people make mistakes. But the thought I was trying to convey is that some people make a lot more mistakes than others, and by not attributing individual fault these people are allowed to thrive at the cost of having less error-prone people in their position. For example, someone who triple-checks every parameter that they input, versus someone who has a habit of just skimming or not checking at all. Yes, the triple-checker will make mistakes too, but far fewer than the person who puts less effort in.


But that has nothing to do with blaming processes vs people.

If the process in place means that someone has to triple check their numbers to make sure they’re correct, then it’s a broken process. Because even that person who triple checks is one time going to be woken up at 2:30am and won’t triple check because they want sleep.

If the process lets you do something, then someone at some point in time, whether accidentally or maliciously, will cause that to happen. You can discipline that person, and they certainly won’t make the same mistake again, but what about their other 10 coworkers? Or the people on the 5 sister teams with similar access who didn’t even know the full details of what happened?

If you blame the process and make improvements to ensure that triple checking isn’t required, then nobody will get into the situation in the first place.

That is why you blame the process.


Yeah, I've heard this view a hundred times on Twitter, and I wish it were true.

But sadly, there is no company which doesn't rely, at least at one point or another, on a human being typing an arbitrary command or value into a box.

You're really coming up against P=NP here. If you can build a system which can auto-validate or auto-generate everything, then that system doesn't really need humans to run at all. We just haven't reached that point yet.

Edit: Sorry, I just realised my wording might imply that P does actually equal NP. I have not in fact made that discovery. I meant it loosely to refer to the problem, and to suggest that auto-validating these things is at least not much harder than auto-executing them.


I don’t think anyone ever claimed the process itself is perfect. If it were, we obviously would never have any issues.

To be explicit here, by blaming the process, you are discovering and fixing a known weakness in the process. What someone would need to triple check for now, wouldn’t be an issue once fixed. That isn’t to say that there aren’t any other problems, but it ensures that one issue won’t happen again, regardless of who the operator is.

If you have to triple check that value X is within some range, then that can easily be automated to ensure X can’t be outside of said range. Same for calculations between inputs.
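The range check described above is easy to encode. A minimal sketch in Python, with a hypothetical config value (`ttl_seconds`) and made-up bounds purely for illustration:

```python
# Hypothetical example: turn a manual "triple-check" rule into an automated
# guard, so a config value can never be applied outside its allowed range.

MIN_TTL, MAX_TTL = 60, 86_400  # assumed bounds, for illustration only


def validate_ttl(ttl_seconds: int) -> int:
    """Reject TTLs outside the range an operator would have checked by hand."""
    if not (MIN_TTL <= ttl_seconds <= MAX_TTL):
        raise ValueError(
            f"TTL {ttl_seconds}s outside allowed range [{MIN_TTL}, {MAX_TTL}]"
        )
    return ttl_seconds


# The guard fails loudly before the change ships, regardless of who typed it.
validate_ttl(3600)  # fine
try:
    validate_ttl(5)  # too low; a tired operator can't push this through
except ValueError:
    pass
```

The point is that the rule the careful operator was applying in their head now lives in the process itself, where it applies to everyone.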

To take the overly simplistic triple check example from before, said inputs that need to be triple checked are likely checked based on some rule set (otherwise the person themselves wouldn’t know if it was correct or not). Generally speaking, those rules can be encoded as part of the process.

What was before potentially “arbitrary input” now becomes an explicit set of inputs with safeguards in place for this case. The process became more robust, but is not infallible.

But if you were to blame people, the process still takes arbitrary input, the person who messed up will probably validate their inputs better but that speaks nothing of anyone else on the team, and two years down the line where nobody remembers the incident, the issue happens again because nothing really has changed.


The issue is that this view always relies on stuff like "make people triple check everything".

- How does that relate to making a config change?

- How do you practically implement a system where someone has to triple check everything they do?

- How do you stop them just clicking 'confirm' three times?

- Why do you assume they will notice on the 2nd or 3rd check, rather than just thinking "well, I know I wrote it correctly, so I'll just click confirm"?

I don't think rules can always be encoded in the process, and I don't see how such rules will always be able to detect all errors, rather than only a subset of very obvious errors.

And that's only dealing with the simplest class of issues. What about a complex distributed systems problem? What about the engineer who doesn't make their system tolerant of Byzantine faults? How is any realistic 'process' going to prevent that?

This entire trope relies on the fundamental axiom that "for any individual action A, there is a process P which can prevent human error". I just don't see how that's true.

(If the statement were something like "good processes can eliminate whole classes of error, and reduce the likelihood of incidents", I'd be with you all the way. It's this Twitter trope of "if you have an incident, it's a priori your company's fault for not having a process to prevent it" which I find to be silly and not even nearly proven.)


> and allows subpar employees to continue existing at an organization when their position could be filled by a more qualified employee.

Not really, their incompetence is just noticed earlier at the review/testing stages instead of in production incidents.

If something reaches production that's no longer the fault of one person, it's the fault of the process and that's what you focus on.


The stress for me usually goes away once the incident is fully escalated and there's a team with me working on the issue. I imagine that happened quite quickly in this case...


Exactly. The primary focus in situations like this is to ensure that no one feels like they are alone, even if in the end it is one person who has to type in the right commands.

Always be there, help them double check, help monitor, help make the calls to whomever needs to be informed, help debug. No one should ever be alone during a large incident.


This is a one-off event, not a chronic stress trigger. I find them invigorating personally, as long as everybody concerned understands that this is not good in the long run, and that you are not going to write your best code this way.



