l9i's comments

Damien Miller


> And I think Google has a backup IRC server on AWS, but that might just be apocryphal.

They (>1) do exist but not on AWS or any other major cloud provider.

(Or at least that was the case two years ago when I still worked there.)


If you enjoy Organic Maps, please consider supporting them financially: https://organicmaps.app/donate/. Supporting their serving infrastructure for numerous users is surely costly.


Right now infrastructure appears to be donated, but donations increase how much effort developers can spend on it!


Most outdoor activities are safer and more pleasant when there is some sunlight (think: walking through a forest, hiking, swimming in a lake, sports, having a picnic, etc.).


The safe in question contained a smartcard required to boot an HSM. The safe combination was stored in a secret manager that depended on that HSM.

The engineer attempted to restart the service, but did not know that a restart required a hardware security module (HSM) smart card. These smart cards were stored in multiple safes in different Google offices across the globe, but not in New York City, where the on-call engineer was located. When the service failed to restart, the engineer contacted a colleague in Australia to retrieve a smart card. To their great dismay, the engineer in Australia could not open the safe because the combination was stored in the now-offline password manager.

Source: Chapter 1 of "Building Secure and Reliable Systems" (https://sre.google/static/pdf/building_secure_and_reliable_s... size warning: 9 MB)
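
The circular dependency in this story (the password manager needs the HSM, the HSM needs the smart card, the smart card is in the safe, and the safe combination is in the password manager) is the kind of loop that can be caught mechanically if recovery dependencies are written down. A minimal sketch, assuming a hypothetical hand-maintained dependency map; the names below are illustrative only and do not describe Google's actual systems or tooling:

```python
from typing import Dict, List, Optional

# Hypothetical map: each item lists what must be available to recover it.
RECOVERY_DEPS: Dict[str, List[str]] = {
    "password_manager": ["hsm"],
    "hsm": ["smart_card"],
    "smart_card": ["safe"],
    "safe": ["safe_combination"],
    "safe_combination": ["password_manager"],
}


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of nodes, or None if there is none."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GRAY
        path.append(node)
        for dep in deps.get(node, []):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # dep is already on the current path, so the loop is closed.
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for start in deps:
        if color.get(start, WHITE) == WHITE:
            cycle = visit(start)
            if cycle:
                return cycle
    return None


if __name__ == "__main__":
    cycle = find_cycle(RECOVERY_DEPS)
    print(" -> ".join(cycle) if cycle else "no circular recovery dependency")
```

A check like this only helps if the map is kept honest, including physical items such as safes and smart cards, which is exactly the kind of dependency that tends to be left out.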


Lovely.

Safes typically have the instructions on how to change the combination glued to the inside of the door, ending with something like "store the combination securely. Not inside the safe!"

But as they say: make something foolproof and nature will create a better fool.


I'm sure this sort of thing won't be a problem for a company whose founding ethos is 'move fast and break things.' O:-)


That's not quite how it happened. ;)

<shameless plug> We used this story as the opening of "Building Secure and Reliable Systems" (chapter 1). You can check it out for free at https://sre.google/static/pdf/building_secure_and_reliable_s... (size warning: 9 MB). </shameless plug>


I can assure you that Google has a procedure in place for that.


Unfortunately I cannot edit the parent comment anymore, but several people pointed out that I didn't back up my claim or provide any credentials, so here they are:

Google has multiple independent procedures for coordination during disasters. A global DNS outage (mentioned in https://news.ycombinator.com/item?id=28751140) was considered and has been taken into account.

I do not attempt to hide my identity here, quite the opposite: my HN profile contains my real name. Until recently a part of my job was to ensure that Google is prepared for various disastrous scenarios and that Googlers can coordinate the response independently from Google's infrastructure. I authored one of the fallback communication procedures that would likely be exercised today if Google's network experienced a global outage. Of course Google has a whole team of fantastic human beings who are deeply involved in disaster preparedness (miss you!). I am pretty sure they are going to analyze what happened to Facebook today in light of Google's emergency plans.

While this topic is really fascinating, I am unfortunately not at liberty to disclose the details as they belong to my previous employer. But when I stumble upon factually incorrect comments on HN that I am in a position to correct, why not do that?


In future news: Waymo outage results in engineers unable to get to data center. Engineers don't even know where their servers are.


Give us the dirt on how Google does its disaster planning exercises please! Do you do these exercises all at once or slowly over the year?


Interesting that you are asking for the dirt given that DiRT stands for Disaster and Recovery Testing, at least at Google.

Every year there is a DiRT week where hundreds of tests are run. That obviously requires a ton of planning that starts well in advance. The objective is, of course, that despite all the testing nobody outside Google notices anything special. Given the volume and intrusiveness of these tests, the DiRT team is doing quite an impressive job.

While the DiRT week is the most intense testing period, disaster preparedness is not limited to just one event per year. There are also plenty of tests conducted throughout the year, some planned centrally, some done by individual teams. That's in addition to the regular training and exercises that SRE teams do periodically.

If you are interested in reading more about Google's approach to disaster planning and preparedness, see the "DiRT, or how to get dirty" section of "Shrinking the time to mitigate production incidents—CRE life lessons" (https://cloud.google.com/blog/products/management-tools/shri...) and "Weathering the Unexpected" (https://queue.acm.org/detail.cfm?id=2371516).


Why not do both? ;)


Yup, they make a new chat app if the previous one is down.


Google Talk, Google Voice, Google Buzz, Google+ Messenger, Hangouts, Spaces, Allo, Hangouts Chat, and Google Messages.

At some point, they must run out of names, right?


You forgot Google Meet!


And Google Wave.


You forgot the chat boxes inside other apps like Google Docs, Gmail, YouTube, etc.


And Google Pay, apparently.


> Yup, they make a new chat app if the previous one is down.

Continuous Deployment.


For those who don't know who he is: l9i would know this. Just clarifying that this is not an Internet nobody guessing.


He is still an anonymous dude to me.


HN Profile -> Personal Website -> LinkedIn -> Over 10 years experience as Google Site Reliability Engineer


Is the LinkedIn profile linking back to the hn account?


Security Engineer asking?


Ha, no. It just occurred to me that any random Hacker News account could link to somebody's personal account and claim authority on some subject.


Google SRE for 10 years, ending as the Principal Site Reliability Engineer (L8).


s/the//

Google has more than 1 L8 SRE.


I don't know who either he or you are, so...


I was clarifying his comment, since he didn't mention that this is not a guess, but inside knowledge.

I was not trying to establish a trust chain.

Take from it what you will.


Why does it matter if he's guessing or not?


Because, it may shock you to know, but sometimes people just go on the Internet and tell lies.

No shit Google has plans in place for outages.

But what are these plans, and are they any good? A respected industry figure whose CV includes 10 years at Google doesn't need to describe the IRC fallback in detail to be believed when he says that such a thing exists.


I've found that when I post things I learned on the job here it actually causes people to tell me I'm wrong or made it up even more often…


It's kind of amusing given that employers are usually pretty easy to deduce based on comments…


That's just an 'appeal to authority'.

No-one knows or cares who made the statement, it may as well have been 'water is wet', it was useless and adds nothing but noise.


I found a comment that was factually incorrect and I felt competent to comment on that. Regrettably, I wrote just one sentence and clicked reply without providing any credentials to back up my claim. Not that I try to hide my identity: as danhak pointed out in https://news.ycombinator.com/item?id=28751644, my full name and the URL of my personal website are only a click away.

I have replied to my initial comment to provide some additional context: https://news.ycombinator.com/edit?id=28752431. Hope that helps.


That’s…not what “appeal to authority” means.


I've read here on HN that exactly this was the issue when they had one of the bigger outages (I think it was due to some auth service failure) and Gmail didn't accept incoming mail.


A Gmail outage would be barely an inconvenience as Gmail plays a minor role in Google's disaster response.

Disclaimer: Ex-Googler who used to work on disaster response. Opinions are my own.


If you find Andreas' work inspiring and/or useful, you may consider supporting him in pursuing his passion:

- https://github.com/sponsors/awesomekling/

- https://www.patreon.com/serenityos

- https://www.paypal.com/paypalme/awesomekling

(source: https://awesomekling.github.io/about/)


Confirmed. I was hoping for the platypus but it turns out it was already taken anyway.

You can check out the whole O'Reilly menagerie at https://www.oreilly.com/animals.csp.

(disclaimer: I worked on the book)


In a somewhat snarky reply, I can assure you that the book release was planned long in advance, unlike the outages. ;)

(disclaimer: I worked on the book)


Ah, sorry then, my bad.

