l9i · on July 1, 2024

Damien Miller

l9i · on Oct 27, 2023

> And I think Google has a backup IRC server on AWS, but that might just be apocryphal.

They (>1) do exist but not on AWS or any other major cloud provider.

(Or at least that was the case two years ago when I still worked there.)

l9i · on Sept 1, 2023

If you enjoy Organic Maps, please consider supporting them financially: https://organicmaps.app/donate/. Supporting their serving infrastructure for numerous users is surely costly.

matkoniecz · on Sept 1, 2023

Right now infrastructure appears to be donated, but donations increase how much effort developers can spend on it!

l9i · on Nov 1, 2021

Most outdoor activities are safer and more pleasant when there is some sunlight (think: walking through a forest, hiking, swimming in a lake, sports, having a picnic, etc.).

l9i · on Oct 4, 2021

The safe in question contained a smartcard required to boot an HSM. The safe combination was stored in a secret manager that depended on that HSM.

The engineer attempted to restart the service, but did not know that a restart required a hardware security module (HSM) smart card. These smart cards were stored in multiple safes in different Google offices across the globe, but not in New York City, where the on-call engineer was located. When the service failed to restart, the engineer contacted a colleague in Australia to retrieve a smart card. To their great dismay, the engineer in Australia could not open the safe because the combination was stored in the now-offline password manager.

Source: Chapter 1 of "Building Secure and Reliable Systems" (https://sre.google/static/pdf/building_secure_and_reliable_s... size warning: 9 MB)
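The failure described above is a circular dependency: restoring the password manager needed something that could only be reached through the password manager. As a minimal sketch (not anything Google is known to run), such loops can be found by walking a dependency graph with a depth-first search. The component names in `RECOVERY_DEPS` and the function `find_cycle` are hypothetical, chosen only to mirror the story.

```python
from typing import Dict, List, Optional

# Hypothetical model of the story: each key needs every item in its list.
RECOVERY_DEPS: Dict[str, List[str]] = {
    "password_manager": ["hsm"],
    "hsm": ["smart_card"],
    "smart_card": ["safe"],
    "safe": ["safe_combination"],
    "safe_combination": ["password_manager"],
}


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of nodes, or None if acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {node: WHITE for node in deps}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GRAY
        path.append(node)
        for dep in deps.get(node, []):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # dep is already on the current path, so we closed a loop.
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in list(deps):
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


cycle = find_cycle(RECOVERY_DEPS)
if cycle:
    print(" -> ".join(cycle))
```

Breaking any one edge, for example by keeping the safe combination somewhere that does not depend on the password manager, makes `find_cycle` return `None`.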

brazzy · on Oct 4, 2021

Lovely.

Safes typically have the instructions on how to change the combination glued to the inside of the door, and ending with something like "store the combination securely. Not inside the safe!"

But as they say: make something foolproof and nature will create a better fool.

anigbrowl · on Oct 4, 2021

I'm sure this sort of thing won't be a problem for a company whose founding ethos is 'move fast and break things.' O:-)

l9i · on Oct 4, 2021

That's not quite how it happened. ;)

<shameless plug> We used this story as the opening of "Building Secure and Reliable Systems" (chapter 1). You can check it out for free at https://sre.google/static/pdf/building_secure_and_reliable_s... (size warning: 9 MB). </shameless plug>

l9i · on Oct 4, 2021

I can assure you that Google has a procedure in place for that.

l9i · on Oct 4, 2021

Unfortunately I cannot edit the parent comment anymore, but several people pointed out that I didn't back up my claim or provide any credentials, so here they are:

Google has multiple independent procedures for coordination during disasters. A global DNS outage (mentioned in https://news.ycombinator.com/item?id=28751140) was considered and has been taken into account.

I do not attempt to hide my identity here, quite the opposite: my HN profile contains my real name. Until recently a part of my job was to ensure that Google is prepared for various disastrous scenarios and that Googlers can coordinate the response independently from Google's infrastructure. I authored one of the fallback communication procedures that would likely be exercised today if Google's network experienced a global outage. Of course Google has a whole team of fantastic human beings who are deeply involved in disaster preparedness (miss you!). I am pretty sure they are going to analyze what happened to Facebook today in light of Google's emergency plans.

While this topic is really fascinating, I am unfortunately not at liberty to disclose the details as they belong to my previous employer. But when I stumble upon factually incorrect comments on HN that I am in a position to correct, why not do that?

grayfaced · on Oct 5, 2021

In future news: Waymo outage results in engineers unable to get to data center. Engineers don't even know where their servers are.

shemnon42 · on Oct 4, 2021

Give us the dirt on how Google does its disaster planning exercises please! Do you do these exercises all at once or slowly over the year?

l9i · on Oct 5, 2021

Interesting that you are asking for the dirt given that DiRT stands for Disaster and Recovery Testing, at least at Google.

Every year there is a DiRT week where hundreds of tests are run. That obviously requires a ton of planning that starts well in advance. The objective is, of course, that despite all the testing nobody outside Google notices anything special. Given the volume and intrusiveness of these tests, the DiRT team is doing quite an impressive job.

While the DiRT week is the most intense testing period, disaster preparedness is not limited to just one event per year. There are also plenty of tests conducted throughout the year, some planned centrally, some done by individual teams. That's in addition to the regular training and exercises that SRE teams are doing periodically.

If you are interested in reading more about Google's approach to disaster planning and preparedness, see the "DiRT, or how to get dirty" section of "Shrinking the time to mitigate production incidents—CRE life lessons" (https://cloud.google.com/blog/products/management-tools/shri...) and "Weathering the Unexpected" (https://queue.acm.org/detail.cfm?id=2371516).

donalhunt · on Oct 5, 2021

Why not do both? ;)

ric2b · on Oct 4, 2021

Yup, they make a new chat app if the previous one is down.

gadnuk · on Oct 4, 2021

Google Talk, Google Voice, Google Buzz, Google+ Messenger, Hangouts, Spaces, Allo, Hangouts Chat, and Google Messages.

At some point, they must run out of names, right?

andrepd · on Oct 4, 2021

You forgot google meet!

darkhorn · on Oct 4, 2021

And Google Wave.

londons_explore · on Oct 4, 2021

You forgot the chat boxes inside other apps like Google docs, Gmail, YouTube, etc.

scatters · on Oct 4, 2021

And Google Pay, apparently.

mr_toad · on Oct 5, 2021

> Yup, they make a new chat app if the previous one is down.

Continuous Deployment.

knorker · on Oct 4, 2021

For those who don't know who he is: l9i would know this. Just clarifying that this is not an Internet nobody guessing.

sam_lowry_ · on Oct 4, 2021

He is still an anonymous dude to me.

danhak · on Oct 4, 2021

HN Profile -> Personal Website -> LinkedIn -> Over 10 years experience as Google Site Reliability Engineer

ant6n · on Oct 4, 2021

Is the LinkedIn profile linking back to the hn account?

sam_lowry_ · on Oct 5, 2021

Security Engineer asking?

ant6n · on Oct 5, 2021

Ha, no. It just occurred to me that any random Hacker News account could link to somebody's personal account and claim authority on some subject.

e1g · on Oct 4, 2021

Google SRE for 10 years, ending as the Principal Site Reliability Engineer (L8).

sulam · on Oct 4, 2021

s/the//

Google has more than 1 L8 SRE.

jaywalk · on Oct 4, 2021

I don't know who either he or you are, so...

knorker · on Oct 4, 2021

I was clarifying his comment, since he didn't mention that this is not a guess, but inside knowledge.

I was not trying to establish a trust chain.

Take from it what you will.

astrange · on Oct 4, 2021

Why does it matter if he's guessing or not?

fragmede · on Oct 4, 2021

Because, it may shock you to know, but sometimes people just go on the Internet and tell lies.

No shit Google has plans in place for outages.

But what are these plans, are they any good... a respected industry figure whose CV includes being at Google for 10 years doesn't need to go into detail describing the IRC fallback to be believed and trusted that there is such a thing.

astrange · on Oct 5, 2021

I've found that when I post things I learned on the job here it actually causes people to tell me I'm wrong or made it up even more often…

saagarjha · on Oct 5, 2021

It's kind of amusing given that employers are usually pretty easy to deduce based on comments…

new_guy · on Oct 4, 2021

That's just an 'appeal to authority'.

No-one knows or cares who made the statement, it may as well have been 'water is wet', it was useless and adds nothing but noise.

l9i · on Oct 4, 2021

I found a comment that was factually incorrect and I felt competent to comment on that. Regrettably, I wrote just one sentence and clicked reply without providing any credentials to back up my claim. Not that I try to hide my identity: as danhak pointed out in https://news.ycombinator.com/item?id=28751644, my full name and the URL of my personal website are only a click away.

I have replied to my initial comment to provide some additional context: https://news.ycombinator.com/edit?id=28752431. Hope that helps.

heartbreak · on Oct 4, 2021

That’s…not what “appeal to authority” means.

still_grokking · on Oct 4, 2021

I've read here on HN that exactly this was the issue when they had one of the bigger outages (I think it was due to some auth service failure) and Gmail didn't accept incoming mail.

l9i · on Oct 4, 2021

A Gmail outage would be barely an inconvenience as Gmail plays a minor role in Google's disaster response.

Disclaimer: Ex-Googler who used to work on disaster response. Opinions are my own.

l9i · on May 28, 2021

If you find Andreas' work inspiring and/or useful, you may consider supporting him in pursuing his passion:

- https://github.com/sponsors/awesomekling/

- https://www.patreon.com/serenityos

- https://www.paypal.com/paypalme/awesomekling

(source: https://awesomekling.github.io/about/)

l9i · on April 9, 2020

Confirmed. I was hoping for the platypus but it turns out it was already taken anyway.

You can check out the whole O'Reilly menagerie at https://www.oreilly.com/animals.csp.

(disclaimer: I worked on the book)

l9i · on April 9, 2020

In a somewhat snarky reply, I can assure you that the book release was planned long in advance, unlike the outages. ;)

(disclaimer: I worked on the book)

jxub · on April 9, 2020

Ah, sorry then, my bad.