This incident makes me think that services like Redis should support running with two sets of credentials at once in order to facilitate credential rolling. As it currently stands, rolling credentials is a rather big deal with a chance of things going wrong in the process.
Aside: the text on that page is extremely difficult to read because of poor contrast (#8584B2 on #282936) and might be impossible for people with vision impairment. If anyone from Heroku is reading, you should change the color scheme to be compliant with the W3C Web Content Accessibility Guidelines. See: http://www.snook.ca/technical/colour_contrast/colour.html
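For anyone who wants to check the contrast claim rather than eyeball it, here is a minimal sketch of the WCAG 2.x contrast-ratio calculation applied to the two colors quoted above. The function names are my own; the formula is the relative-luminance definition from the guidelines, where AA asks for at least 4.5:1 for normal-size text and 3:1 for large text.

```python
def _linearize(channel: int) -> float:
    """Convert an 8-bit sRGB channel to its linear-light value (WCAG 2.x definition)."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a color given as '#RRGGBB'."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(foreground: str, background: str) -> float:
    """Contrast ratio between two colors; always >= 1, and order does not matter."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


if __name__ == "__main__":
    ratio = contrast_ratio("#8584B2", "#282936")
    print(f"contrast ratio: {ratio:.2f}:1")
    print("meets AA for normal text:", ratio >= 4.5)
    print("meets AA for large text:", ratio >= 3.0)
```

Note that this only measures the ratio the guidelines define; it says nothing about whether a given reader finds high or low contrast more comfortable.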
I thought the exact same thing, but then I realized that we don’t need Redis to actually support a choice of two passwords for a single account. Rather, clients can be configured with a list of credentials to try. When rolling credentials over, simply add the new ones to the clients’ lists, update the service, then remove the old ones from the clients. Then you can wait hours or days between steps for safety, and there is no time when system reliability is degraded by a service instance being inaccessible.
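A rough sketch of what that client-side list could look like. Everything here is hypothetical: `AuthError` and the `connect` callable stand in for whatever your real client library provides, and a real client would catch that library's own authentication error instead.

```python
from typing import Callable, Iterable, Optional, TypeVar

Conn = TypeVar("Conn")


class AuthError(Exception):
    """Hypothetical error: the service rejected the credential it was given."""


def connect_with_fallback(
    credentials: Iterable[str],
    connect: Callable[[str], Conn],
) -> Conn:
    """Try each configured credential in order and return the first connection that works.

    `connect` is an assumed callable that opens a connection with one credential
    and raises AuthError if the service rejects it.
    """
    last_error: Optional[AuthError] = None
    for credential in credentials:
        try:
            return connect(credential)
        except AuthError as exc:
            # Rejected: remember why and move on to the next credential in the list.
            last_error = exc
    raise AuthError("no configured credential was accepted") from last_error
```

During a roll, the client's list would go from `[old]` to `[old, new]`, then the service is switched to `new`, and finally the list is trimmed to `[new]`. At every step at least one entry in the list is valid, so the order of entries only affects how many attempts a connection takes.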
To your aside: I for one really appreciate Heroku's color scheme. I find high contrast color schemes very fatiguing, and prefer light-ish on dark-ish color schemes like Heroku's. I don't really believe in a universally ideal color scheme. I think we should instead focus on building and supporting tooling to help people adapt content to their needs.
Anyone with serious visual impairments almost certainly already has a user stylesheet installed in their browser of choice.
I am a person with reasonably decent vision who is fatigued by high-contrast color schemes. I greatly appreciate grey-on-black and other such color schemes that make it so I'm not staring into a lightbulb for ten-to-twelve hours a day. (Yes, my monitor brightness is set to a reasonable level. Yes, black on white is still far more fatiguing than white on black.)
Because I so rarely feel compelled to say this: this is a really great post-mortem. It's technical, it's not loaded down with sales-speak, and it's straightforward. I really hope post-mortems like this become more of a trend.
This paragraph reads like a response to the criticism they received a few days ago for scheduling maintenance at 2pm PST:
> On June 23rd we performed a credential roll on these Redis servers in our US cloud during a two hour scheduled maintenance window. Because we operate a service used globally, there is a less-than 10% difference in usage between so-called "peak hours" and "non-peak" hours. We scheduled maintenance for this time because it was not a peak time, but moreso because this period has high coverage from relevant engineering teams, should issues arise. By performing maintenance during this period, we were able to react more quickly and muster those teams within seconds.
So, doing maintenance on US servers, on a Friday, during a timeframe when the US is getting off work is not a peak period of time for those servers? That sounds a bit like using numbers to lie, or at least to minimize the impact.
I can see how it might have been a lull when considering all Heroku instances (since most of Europe is headed off to bed), but I have the feeling this was not the case for the US servers. I could certainly be wrong, since I'm just speculating, but it just seems fishy.
Heroku only has two regions, Europe and US. Many applications not targeting the American market are still hosted in the US region. For instance, I live in Japan, and all the Heroku applications I know of are hosted in the US region. Additionally, as I understand it, most of Heroku's resources are used by a small number of large, international services, whose usage times don't necessarily correspond to US hours.
> We are reviewing our internal processes to ensure that communication between groups is more effective, so that we can better inform our customers when situations occur.
I see this as the only contentious point raised by some of their users. They are doing an outstanding job already at dealing with a large infrastructure running a wide range of heterogeneous applications. They likely run updates on their infrastructure on a regular basis, without anybody noticing.
However, if you're selling me on the promise of taking care of infrastructure for me, you can't under-deliver on communicating as soon as s**t hits the fan.
I've been developing a Node site that is currently running on Heroku. This happened the first day after launch, and to say the least my blood pressure was through the roof all day. I was terrified that if something went wrong, we would be dead in the water. I don't think I would deal with them again (if I had the chance).
I am curious as to why they were relying on rolling Redis credentials at all since they would have needed to pre-arrange a secure channel for Redis traffic anyway.