My Nagios-awoken 3 AM brain finds no fault in your logic.

rdl · on March 15, 2012

Even funnier are people doing server monitoring of (things in EC2) from within EC2. When the EC2 outage happens, there's obviously no problem because no alerts get sent...

Doh!

cperciva · on March 15, 2012

For some people, that might be fine. If you don't have plans for how to rapidly move out of EC2, you might as well just sleep through an all-of-EC2-goes-down outage for all you can do about it.

rdl · on March 15, 2012

You should at least know there is an outage to have something to tell your downstream customers. It is really embarrassing to have a customer (or your boss) call to report an outage you don't yet know about, even if there is fuck all you can do to resolve it. Basic principle of ops.

cperciva · on March 15, 2012

I wasn't being entirely serious. :-)

zackattack · on March 15, 2012

> Basic principle of ops

For my benefit, what are some others?

rdl · on March 15, 2012

This would actually be an interesting blog post.

pjscott · on March 15, 2012

This is why my sleepy 3AM brain was awoken by Pingdom. Hooray for having just enough redundancy to tell you that it's not quite enough.

Good night.

mickeymoose · on March 15, 2012

Me, I use specific load balancing for my traffic when a cloud outage is detected. And I sleep perfectly ;-)

sghael · on March 15, 2012

Could you give a little more detail on your setup? I'm curious how others are designing around these issues.

flojibi · on March 15, 2012

If my case can help: my company uses another company's service to load-balance traffic across multiple CDNs and clouds. We are no longer impacted when some providers fail. You can read this: http://tinyurl.com/7pwfza7 (I'm a user, not a vendor)

18pfsmt · on March 15, 2012

I can't figure out why you people are using URL shorteners on HN, but I believe it is not looked upon well. So, for others, these links are as follows:

http://www.theregister.co.uk/2012/02/17/cedexis_and_the_open...

http://translate.google.fr/translate?hl=fr&sl=fr&tl=...

mickeymoose · on March 15, 2012

Very interesting, flojibi. Another link about multi-cloud: http://bit.ly/zg37FQ

dsl · on March 15, 2012

Even funnier than that is watching the latency hit at Rackspace Cloud and Terremark as some non-trivial number of customers fail over.

rdl · on March 15, 2012

Do you work for a DNS provider or CDN or something (so as to see this in near realtime)? Envy.

I haven't seen a lot of people using both EC2 and Terremark for the same app -- kind of different markets. Not technically unreasonable, but Terremark seems to be more enterprise IT outsourcing, while EC2 (followed at a very far remove by the other clouds, including Rackspace) is more Internet-delivered consumer apps and the like, or at least larger-scale public services.

davidw · on March 15, 2012

Here's an idea I've thought about but don't have time to do anything with: a peer-to-peer monitoring network, so each new server on each new network makes it more robust. No idea how the details would work out.

rdl · on March 15, 2012

That gets done for network/application performance monitoring (alternatives to Keynote, Gomez, etc., and it's how some of their own products work). It's kind of overkill for basic application-level monitoring -- there's a tradeoff between the number of endpoints checking and the frequency of checks. I guess you could round-robin checks across a larger number of end nodes, too, to get both.
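
The round-robin idea can be sketched roughly as follows. This is only an illustration, not how any of the products mentioned actually work: the node names, the 60-second interval, and the function names are all made up for the example. The point is that the target still gets checked once per interval, while each individual node only runs a check once every (number of nodes × interval).

```python
from itertools import cycle


def round_robin_schedule(nodes, interval_s, ticks):
    """Return (time_s, node) pairs: one check per interval, rotating nodes.

    Hypothetical helper for illustration; it only plans the rotation and
    performs no network I/O.
    """
    rotation = cycle(nodes)
    return [(tick * interval_s, next(rotation)) for tick in range(ticks)]


def per_node_period(nodes, interval_s):
    """Seconds between two consecutive checks run by the same node."""
    return len(nodes) * interval_s


# Hypothetical monitoring nodes on different networks.
nodes = ["node-a", "node-b", "node-c"]

schedule = round_robin_schedule(nodes, interval_s=60, ticks=6)
for time_s, node in schedule:
    print(f"t={time_s:>4}s  check runs from {node}")

print("each node checks every", per_node_period(nodes, 60), "seconds")
```

Adding more nodes stretches the per-node period without changing how often the target is checked, which is the "get both" part. The catch is that a failure visible only from one vantage point takes up to a full rotation to be noticed.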