Comments, pictures, and links.
Diary of an outage | FastMail Weblog
"right at the top there was a complex series of folder renames within a single replication event. This is not a particularly unusual operation. This time it tripped a known, rare bug in the way renames are replicated that caused the replication process (sync_client) to abort. The Cyrus master daemon starts it up again, but then it hits the same point and dies again, over and over. Replication stops."
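The restart-and-die cycle described in that excerpt is a crash loop. As a rough sketch only (not how the Cyrus master daemon works; the class name, thresholds, and method are hypothetical), a supervisor could count exits within a sliding window and stop restarting once they pile up:

```python
from collections import deque


class CrashLoopDetector:
    """Hypothetical helper: flags a crash loop when too many exits
    land inside a sliding time window."""

    def __init__(self, max_crashes=3, window_seconds=60.0):
        self.max_crashes = max_crashes
        self.window_seconds = window_seconds
        self._crashes = deque()  # crash timestamps, oldest first

    def record_crash(self, now):
        """Record a crash at time `now` (seconds); return True if looping."""
        self._crashes.append(now)
        # Drop crashes that have aged out of the window.
        while self._crashes and now - self._crashes[0] > self.window_seconds:
            self._crashes.popleft()
        return len(self._crashes) >= self.max_crashes
```

A supervisor that gets `True` back might page someone rather than restart the process yet again.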
Incident Report: Amsterdam Data Center DNS Failure - DNSimple Blog
"Multiple factors appear to have been involved: A combination of a traffic spike of unknown origin and of questionable purpose, combined with a bottleneck in the DNSimple name server software, and what appears to be upstream resolution blocking. Additionally, our desire to contain the incident to a single data center may have prolonged the outage."
Google App Engine issues on February 20, 2014 - Google Groups
"The root cause of the outage was four overlapping network element failures, due to reasons varying from fiber cut to optical equipment device failure to network router failure. These failures were not related, and would statistically be expected to overlap to the degree observed only once every several years."
Pivotal Web Services Status - API Outage
"We experienced a confirmed AWS s3 issue creating new buckets. The effect of this event was magnified by our attempts [to] recreate an existing bucket. [...] We deployed a production fix to remove the dependency on recreating buckets."
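The report does not say how the fix was implemented. One plausible shape, sketched here against a hypothetical in-memory store rather than any real S3 client, is to check for the bucket first and call create only when it is actually missing:

```python
class InMemoryStore:
    """Hypothetical stand-in for an object store; not a real S3 API."""

    def __init__(self, buckets=(), create_available=True):
        self.buckets = set(buckets)
        self.create_available = create_available

    def exists(self, name):
        return name in self.buckets

    def create(self, name):
        if not self.create_available:
            raise RuntimeError("bucket creation unavailable")
        self.buckets.add(name)


def ensure_bucket(store, name):
    """Create `name` only if it is missing; return True if it was created."""
    if store.exists(name):
        # Existing bucket: never touch the create path.
        return False
    store.create(name)
    return True
```

With this shape, an outage limited to bucket creation does not affect callers whose bucket already exists.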
Google Compute Engine Load Balancing Outage This Past Weekend - Google Groups
"there was an issue with Google Compute Engine’s load balancing control plane that was triggered when we began terminating instances in that zone. This prevented the load balancing service from creating new configurations"
Official Blog: Today’s outage for several Google services
"At 10:55 a.m. PST this morning, an internal system that generates configurations—essentially, information that tells other systems how to behave—encountered a software bug and generated an incorrect configuration. The incorrect configuration was sent to live services over the next 15 minutes"
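A bad configuration reaching every live service within minutes is the case a staged rollout is meant to contain. The sketch below is illustrative only and is not Google's mechanism; `apply`, `healthy`, and `canary_count` are hypothetical names supplied by the caller:

```python
def staged_rollout(config, targets, apply, healthy, canary_count=1):
    """Apply `config` to a canary subset first, and continue to the
    remaining targets only if every canary still reports healthy.

    Returns (targets_applied, completed).
    """
    canaries = targets[:canary_count]
    rest = targets[canary_count:]
    applied = []
    for target in canaries:
        apply(target, config)
        applied.append(target)
    if not all(healthy(target) for target in canaries):
        # Halt: the bad config stays confined to the canaries.
        return applied, False
    for target in rest:
        apply(target, config)
        applied.append(target)
    return applied, True
```

The trade-off is slower propagation of good configurations in exchange for a smaller blast radius when a bad one is generated.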