Final Root Cause Analysis and Improvement Areas: Nov 18 Azure Storage Service Interruption | Microsoft Azure Blog
"1. The standard flighting deployment policy of incrementally deploying changes across small slices was not followed. [...] 2. Although validation in test and pre-production had been done against Azure Table storage Front-Ends, the configuration switch was incorrectly enabled for Azure Blob storage Front-Ends."
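The incremental "flighting" policy described in the excerpt — deploying to progressively larger slices and validating each one before continuing — can be sketched as a toy loop. All names here are illustrative assumptions; this is not Azure's actual deployment tooling.

```python
# Toy sketch of incremental "flighting": deploy a change to progressively
# larger slices of a fleet, validating after each slice before expanding.
# Hypothetical helper, not real Azure infrastructure.

def flight_deployment(fleet, slices, validate):
    """Deploy to each cumulative slice in order; halt on the first failed check."""
    deployed = []
    for fraction in slices:
        target = max(1, int(len(fleet) * fraction))
        deployed.extend(fleet[len(deployed):target])
        if not validate(deployed):
            return False, deployed  # rollout halted; blast radius limited to this slice
    return True, deployed

fleet = [f"frontend-{i}" for i in range(10)]

# A validator that starts failing once more than 3 hosts carry the change:
# the rollout stops at the 50% slice instead of reaching the whole fleet.
ok, reached = flight_deployment(fleet, [0.1, 0.5, 1.0], lambda d: len(d) <= 3)
```

The point of the pattern is exactly what the quote describes being skipped: a bad change that passes the first small slice still gets caught before it reaches every front-end.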
Incident Report - DDoS Attack - DNSimple Blog
"A new customer signed up for our service and brought in multiple domains that were already facing a DDoS attack. The customer had already tried at least 2 other providers before DNSimple. Once the domains were delegated to us, we began receiving the traffic from the DDoS. DNSimple was not the target of the attack, nor were any of our other customers. The volume of the attack was approximately 25gb/s sustained traffic across our networks, with around 50 million packets per second. In this case, the traffic was sufficient enough to overwhelm the 4 DDoS devices we had placed in our data centers after a previous attack (there is also a 5th device, but it was not yet online in our network)."
craigslist DNS Outage | craigslist blog
"At approximately 5pm PST Sunday evening the craigslist domain name service (DNS) records maintained at one of our domain registrars were compromised, diverting users to various non-craigslist sites. This issue has been corrected at the source, but many internet service providers (ISPs) cached the false DNS information for several hours, and some may still have incorrect information."
Update on Azure Storage Service Interruption | Microsoft Azure Blog
"Prior to applying the performance update, it had been tested over several weeks in a subset of our customer-facing storage service for Azure Tables. We typically call this “flighting,” as we work to identify issues before we broadly deploy any updates. The flighting test demonstrated a notable performance improvement and we proceeded to deploy the update across the storage service. During the rollout we discovered an issue that resulted in storage blob front ends going into an infinite loop, which had gone undetected during flighting."
Anatomy of a Crushing (Pinboard Blog)
"The bad news was that it had never occurred to me to test the database under write load. Now, I can see the beardos out there shaking their heads. But in my defense, heavy write loads seemed like the last thing Pinboard would ever face. It was my experience that people approached an online purchase of six dollars with the same deliberation and thoughtfulness they might bring to bear when buying a new car. Prospective users would hand-wring for weeks on Twitter and send us closely-worded, punctilious lists of questions before creating an account. The idea that we might someday have to worry about write throughput never occurred to me. If it had, I would have thought it a symptom of nascent megalomania."
The network nightmare that ate my week
"I have come to the conclusion that so much in IPv6 design and implementation has been botched by protocol designers and vendors (both ours and others) that it is simply unsafe to run IPv6 on a production network except in very limited geographical circumstances and with very tight central administration of hosts."
Inherent Complexity of the Cloud: VS Online Outage Postmortem
"it appears that the outage is at least due in part to some license checks that had been improperly disabled, causing unnecessary traffic to be generated. Adding to the confusion (and possible causes) was the observation of “…a spike in latencies and failed deliveries of Service Bus messages”"
Stack Exchange Network Status — Outage Post-Mortem: August 25th, 2014
"a misleading comment in the iptables configuration led us to make a harmful change. The change had the effect of preventing the HAProxy systems from being able to complete a connection to our IIS web servers - the response traffic for those connections (the SYN/ACK packet) was suddenly being blocked."
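The failure mode in the excerpt — a firewall change that silently drops the SYN/ACK leg of the TCP handshake — can be modeled with a toy stateful packet filter. This is purely illustrative; the rules and field names below are assumptions, not Stack Exchange's actual iptables configuration.

```python
# Minimal model of a stateful packet filter. A handshake's SYN/ACK travels
# in the reverse direction of the initial SYN, so it is normally admitted
# by a rule accepting ESTABLISHED return traffic. Removing (or breaking)
# that rule drops the SYN/ACK and no connection can ever complete.
# Hypothetical rule set, not a real iptables configuration.

def filter_packet(pkt, accept_established):
    # Inbound NEW connections to the web port are allowed either way.
    if pkt["state"] == "NEW" and pkt["dport"] == 80:
        return "ACCEPT"
    # Return traffic (e.g. the SYN/ACK) relies on the connection-tracking rule.
    if pkt["state"] == "ESTABLISHED" and accept_established:
        return "ACCEPT"
    return "DROP"

syn_ack = {"state": "ESTABLISHED", "dport": 34567, "flags": "SYN/ACK"}

before = filter_packet(syn_ack, accept_established=True)   # "ACCEPT"
after = filter_packet(syn_ack, accept_established=False)   # "DROP"
```

In the toy model, as in the incident, the inbound SYN still looks fine — only the response path breaks, which is what makes this class of change hard to spot from the client side.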
Morgue: Helping Better Understand Events by Building a Post Mortem Tool - Bethany Macri on Vimeo
"My talk will be about why myself and another engineer built an internal post mortem tool called Morgue and the effect that the tool has had on our organization. Morgue formalized and systematized the way [my company] as a whole runs post mortems by focusing both the leader and the attendees of the post mortem on the most important aspects of resolving and understanding the event in a consistent way. In addition, the tool has facilitated relations between Ops and Engineers by increasing the awareness of Ops’ involvement in an outage and also by making all of the post mortems easily available to anyone in the organization. Lastly, all of our developers have access to the Morgue repository and have continued to develop features for the tool as improvements for conducting a post mortem have been suggested."
Contributors Section of Supermarket Disabled – Postmortem Meeting | Chef Blog
"At Chef, we conduct postmortem meetings for outages and issues with the site and services. Since Supermarket belongs to the community, and we are developing the application in the open, we would like to invite you, the community, to listen in or participate in public postmortem meetings for these outages."