Comments, pictures, and links.

Thu 11 Sep 2014
  • freistil IT » Post mortem: Network issues last week
    "our monitoring system started at about 10:10 UTC to alert us of network packet loss levels of 50% to 100% with a number of servers and a lot of failing service checks, which most of the times is a symptom of connectivity problems. We recognized quickly that most of the servers with bad connectivity were located in Hetzner datacenter #10. We also received Twitter posts from Hetzner customers whose servers were running in DC #10. This suggested a problem with a central network component, most probably a router or distribution switch."
  • Post Mortem - City Cloud
    "In a few minutes two nodes of two different replicating pairs experienced network failures. Still not a problem due to Gluster redundancy but clearly a sign of something not being right. While in discussions with Gluster to identify the cause one more node experiences network failure. This time in one of the pairs that already has a node offline. This causes all data located on that pair to become unavailable."
  • Fog Creek System Status: May 5-6 Network Maintenance Post-Mortem
    "During the process of rearchitecting our switch fabric's spanning tree (moving from a more control-centric per-vlan spanning tree to a faster-failover rapid spanning tree, ironically to keep downtime to a minimum), we suddenly lost access to our equipment."
Sun 24 Aug 2014
  • Google App Engine Issues With Datastore OverQuota Errors Beginning August 5th, 2014 - Google Groups
    SUMMARY: On Tuesday 5 August and Wednesday 6 August 2014, some billed applications incorrectly received quota exceeded errors for a small number of requests. We sincerely apologize if your application was affected.
    DETAILED DESCRIPTION OF IMPACT: Between Tuesday 5 August 11:39 and Wednesday 6 August 19:05 US/Pacific, some applications incorrectly received quota exceeded errors. The incident predominantly affected Datastore API calls. On Tuesday 5 August, 0.2% of applications using the Datastore received some incorrect quota exceeded errors. On Wednesday 6 August, 0.8% of applications using the Datastore received some incorrect quota exceeded errors. On Tuesday 5 August, 0.001% of Datastore API calls failed with quota exceeded for affected applications. On Wednesday 6 August, 0.0005% of Datastore API calls failed for affected applications.
    ROOT CAUSE: The root cause of this incident was a transient failure of the component that handles quota checking. The component has been corrected.
    REMEDIATION AND PREVENTION: The incident was resolved when the issue that caused the transient errors went away. To prevent a recurrence of similar incidents, we have enabled additional logging in the affected components so that we can more quickly diagnose and resolve similar issues.
Sun 17 Aug 2014
  • The Upload Outage of July 29, 2014 « Strava Engineering
    "Although the range of signed integers goes from -2147483648 to 2147483647, only the positive portion of that range is available for auto-incrementing keys. At 15:10, the upper limit was hit and insertions into the table started failing."
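    The ceiling Strava hit is easy to demonstrate. A minimal Python sketch (function and constant names are illustrative, not Strava's code) of a signed 32-bit auto-incrementing key running out of room:

```python
INT32_MAX = 2_147_483_647  # upper bound of a signed 32-bit integer

def next_id(current_id: int) -> int:
    """Simulate an auto-incrementing 32-bit signed primary key."""
    if current_id >= INT32_MAX:
        # Mirrors the failure mode in the quote: inserts fail at the ceiling.
        raise OverflowError("auto-increment key exhausted")
    return current_id + 1

assert next_id(100) == 101       # well below the limit, inserts succeed
try:
    next_id(INT32_MAX)           # at the limit, every insert fails
except OverflowError as e:
    print(e)                     # prints "auto-increment key exhausted"
```

    Widening the column to a 64-bit integer (or using unsigned keys) is the usual fix, at the cost of a potentially long schema migration on a large table.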
Wed 23 Jul 2014
  • BBC Online Outage on Saturday 19th July 2014
    "At 9.30 on Saturday morning (19th July 2014) the load on the database went through the roof, meaning that many requests for metadata to the application servers started to fail. The immediate impact of this depended on how each product uses that data. In many cases the metadata is cached at the product level, and can continue to serve content while attempting to revalidate. In some cases (mostly older applications), the metadata is used directly, and so those products started to fail." "At almost the same time we had a second problem."
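    The "cached at the product level, and can continue to serve content while attempting to revalidate" behaviour the BBC describes is a serve-stale-on-error cache. A hypothetical sketch (class and parameter names are mine, not the BBC's):

```python
import time

class MetadataCache:
    """Serve a stale entry when the backing store fails (illustrative sketch)."""

    def __init__(self, fetch, ttl=60):
        self.fetch = fetch   # callable that hits the metadata service
        self.ttl = ttl       # freshness window in seconds
        self.store = {}      # key -> (value, fetched_at)

    def get(self, key, now=None):
        now = time.time() if now is None else now
        cached = self.store.get(key)
        if cached and now - cached[1] < self.ttl:
            return cached[0]            # fresh enough, no backend call
        try:
            value = self.fetch(key)     # attempt to revalidate
            self.store[key] = (value, now)
            return value
        except Exception:
            if cached:
                return cached[0]        # stale, but better than an error page
            raise  # no cached copy: the "older applications" failure in the quote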
Wed 09 Jul 2014
  • The npm Blog — 2014-01-28 Outage Postmortem
    "While making a change to simplify the Varnish VCL config on Fastly, we added a bug that caused all requests to go to Manta, including those that should have gone to CouchDB. Since Manta doesn’t know how to handle requests like /pkgname, these all returned 403 Forbidden responses. Because Fastly is configured to not cache error codes, this proliferation of 403 responses led to a thundering herd which took a bit of time to get under control. With the help of the Fastly support team, we have identified the root cause and it is now well understood"
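    The amplification here comes from one detail: error responses were not cached, so every misrouted request reached the origin. A toy simulation (the class and counts are illustrative, not Fastly's VCL):

```python
class Cache:
    """A CDN-style cache that, like the Fastly config in the quote,
    does not store error responses."""

    def __init__(self, origin, cache_errors=False):
        self.origin = origin
        self.cache_errors = cache_errors
        self.store = {}

    def get(self, path):
        if path in self.store:
            return self.store[path]
        status = self.origin(path)
        if status == 200 or self.cache_errors:
            self.store[path] = status
        return status

hits = {"count": 0}
def origin(path):
    hits["count"] += 1
    return 403   # misrouted to Manta, which rejects /pkgname requests

cdn = Cache(origin)              # errors bypass the cache
for _ in range(1000):
    cdn.get("/some-package")
print(hits["count"])             # prints 1000: every request hit the origin
```

    With a healthy 200 response, only the first request would have reached the origin; the 403s turned the cache into a pass-through and produced the thundering herd.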
  • NY1 (Equinix) Power Issue Postmortem | DigitalOcean
    2013-11-25 "When the redundancy failed and another UPS did not take over, it essentially meant that power was cut off to equipment. UPS7 then hard rebooted and was back online, which then resumed the flow of power to equipment; however, there was an interruption of several minutes in between."
  • Stack Exchange Network Status — 2013-10-13 Outage PostMortem
    "A further loss of communication between the 2 nodes while Oregon is offline results in a quorum loss from the point of view of both members. To prevent a split-brain situation, the nodes enter an effective offline state when a loss of quorum occurs. When windows clustering observes a quorum loss, it initiates a state change of orphaned SQL resources (the availability groups the databases affected belong to). In the case of NY-SQL03 (the primary before the event), the databases were both not primary and not available since the AlwaysOn Availability Group was offline to prevent split brain"
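    The quorum arithmetic behind this is simple majority voting. A minimal sketch (the three-member layout matches the quote: two NY nodes plus the Oregon witness; the function is illustrative):

```python
def has_quorum(reachable_votes: int, total_votes: int) -> bool:
    """A node keeps serving only while it can see a strict majority."""
    return reachable_votes > total_votes // 2

# Oregon offline: each NY node still sees 2 of 3 votes -- quorum holds.
assert has_quorum(2, 3)

# The NY link then also drops: each node sees only its own vote, 1 of 3.
# Neither has a majority, so both go offline rather than risk split-brain,
# taking NY-SQL03's availability groups down with them.
assert not has_quorum(1, 3)
```

    The strict majority rule is what makes the behaviour safe but also what made it take the primary offline: with no majority anywhere, no node can prove it is the sole survivor.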
  • NY2 Network Upgrade Postmortem | DigitalOcean
    2013 "On October 25th we observed a network interruption whereby the two core routers began flapping and their redundant protocol was not allowing either one to take over as the active device and push traffic out to our providers."
  • 2013-09-17 Outage Postmortem | AppNexus Tech Blog
    "the data update that caused the problem was a delete on a rarely-changed in-memory object. The result of the processing of the update is to unlink the deleted object from other objects, and schedule the object’s memory for deletion at what is expected to be a safe time in the future. This future time is a time when any thread that could have been using the old version at the time of update would no longer be using it. There was a bug in the code that deleted the object twice, and when it finally executed, it caused the crash."
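    The deferred-reclamation scheme described above, and the double-delete that broke it, can be sketched in miniature (class and names are mine; the real system is C/C++ where the second free is a crash, not an exception):

```python
class DeferredReclaimer:
    """Schedule objects for deletion at a future 'safe' time, once no
    reader thread can still hold the old version (illustrative sketch)."""

    def __init__(self):
        self.pending = []    # (safe_time, obj_id)
        self.freed = set()

    def schedule(self, obj_id, safe_time):
        self.pending.append((safe_time, obj_id))

    def run(self, now):
        due = [p for p in self.pending if p[0] <= now]
        self.pending = [p for p in self.pending if p[0] > now]
        for _, obj_id in due:
            if obj_id in self.freed:
                # The bug in miniature: the object was scheduled twice,
                # so the second pass frees already-freed memory.
                raise RuntimeError(f"double free of object {obj_id}")
            self.freed.add(obj_id)

r = DeferredReclaimer()
r.schedule("obj-42", safe_time=10)
r.schedule("obj-42", safe_time=10)   # the buggy duplicate delete
try:
    r.run(now=20)                    # "finally executed" well after the update
except RuntimeError as e:
    print(e)                         # prints "double free of object obj-42"
```

    Note the delay between the bad update and the crash: the damage is queued at update time but only detonates when the safe time arrives, which is what makes such bugs hard to trace back to their cause.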
  • Downtime Postmortem: the nitty-gritty nerd story
    2013-09-11 "A few curious technically minded folks have wondered what exactly happened during our epic 30 hour downtime." "A bug in the virtualization software caused the system to loop on itself until it ran out of resources and stopped responding."
  • SIPB Outage 6/20/2013 Post-Mortem | Alex Chernyakhovsky
    "We just passed 10 years of cumulative uptime!" an XVM maintainer announced, celebrating how reliable the 8 servers that power the SIPB XVM Virtual Machine Service have been. Just an hour later, at 4:38 PM, Nagios alerted "Host DOWN PROBLEM alert for xvm!"
  • Our First Postmortem | The Circle Blog
    2013-05-21 "the sudden growth put a lot of strain on our system as a whole, and blew several of what might otherwise have been small problems into huge ones. We spent the last week of March and the first few weeks of April in almost full-time firefighting. Here’s what happened…"
  • A Post-Mortem on India's Blackout - IEEE Spectrum
    2012-08-06 "What set the stage for last week’s power outage in India, which left some 650 million people without electricity, was a widening rift between growing peak demand and the amount of generation available to meet that demand."
  • Foursquare outage post mortem - 2010-07-10
    "As many of you are aware, Foursquare had a significant outage this week. The outage was caused by capacity problems on one of the machines hosting the MongoDB database used for check-ins. This is an account of what happened, why it happened, how it can be prevented, and how 10gen is working to improve MongoDB in light of this outage."
  • Pagekite - Certificate expiration problem and postmortem
    "The root cause of this event was simply human error: we were aware that our certificates were expiring and had begun work on renewing them, but being somewhat distracted by the holidays, we didn't read our e-mail carefully enough and overlooked the fact that the front-end certificate was set to expire a couple of days earlier than the others. In order to prevent this problem from reoccurring, we are taking the following steps: Reducing the number of certificates in use by the service, to simplify management Improving our automated monitoring to monitor certificate expiration Improving our automated monitoring to monitor end-to-end service availability"
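    The "monitor certificate expiration" remediation is straightforward to automate with the Python standard library. A sketch (function names and the 14-day threshold are illustrative, not Pagekite's tooling):

```python
import socket
import ssl
from datetime import datetime, timezone

def parse_not_after(not_after: str) -> datetime:
    """Parse the 'notAfter' string as returned by ssl.getpeercert()."""
    dt = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    return dt.replace(tzinfo=timezone.utc)

def days_until_expiry(host: str, port: int = 443) -> float:
    """Connect over TLS and return days until the peer certificate expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            not_after = tls.getpeercert()["notAfter"]
    delta = parse_not_after(not_after) - datetime.now(timezone.utc)
    return delta.total_seconds() / 86400

if __name__ == "__main__":
    remaining = days_until_expiry("example.com")
    if remaining < 14:
        print(f"WARNING: certificate expires in {remaining:.1f} days")
```

    Run per front-end host from cron or a monitoring system; checking each endpoint directly (rather than a list of certificates) also catches the failure mode in the quote, where one certificate expired days before its siblings.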
  • Lessons Learned from Skype’s 24-hr Outage
    [[Note: the original postmortem on Skype's blog seems to be missing now]] "On December 22nd, 1600 GMT, the Skype services started to become unavailable, in the beginning for a small part of the users, then for more and more, until the network was down for about 24 hours."
  • Heroku | Tuesday Postmortem 2010-10-27
    "A slowdown in our internal messaging systems caused a previously unknown bug in our distributed routing mesh to be triggered. This bug caused the routing mesh to fail. After isolating the bug, we attempted to roll back to a previous version of the routing mesh code. While the rollback solved the initial problem, there was an unexpected incompatibility between the routing mesh and our caching service. This incompatibility forced us to move back to the newer routing mesh code, which required us to perform a “hot patch” of the production system to fix the initial bug."
  • Downtime Postmortem - @graysky
    "There was less than 15MB of free memory and little swap left due to the stale processes. Not good. Then I put on the straw that broke the camel's back. While trying to kill one of the stale processes, the machine locked up when it ran out of swap space. The Engine Yard configuration has the "app master" server double as both an application server and the load balancer, through haproxy, to the other application instances. This means that when that instance became unresponsive, the whole site went down. So now the clock is ticking (and I'm swearing to myself). Engine Yard's service noticed within 60 seconds that the app master was unresponsive. It automatically killed the existing app master instance, promoted one of the other app clones to be the master and created a fresh app instance to replace the clone. This worked smoothly, except for two issues."
  • Postmortem of today’s 8min indexing downtime » The Algolia Blog
    "This morning I fixed a rare bug in indexing complex hierarchical objects. This fix successfully passed all the tests after development. We have 6000+ unit tests and asserts, and 200+ non regression tests. So I felt confident when I entered the deploy password in our automatic deployment script. A few seconds after, I started to receive a lot of text messages on my cellphone."
  • Basecamp network attack postmortem
    "The attack was a combination of SYN flood, DNS reflection, ICMP flooding, and NTP amplification. The combined flow was in excess of 20Gbps. Our mitigation strategy included filtering through a single provider and working with them to remove bogus traffic. To reiterate, no data was compromised in this attack. This was solely an attack on our customers’ ability to access Basecamp and the other services. There are two main areas we will improve upon following this event. "
  • Supermarket HTTPS Redirect Postmortem | Chef Blog
    "we weren't enforcing HTTPS/SSL for the site. Yesterday, we deployed a change to enforce redirection from HTTP to HTTPS at the application level, which wound up loading a default Nginx page. This meant that the Supermarket was closed! We're sorry about that. Even though the site isn't considered production ("beta"!), we took this outage as seriously as any other."

See older items in the stream archive.