Comments, pictures, and links.

Sun 17 Aug 2014
  • The Upload Outage of July 29, 2014 « Strava Engineering
    "Although the range of signed integers goes from -2147483648 to 2147483647, only the positive portion of that range is available for auto-incrementing keys. At 15:10, the upper limit was hit and insertions into the table started failing."
Wed 23 Jul 2014
  • BBC Online Outage on Saturday 19th July 2014
    "At 9.30 on Saturday morning (19th July 2014) the load on the database went through the roof, meaning that many requests for metadata to the application servers started to fail. The immediate impact of this depended on how each product uses that data. In many cases the metadata is cached at the product level, and can continue to serve content while attempting to revalidate. In some cases (mostly older applications), the metadata is used directly, and so those products started to fail." "At almost the same time we had a second problem."
Wed 09 Jul 2014
  • The npm Blog — 2014-01-28 Outage Postmortem
    "While making a change to simplify the Varnish VCL config on Fastly, we added a bug that caused all requests to go to Manta, including those that should have gone to CouchDB. Since Manta doesn’t know how to handle requests like /pkgname, these all returned 403 Forbidden responses. Because Fastly is configured to not cache error codes, this proliferation of 403 responses led to a thundering herd which took a bit of time to get under control. With the help of the Fastly support team, we have identified the root cause and it is now well understood"
  • NY1 (Equinix) Power Issue Postmortem | DigitalOcean
    2013-11-25 "When the redundancy failed and another UPS did not take over, it essentially meant that power was cut off to equipment. UPS7 then hard rebooted and was back online, which then resumed the flow of power to equipment; however, there was an interruption of several minutes in between."
  • Stack Exchange Network Status — 2013-10-13 Outage PostMortem
    " A further loss of communication between the 2 nodes while Oregon is offline results in a quorum loss from the point of view of both members. To prevent a split-brain situation, the nodes enter an effective offline state when a loss of quorum occurs. When windows clustering observes a quorum loss, it initiates a state change of orphaned SQL resources (the availability groups the databases affected belong to). In the case of NY-SQL03 (the primary before the event), the databases were both not primary and not available since the AlwaysOn Availability Group was offline to prevent split brain"
  • NY2 Network Upgrade Postmortem | DigitalOcean
    2013 "On October 25th we observed a network interruption whereby the two core routers began flapping and their redundant protocol was not allowing either one to take over as the active device and push traffic out to our providers."
  • 2013-09-17 Outage Postmortem | AppNexus Tech Blog
    " the data update that caused the problem was a delete on a rarely-changed in-memory object. The result of the processing of the update is to unlink the deleted object from other objects, and schedule the object’s memory for deletion at what is expected to be a safe time in the future. This future time is a time when any thread that could have been using the old version at the time of update would no longer be using it. There was a bug in the code that deleted the object twice, and when it finally executed, it caused the crash. "
  • Downtime Postmortem: the nitty-gritty nerd story
    2013-09-11 "A few curious technically minded folks have wondered what exactly happened during our epic 30 hour downtime." "A bug in the virtualization software caused the system to loop on itself until it ran out of resources and stopped responding."
  • SIPB Outage 6/20/2013 Post-Mortem | Alex Chernyakhovsky
    "We just passed 10 years of cumulative uptime!" an XVM maintainer announced, celebrating how reliable the 8 servers that power the SIPB XVM Virtual Machine Service have been. Just an hour later, at 4:38 PM, Nagios alerted "Host DOWN PROBLEM alert for xvm!"
  • Our First Postmortem | The Circle Blog
    2013-05-21 "the sudden growth put a lot of strain on our system as a whole, and blew several of what might otherwise have been small problems into huge ones. We spent the last week of March and the first few weeks of April in almost full-time firefighting. Here’s what happened…"
  • A Post-Mortem on India's Blackout - IEEE Spectrum
    2012-08-06 "What set the stage for last week’s power outage in India, which left some 650 million people without electricity, was a widening rift between growing peak demand and the amount of generation available to meet that demand."
  • Foursquare outage post mortem - 2010-10-07
    "As many of you are aware, Foursquare had a significant outage this week. The outage was caused by capacity problems on one of the machines hosting the MongoDB database used for check-ins. This is an account of what happened, why it happened, how it can be prevented, and how 10gen is working to improve MongoDB in light of this outage."
  • Pagekite - Certificate expiration problem and postmortem
    "The root cause of this event was simply human error: we were aware that our certificates were expiring and had begun work on renewing them, but being somewhat distracted by the holidays, we didn't read our e-mail carefully enough and overlooked the fact that the front-end certificate was set to expire a couple of days earlier than the others. In order to prevent this problem from reoccurring, we are taking the following steps: Reducing the number of certificates in use by the service, to simplify management Improving our automated monitoring to monitor certificate expiration Improving our automated monitoring to monitor end-to-end service availability"
  • Lessons Learned from Skype’s 24-hr Outage
    [[Note: the original postmortem on Skype's blog seems to be missing now]] "On December 22nd, 1600 GMT, the Skype services started to become unavailable, in the beginning for a small part of the users, then for more and more, until the network was down for about 24 hours."
  • Heroku | Tuesday Postmortem 2010-10-27
    "A slowdown in our internal messaging systems caused a previously unknown bug in our distributed routing mesh to be triggered. This bug caused the routing mesh to fail. After isolating the bug, we attempted to roll back to a previous version of the routing mesh code. While the rollback solved the initial problem, there as an unexpected incompatibility between the routing mesh and our caching service. This incompatibility forced us to move back to the newer routing mesh code, which required us to perform a “hot patch” of the production system to fix the initial bug. "
  • Downtime Postmortem - @graysky
    "There was less than 15MB of free memory and little swap left due to the stale processes. Not good. Then I put on the straw that broke the camel's back. While trying to kill one of the stale processes, the machine locked up when it ran out of swap space. The Engine Yard configuration has the "app master" server double as both an application server and the load balancer, through haproxy, to the other application instances. This means that when that instance became unresponsive, the whole site went down. So now the clock is ticking (and I'm swearing to myself). Engine Yard's service noticed within 60 seconds that the app master was unresponsive. It automatically killed the existing app master instance, promoted one of the other app clones to be the master and created a fresh app instance to replace the clone. This worked smoothly, except for two issues."
  • Postmortem of today’s 8min indexing downtime » The Algolia Blog
    "This morning I fixed a rare bug in indexing complex hierarchical objects. This fix successfully passed all the tests after development. We have 6000+ unit tests and asserts, and 200+ non regression tests. So I felt confident when I entered the deploy password in our automatic deployment script. A few seconds after, I started to receive a lot of text messages on my cellphone."
  • Basecamp network attack postmortem
    "The attack was a combination of SYN flood, DNS reflection, ICMP flooding, and NTP amplification. The combined flow was in excess of 20Gbps. Our mitigation strategy included filtering through a single provider and working with them to remove bogus traffic. To reiterate, no data was compromised in this attack. This was solely an attack on our customers’ ability to access Basecamp and the other services. There are two main areas we will improve upon following this event. "
  • Supermarket HTTPS Redirect Postmortem | Chef Blog
    "we weren't enforcing HTTPS/SSL for the site. Yesterday, we deployed a change to enforce redirection from HTTP to HTTPS at the application level, which wound up loading a default Nginx page. This meant that the Supermarket was closed! We're sorry about that. Even though the site isn't considered production ("beta"!), we took this outage as seriously as any other."
Thu 03 Jul 2014
  • When your backup isn't a backup: a postmortem
    "Early yesterday, we started a query to remove old data from the content database, using a full table scan update and delete. We had safely performed the same query many times before, but thanks to a recent build the database was larger, and we were closer to the physical limits of the disk. When doing a full table update, postgres can sometimes require as much free space as used space to successfully complete a query. Ordinarily this isn’t a problem, but because we were lower on free space the query behaved unexpectedly, and we lost a large fraction of the content before we were able to stop the process. Because the error occured during a modification to the production database, our only options were a full rebuild (~3 days) or a restore from backup (faster, but still many hours)."
Wed 25 Jun 2014
  • Heroku Incident 642
    "When we change the credentials on a Redis server, the dyno manager and the runtime agents talking to it should failover to the unchanged servers until they’ve been reconfigured to use the new credentials. Due to incorrect operational documentation, we failed to perform this procedure correctly, and changed the credentials on all four Redis servers at the same time."
Archive

See older items in the stream archive.