Comments, pictures, and links.

Wed 09 Apr 2014
  • Google Cloud Platform API issues on April 8th, 2014 - Google Groups
    "The root cause of the outage was an error in the GAPI client JavaScript library, which caused the ‘apiVersion’ parameter in API calls to be set to “undefined”." "Google engineers began a rollback to the previous version of the software package that includes the GAPI library. This process took approximately 3 hours 11 minutes to complete across the entire Google estate." "To prevent recurrences, Google engineers will review the GAPI JavaScript library testing and monitoring, and ensure that it is fully integrated with the release automation so that releases are detected and rolled back as necessary."
Fri 14 Mar 2014
  • GitHub: Denial of Service Attacks
    "After some investigation, we discovered that we were seeing several thousand HTTP requests per second distributed across thousands of IP addresses for a crafted URL. These requests were being sent to the non-SSL HTTP port and were then being redirected to HTTPS, which was consuming capacity in our load balancers and in our application tier. Unfortunately, we did not have a pre-configured way to block these requests and it took us a while to deploy a change to block them."
Thu 27 Feb 2014
  • Diary of an outage | FastMail Weblog
    "right at the top there was a complex series of folder renames within a single replication event. This is not a particularly unusual operation. This time it tripped a known, rare bug in the way renames are replicated that caused the replication process (sync_client) to abort. The Cyrus master daemon starts it up again, but then it hits the same point and dies again, over and over. Replication stops."
  • Incident Report: Amsterdam Data Center DNS Failure - DNSimple Blog
    "Multiple factors appear to have been involved: A combination of a traffic spike of unknown origin and of questionable purpose, combined with a bottleneck in the DNSimple name server software, and what appears to be upstream resolution blocking. Additionally, our desire to contain the incident to a single data center may have prolonged the outage."
Wed 26 Feb 2014
  • Google App Engine issues on February 20, 2014 - Google Groups
    "The root cause of the outage was four overlapping network element failures, due to reasons varying from fiber cut to optical equipment device failure to network router failure. These failures were not related, and would statistically be expected to overlap to the degree observed only once every several years."
Fri 07 Feb 2014
  • Pivotal Web Services Status - API Outage
    "We experienced a confirmed AWS s3 issue creating new buckets. The effect of this event was magnified by our attempts [to] recreate an existing bucket. [...] We deployed a production fix to remove the dependency on recreating buckets."