Comments, pictures, and links.

Thu 14 May 2015
  • Elevated latency and error rate for Google Compute Engine API - Google Groups
    "However, a software bug in the GCE control plane interacted poorly with this change and caused API requests directed to us-central1-a to be rejected starting at 03:21 PDT. Retries and timeouts from the failed calls caused increased load on other API backends, resulting in higher latency for all GCE API calls. The API issues were resolved when Google engineers identified the control plane issue and corrected it at 04:59 PDT, with the backlog fully cleared by 05:12 PDT. "
Wed 13 May 2015
  • Code Climate Status - Inaccurate Analysis Results
    "Also on May 8th, we deployed instrumentation and logging to track when our cached Git blob data did not match the actual contents on disk. We found no further mismatches on new analyses, supporting the theory that the issue was ephemeral and no longer present. Around this time we began a process of re-running old analyses that had failed, and were able to reproduce the issue. This was a critical learning, because it refuted the theory that the issue was ephemeral. With this information, we took a closer look at the objects in the analysis-level cache. We discovered that these marshaled Ruby objects did not in fact hold a reference to the contents of files as we originally believed. Problematically, the object held a reference to the Git service URL to use for remote procedure calls. When a repository was migrated, this cache key was untouched. This outdated reference led to cat-file calls being issued to the old server instead of the new server"
Fri 17 Apr 2015
  • Stack Exchange Network Status — Outage Postmortem: January 6th, 2015
    "With no way to get our main IP addresses accessible to most users, our options were to either fail over to our DR datacenter in read-only mode, or to enable CloudFlare - we’ve been testing using them for DDoS mitigation, and have separate ISP links in the NY datacenter which are dedicated to traffic from them. We decided to turn on CloudFlare, which caused a different problem - caused by our past selves."
  • DripStat — Post mortem of yesterday's outage
    "1. RackSpace had an outage in their Northern Virginia region. 2. We were getting DDOS’d. 3. The hypervisor Rackspace deployed our cloud server on was running into issue and would keep killing our java process. We were able to diagnose 2 and 3 only after Rackspace recovered from their long load balancer outage. The fact that all 3 happened at the same time did not help issues either."
  • Blog - Tideways
    "On Wednesday 6:05 Europe/Berlin time our Elasticsearch cluster went down when it ran OutOfMemory and file descriptors. One node of the cluster did not recover from this error anymore and the other responded to queries with failure. The workers processing the performance and trace event log data with Beanstalk message queue stopped. "
  • CopperEgg Status - Probe widgets not polling data
    "the primary of a redundant pair of data servers for one of our customer data clusters locked up hard in Amazon. An operations engineer responded to a pager alert and ensured failover had worked as designed; there was a brief period of probe delay on that cluster from the initial failover but service was only briefly interrupted and then the system was working fine. The failed server had to be hard rebooted and when it was, its data was corrupted and the server had to be rebuilt, and was then set up to resync its data with the live server. A manual error was made during the rebuild and replication was set up in an infinite loop. "
  • Freckle Time Tracking Status - Freckle is down
    "The underlying reason why nginx didn't start was that DNS was not working properly—nginx checks SSL certificates and it couldn't resolve one of the hosts needed to verify our main SSL certificate. We don't know why DNS didn't resolve, but it's likely that to the large number of booted servers in the Rackspace datacenter there was a temporary problem with DNS resolution requests.)"
  • Dead Man's Snitch — Postmortem: March 6th, 2015
    "On Friday, March 6th we had a major outage caused by a loss of historical data. During the outage we failed to alert on missed snitch check-ins and sent a large number of erroneous failure alerts for healthy snitches. It took 8 hours to restore or reconstruct all missing data and get our systems stabilized."
  • Travis CI Status - Slow .com build processing
    "Two runaway TLS connections inside our primary RabbitMQ node that were causing high CPU usage on that node. Once this was found, we deemed the high channel count a red herring and instead started work on the stuck connections."
  • Travis CI Status - Slow .com build processing
    "We looked at our metrics and quickly realised that our RabbitMQ instance had gone offline at 17:30 UTC. We tried to bring it back up, but it wouldn’t start up cleanly. One of the remediation actions after Tuesday’s RabbitMQ outage was to upgrade our cluster to run on more powerful servers, so we decided that instead of debugging why our current cluster wasn’t starting we’d perform emergency maintenance and spin up a new cluster."
  • Balanced Partial Outage Post Mortem - 2015-03-15
    Balanced experienced a partial outage that affected 25% of card processing transactions between 8:40AM and 9:42AM this morning due to a degraded machine which was not correctly removed from the load balancer. The core of the issue was in our secure vault system, which handles storage and retrieval of sensitive card data. One of the machines stopped sending messages, which caused some requests to be queued but not processed, while our automated health checks did not flag the machine as unhealthy.
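The failure mode here is a liveness-only health check: the machine still answered probes while its queue silently backed up. A hedged sketch of a check that also fails when work stops draining; the thresholds and signature are invented for illustration:

```python
import time

def healthy(last_message_processed_at, queue_depth, now=None,
            max_idle_s=60, max_depth=1000):
    """A host is healthy only if it is both reachable *and* draining
    its queue: recent processing activity and a bounded backlog."""
    now = time.time() if now is None else now
    return (now - last_message_processed_at) <= max_idle_s and queue_depth <= max_depth

now = 1_000_000.0
print(healthy(now - 5, queue_depth=10, now=now))      # actively processing
print(healthy(now - 600, queue_depth=5000, now=now))  # stalled: pull from rotation
```

A load balancer driven by this kind of check would have ejected the degraded vault machine even though it was still responding to pings.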
Wed 04 Mar 2015
  • Postmortem: Storify downtime on March 2nd (with image) · storifydev · Storify
    "The problem was that we had one dropped index in our application code. This meant that whenever the new primary took the lead, the application asked to build that index. It was happening in the background, so it was kind of ok for the primary. But as soon as the primary finished, all the secondaries started building it in the foreground, which meant that our application couldn't reach MongoDB anymore."
Fri 20 Feb 2015
  • GCE instances are not reachable
    "ROOT CAUSE [PRELIMINARY] The internal software system which programs GCE’s virtual network for VM egress traffic stopped issuing updated routing information. The cause of this interruption is still under active investigation. Cached route information provided a defense in depth against missing updates, but GCE VM egress traffic started to be dropped as the cached routes expired. "

See older items in the stream archive.