Travis CI Status - Elevated wait times and timeouts for OSX builds
"We noticed an error pointing towards build VM boot timeouts at 23:21 UTC on the 20th. After discussion with our infrastructure provider, it was shown to us that our SAN (a NetApp appliance) was being overloaded due to a spike in disk operations per second."
CircleCI Status - DB performance issue
"The degradation in DB performance was a special kind of non-linear, going from "everything is fine" to "fully unresponsive" within 2 minutes. Symptoms included a long list of queued builds, and each query taking a massive amount of time to run, alongside many queries timing out."
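The "fine one minute, unresponsive the next" behavior described above is characteristic of queueing systems near saturation. A minimal sketch, using the standard M/M/1 queueing formula (an assumption for illustration, not CircleCI's actual model), shows how response time explodes non-linearly as load approaches capacity:

```python
# Why DB degradation looks non-linear: mean response time in an
# M/M/1 queue is 1 / (service_rate - arrival_rate), which blows up
# as the arrival rate approaches capacity.
def mean_response_time(service_rate: float, arrival_rate: float) -> float:
    """Mean time a request spends in the system (seconds)."""
    if arrival_rate >= service_rate:
        return float("inf")  # queue grows without bound: "fully unresponsive"
    return 1.0 / (service_rate - arrival_rate)

# A hypothetical DB that can serve 100 queries/sec:
for load in (50.0, 90.0, 99.0, 99.9):
    t = mean_response_time(100.0, load)
    print(f"{load:5.1f} q/s -> {t * 1000:9.1f} ms")
```

Doubling load from 50 to 99.9 queries/sec raises mean latency by roughly 500x, which is why the transition from healthy to timed-out can fit inside a two-minute window.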
NYSE Blames Trading Outage on Software Upgrade | Traders Magazine Online News
"On Tuesday evening, the NYSE began the rollout of a software release in preparation for the July 11 industry test of the upcoming SIP timestamp requirement. As is standard NYSE practice, the initial release was deployed on one trading unit. As customers began connecting after 7am on Wednesday morning, there were communication issues between customer gateways and the trading unit with the new release. It was determined that the NYSE and NYSE MKT customer gateways were not loaded with the proper configuration compatible with the new release."
Elevated latency and error rate for Google Compute Engine API - Google Groups
"However, a software bug in the GCE control plane interacted poorly with this change and caused API requests directed to us-central1-a to be rejected starting at 03:21 PDT. Retries and timeouts from the failed calls caused increased load on other API backends, resulting in higher latency for all GCE API calls. The API issues were resolved when Google engineers identified the control plane issue and corrected it at 04:59 PDT, with the backlog fully cleared by 05:12 PDT."
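The amplification mechanism in the GCE incident, where retries of failed calls raise load on the surviving backends, can be sketched with a simple geometric model. The function and parameters below are illustrative assumptions, not Google's actual retry policy:

```python
# Retry amplification: if a fraction of attempts fail and each failure
# is retried, total traffic multiplies beyond the nominal request rate.
def effective_load(base_rps: float, failure_rate: float, max_retries: int) -> float:
    """Total request rate (requests/sec) including retries, assuming
    each failed attempt is retried up to max_retries times."""
    total = 0.0
    attempt_rps = base_rps
    for _ in range(max_retries + 1):
        total += attempt_rps
        attempt_rps *= failure_rate  # only failed attempts are retried
    return total

# With half of calls being rejected and clients retrying 3 times,
# backends see nearly double the nominal traffic:
print(effective_load(1000.0, 0.5, 3))  # 1000 + 500 + 250 + 125 = 1875.0
```

This is why retry policies are usually paired with exponential backoff and jitter: without them, a partial outage in one zone can degrade latency for every caller, as happened here.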
Code Climate Status - Inaccurate Analysis Results
"Also on May 8th, we deployed instrumentation and logging to track when our cached Git blob data did not match the actual contents on disk. We found no further mismatches on new analyses, supporting the theory that the issue was ephemeral and no longer present. Around this time we began a process of re-running old analyses that had failed, and were able to reproduce the issue. This was a critical learning, because it refuted the theory that the issue was ephemeral. With this information, we took a closer look at the objects in the analysis-level cache. We discovered that these marshaled Ruby objects did not in fact hold a reference to the contents of files as we originally believed. Problematically, the object held a reference to the Git service URL to use for remote procedure calls. When a repository was migrated, this cache key was untouched. This outdated reference led to cat-file calls being issued to the old server instead of the new server."
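The failure mode above, where a serialized object freezes a service URL that later goes stale, is easy to reproduce in miniature. The sketch below uses Python's `pickle` in place of Ruby's `Marshal`; the class and URL names are hypothetical, not Code Climate's:

```python
# A cached object that embeds a service URL keeps pointing at the old
# server after the repository is migrated, because the cache entry is
# never invalidated.
import pickle

class CachedBlobRef:
    """Illustrative stand-in for a marshaled analysis-cache object."""
    def __init__(self, git_service_url: str, blob_sha: str):
        self.git_service_url = git_service_url  # frozen at cache time
        self.blob_sha = blob_sha

    def cat_file(self) -> str:
        # An RPC would be issued to self.git_service_url -- the URL
        # captured at serialization time, not the repo's current home.
        return f"cat-file {self.blob_sha} via {self.git_service_url}"

# Cache the object while the repo lives on git-01...
cached = pickle.dumps(CachedBlobRef("http://git-01.internal", "abc123"))

# ...later, the repo is migrated to git-02, but the cache entry is
# untouched. Deserializing it still yields the stale URL:
ref = pickle.loads(cached)
print(ref.cat_file())  # still issues cat-file calls to git-01.internal
```

The lesson generalizes: anything serialized into a cache (URLs, hostnames, credentials) is a snapshot, so migrations need either cache invalidation or indirection through a lookup performed at read time.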