Comments, pictures, and links.

Fri 17 Apr 2015
  • Stack Exchange Network Status — Outage Postmortem: January 6th, 2015
    "With no way to get our main IP addresses accessible to most users, our options were to either fail over to our DR datacenter in read-only mode, or to enable CloudFlare - we’ve been testing using them for DDoS mitigation, and have separate ISP links in the NY datacenter which are dedicated to traffic from them. We decided to turn on CloudFlare, which caused a different problem - caused by our past selves."
  • DripStat — Post mortem of yesterday's outage
    "1. RackSpace had an outage in their Northern Virginia region. 2. We were getting DDOS’d. 3. The hypervisor Rackspace deployed our cloud server on was running into issue and would keep killing our java process. We were able to diagnose 2 and 3 only after Rackspace recovered from their long load balancer outage. The fact that all 3 happened at the same time did not help issues either."
  • Blog - Tideways
    "On Wednesday 6:05 Europe/Berlin time our Elasticsearch cluster went down when it ran OutOfMemory and file descriptors. One node of the cluster did not recover from this error anymore and the other responded to queries with failure. The workers processing the performance and trace event log data with Beanstalk message queue stopped. "
  • CopperEgg Status - Probe widgets not polling data
    "the primary of a redundant pair of data servers for one of our customer data clusters locked up hard in Amazon. An operations engineer responded to a pager alert and ensured failover had worked as designed; there was a brief period of probe delay on that cluster from the initial failover but service was only briefly interrupted and then the system was working fine. The failed server had to be hard rebooted and when it was, its data was corrupted and the server had to be rebuilt, and was then set up to resync its data with the live server. A manual error was made during the rebuild and replication was set up in an infinite loop. "
  • Freckle Time Tracking Status - Freckle is down
    "The underlying reason why nginx didn't start was that DNS was not working properly—nginx checks SSL certificates and it couldn't resolve one of the hosts needed to verify our main SSL certificate. We don't know why DNS didn't resolve, but it's likely that to the large number of booted servers in the Rackspace datacenter there was a temporary problem with DNS resolution requests.)"
  • Dead Man's Snitch — Postmortem: March 6th, 2015
    "On Friday, March 6th we had a major outage caused by a loss of historical data. During the outage we failed to alert on missed snitch check-ins and sent a large number of erroneous failure alerts for healthy snitches. It took 8 hours to restore or reconstruct all missing data and get our systems stabilized."
  • Travis CI Status - Slow .com build processing
    "Two runaway TLS connections inside our primary RabbitMQ node that were causing high CPU usage on that node. Once this was found, we deemed the high channel count a red herring and instead started work on the stuck connections."
  • Travis CI Status - Slow .com build processing
    "We looked at our metrics and quickly realised that our RabbitMQ instance had gone offline at 17:30 UTC. We tried to bring it back up, but it wouldn’t start up cleanly. One of the remediation actions after Tuesday’s RabbitMQ outage was to upgrade our cluster to run on more powerful servers, so we decided that instead of debugging why our current cluster wasn’t starting we’d perform emergency maintenance and spin up a new cluster."
  • Balanced Partial Outage Post Mortem - 2015-03-15
    Balanced experienced a partial outage that affected 25% of card processing transactions between 8:40 AM and 9:42 AM this morning, due to a degraded machine that was not correctly removed from the load balancer. The core of the issue was in our secure vault system, which handles storage and retrieval of sensitive card data. One of the machines stopped sending messages, which caused some requests to be queued up but not processed, yet our automated health checks did not flag the machine as unhealthy.
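    The failure mode in the Balanced item (a node that stays up but silently stops draining its queue) is exactly what a liveness-only health check misses. A minimal Python sketch of a progress-aware check; the counters, thresholds, and names are illustrative assumptions, not Balanced's actual implementation:

      import time

      class WorkerStats:
          """Counters a worker would update as it accepts and handles messages."""
          def __init__(self):
              self.enqueued = 0
              self.processed = 0
              self.last_processed_at = time.monotonic()

          def record_enqueue(self):
              self.enqueued += 1

          def record_processed(self):
              self.processed += 1
              self.last_processed_at = time.monotonic()

      def health_check(stats, max_backlog=100, max_idle_seconds=30.0):
          """Healthy only if the worker is actually making progress on its queue."""
          backlog = stats.enqueued - stats.processed
          idle = time.monotonic() - stats.last_processed_at
          if backlog > max_backlog:
              return False   # requests are queuing up faster than they are processed
          if backlog > 0 and idle > max_idle_seconds:
              return False   # there is pending work but nothing has completed recently
          return True

    A load balancer polling a check like this would have ejected the degraded machine once its backlog stopped moving, which is the gap the post mortem points at.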
Wed 04 Mar 2015
  • Postmortem: Storify downtime on March 2nd (with image) · storifydev · Storify
    "The problem was that we had one dropped index in our application code. This meant that whenever the new primary took the lead, the application asked to build that index. It was happening in the background, so it was kind of ok for the primary. But as soon as the primary finished, all the secondaries started building it in the foreground, which meant that our application couldn't reach MongoDB anymore."
Fri 20 Feb 2015
  • GCE instances are not reachable
    "ROOT CAUSE [PRELIMINARY] The internal software system which programs GCE’s virtual network for VM egress traffic stopped issuing updated routing information. The cause of this interruption is still under active investigation. Cached route information provided a defense in depth against missing updates, but GCE VM egress traffic started to be dropped as the cached routes expired. "
Wed 04 Feb 2015
  • A Note on Recent Downtime (Pinboard Blog)
    "Of course I was wrong about that, and my web hosts pulled the plug early in the morning on the 2nd. Bookmarks and archives were not affected, but I neglected to do a final sync of notes (notes in pinboard are saved as files). This meant about 20 users who created or edited notes between December 31 and Jan 2 lost those notes."
  • Recent Bounciness And When It Will Stop (Pinboard Blog)
    "Over the past week there have been a number of outages, ranging in length from a few seconds to a couple of hours. Until recently, Pinboard has had a good track record of uptime, and like my users I find this turn of events distressing. I'd like to share what I know so far about the problem, and what steps I'm taking to fix it."
  • Outage This Morning (Pinboard Blog)
    "The root cause of the outage appears to have been a disk error. The server entered a state where nothing could write to disk, crashing the database. We were able to reboot the server, but then had to wait a long time for it to repair the filesystem."
  • Second Outage (Pinboard Blog)
    "The main filesystem on our web server suddenly went into read-only mode, crashing the database. Once again I moved all services to the backup machine while the main server went through its long disk check."
  • API Outage (Pinboard Blog)
    "Pinboard servers came under DDOS attack today and the colocation facility (Datacate) has insisted on taking the affected IP addresses offline for 48 hours. In my mind, this accomplishes the goal of the denial of service attack, but I am just a simple web admin. I've moved the main site to a secondary server and will do the same for the APi in the morning (European time) when there's less chance of me screwing it up. Until then the API will be unreachable."
  • A Bad Privacy Bug (Pinboard Blog)
    " tl;dr: because of poor input validation and a misdesigned schema, bookmarks could be saved in a way that made them look private to the ORM, but public to the database. Testing failed to catch the error because it was done from a non-standard account.. There are several changes I will make to prevent this class of problem from recurring: Coerce all values to the expected types at the time they are saved to the database, rather than higher in the call stack. Add assertions to the object loader so it complains to the error log if it sees unexpected values. Add checks to the templating code to prevent public bookmarks showing up under any circumstances on certain public-only pages. Run deployment tests from a non-privileged account."
Mon 02 Feb 2015
  • Facebook & Instagram API servers down
    Not much detail there. Config change? Security? See: https://blog.thousandeyes.com/facebook-outage-deep-dive/ Also: "Facebook Inc. on Tuesday denied being the victim of a hacking attack and said its site and photo-sharing app Instagram had suffered an outage after it introduced a configuration change."