Postmortem: Storify downtime on March 2nd (with image) · storifydev · Storify
"The problem was that we had one dropped index in our application code. This meant that whenever the new primary took the lead, the application asked to build that index. It was happening in the background, so it was kind of ok for the primary. But as soon as the primary finished, all the secondaries started building it in the foreground, which meant that our application couldn't reach MongoDB anymore."
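The mechanics quoted above were MongoDB's pre-4.2 behavior: an index built in the background on the primary replicates to the secondaries, which then build it in the foreground, holding a lock that blocks reads for the duration of the build. (MongoDB 4.2 later reworked index builds to remove this foreground/background split.) A toy stdlib sketch of the difference; all names are illustrative, and this is not MongoDB code:

```python
import threading

class Member:
    """Toy replica-set member. A foreground build holds the read lock for
    the entire build; a background build releases it between batches, so
    reads can interleave. (Illustrative only -- not MongoDB internals.)"""

    def __init__(self):
        self.read_lock = threading.Lock()

    def can_read(self):
        # A read succeeds only if the lock is free at this instant.
        if self.read_lock.acquire(blocking=False):
            self.read_lock.release()
            return True
        return False

    def build_index(self, docs, foreground, probe=None):
        if foreground:
            # One long critical section: every read issued during the
            # build fails -- what the secondaries did to the application.
            with self.read_lock:
                if probe is not None:
                    probe.append(self.can_read())
                return sorted(docs)
        # Background: batched, with the lock released between batches.
        index = []
        for i in range(0, len(docs), 2):
            with self.read_lock:
                index.extend(docs[i:i + 2])
            if probe is not None:
                probe.append(self.can_read())
        return sorted(index)
```

In the foreground case, a `probe` read attempted mid-build fails; in the background case the same probe succeeds between batches, which is why the primary stayed "kind of ok" while the secondaries went dark.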
GCE instances are not reachable
"ROOT CAUSE [PRELIMINARY] The internal software system which programs GCE’s virtual network for VM egress traffic stopped issuing updated routing information. The cause of this interruption is still under active investigation. Cached route information provided a defense in depth against missing updates, but GCE VM egress traffic started to be dropped as the cached routes expired. "
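The "cached routes as defense in depth" pattern described here is easy to state in code: stale entries keep serving traffic after the control plane goes quiet, but only until their TTL lapses, at which point lookups fail and packets are dropped. A minimal sketch under assumed names and an assumed TTL (this is not Google's actual design):

```python
import time

class RouteCache:
    """Toy egress-route cache: serves routes programmed by a control
    plane, and keeps serving a cached copy after updates stop -- but
    only within a TTL. Once the TTL lapses with no fresh updates,
    lookups fail, mirroring the failure mode in the GCE report.
    (All names and the TTL are illustrative.)"""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.routes = {}  # dest -> (next_hop, time_cached)

    def update(self, dest, next_hop):
        # Fresh routing information arriving from the control plane.
        self.routes[dest] = (next_hop, self.clock())

    def lookup(self, dest):
        entry = self.routes.get(dest)
        if entry is None:
            return None
        next_hop, cached_at = entry
        if self.clock() - cached_at > self.ttl:
            return None  # cached route expired: egress traffic is dropped
        return next_hop
```

An injectable `clock` makes the expiry behavior testable without real waiting: program a route, stop calling `update`, advance the clock past the TTL, and the lookup that previously succeeded starts returning `None`.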
FAQ about the recent FBI raid (Pinboard Blog)
"Why did the FBI take a Pinboard server? I don't know. As best I can tell, the FBI was after someone else whose server was in physical proximity to ours. "
A Note on Recent Downtime (Pinboard Blog)
"Of course I was wrong about that, and my web hosts pulled the plug early in the morning on the 2nd. Bookmarks and archives were not affected, but I neglected to do a final sync of notes (notes in pinboard are saved as files). This meant about 20 users who created or edited notes between December 31 and Jan 2 lost those notes."
Recent Bounciness And When It Will Stop (Pinboard Blog)
"Over the past week there have been a number of outages, ranging in length from a few seconds to a couple of hours. Until recently, Pinboard has had a good track record of uptime, and like my users I find this turn of events distressing. I'd like to share what I know so far about the problem, and what steps I'm taking to fix it."
Outage This Morning (Pinboard Blog)
"The root cause of the outage appears to have been a disk error. The server entered a state where nothing could write to disk, crashing the database. We were able to reboot the server, but then had to wait a long time for it to repair the filesystem."
Second Outage (Pinboard Blog)
"The main filesystem on our web server suddenly went into read-only mode, crashing the database. Once again I moved all services to the backup machine while the main server went through its long disk check."
API Outage (Pinboard Blog)
"Pinboard servers came under DDOS attack today and the colocation facility (Datacate) has insisted on taking the affected IP addresses offline for 48 hours. In my mind, this accomplishes the goal of the denial of service attack, but I am just a simple web admin. I've moved the main site to a secondary server and will do the same for the API in the morning (European time) when there's less chance of me screwing it up. Until then the API will be unreachable."
A Bad Privacy Bug (Pinboard Blog)
"tl;dr: because of poor input validation and a misdesigned schema, bookmarks could be saved in a way that made them look private to the ORM, but public to the database. Testing failed to catch the error because it was done from a non-standard account. There are several changes I will make to prevent this class of problem from recurring: coerce all values to the expected types at the time they are saved to the database, rather than higher in the call stack; add assertions to the object loader so it complains to the error log if it sees unexpected values; add checks to the templating code to prevent public bookmarks showing up under any circumstances on certain public-only pages; and run deployment tests from a non-privileged account."
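The first two fixes in that list, coercing values at save time and asserting on load, can be sketched in a few lines. A minimal illustration with an assumed two-column schema (the field names, the 0/1 encoding of `private`, and the helpers are hypothetical, not Pinboard's code):

```python
import logging

logger = logging.getLogger("bookmarks")

# Expected column types; coercion happens at save time, at the database
# boundary, rather than higher in the call stack.
SCHEMA = {"url": str, "private": int}  # private stored as 0/1 in the DB

def save_bookmark(row):
    """Coerce every value to its schema type before it reaches the
    database, so values like 'true' or None can't be stored in a shape
    the ORM and the database interpret differently."""
    coerced = {}
    for field, expected in SCHEMA.items():
        value = row.get(field)
        if expected is int:
            # Anything not an affirmative flag collapses to 0 (public).
            coerced[field] = 1 if value in (1, "1", True, "true") else 0
        else:
            coerced[field] = "" if value is None else str(value)
    return coerced

def load_bookmark(row):
    """Object loader: complain to the error log if the database hands
    back a value outside the expected domain, instead of silently
    reinterpreting it."""
    if row.get("private") not in (0, 1):
        logger.error("unexpected 'private' value: %r", row.get("private"))
    return {"url": row["url"], "private": bool(row["private"])}
```

The point of doing the coercion at the save boundary is that every write path, including ones added later, passes through it, while a check higher in the call stack protects only the paths that remember to call it.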
Facebook & Instagram API servers down
Not much detail from Facebook itself. Config change? Security incident? See https://blog.thousandeyes.com/facebook-outage-deep-dive/ for an external analysis. Facebook's own statement: "Facebook Inc. on Tuesday denied being the victim of a hacking attack and said its site and photo-sharing app Instagram had suffered an outage after it introduced a configuration change."
Final Root Cause Analysis and Improvement Areas: Nov 18 Azure Storage Service Interruption | Microsoft Azure Blog
"1. The standard flighting deployment policy of incrementally deploying changes across small slices was not followed. [...] 2. Although validation in test and pre-production had been done against Azure Table storage Front-Ends, the configuration switch was incorrectly enabled for Azure Blob storage Front-Ends."