There are always things to learn about distributed systems, especially how they can turn against you. Companies that publish postmortems are doing us a great favor: there is a jackpot of system design and operations knowledge to be gleaned from studying as many of these as you can get your hands on.
I’ve collected (now over 350) outage and security-related postmortems in this Pinboard feed.
There’s no shortage of human-error examples in the collection. Better still, there are many interesting (sometimes gripping) stories ranging from monitoring loops gone wild to freak hardware incidents to creeping issues undetectable in testing. Even an FBI raid.
I especially like to read the postmortems with “what went right” sections: a reminder of the constant operations vigilance and often invisible work that goes into keeping large-scale services healthy. We shouldn’t limit ourselves to learning only from mistakes, after all.
Which brings up another point (which I’d hope is self-evident, but this is the internet, so disclaimer time): I don’t collect these in order to shame anyone. If you’ve ever been involved in production services, you know how much effort and attention goes into both the good and the (inevitable) bad days. Instead of judging any of the decisions made in these situations, I’d just like to say thank you for being so open with us.
If you have any good ones to share, I’d love to read and tag them. (Email or tweet at me, thanks.)