Comments, pictures, and links.
EC2 Maintenance Update II
I'd like to give you an update on the EC2 Maintenance announcement that I posted last week. Late yesterday (September 30th), we completed a reboot of less than 10% of the EC2 fleet to protect you from any security risks associated with the Xen Security Advisory (XSA-108). This Xen Security Advisory was embargoed until a few minutes ago; we were obligated to keep all information about the issue confidential until it was published. The Xen community (in which we are active participants) has designed a two-stage disclosure process that operates as follows:
freistil IT » Post mortem: Network issues last week
"our monitoring system started at about 10:10 UTC to alert us of network packet loss levels of 50% to 100% with a number of servers and a lot of failing service checks, which most of the times is a symptom of connectivity problems. We recognized quickly that most of the servers with bad connectivity were located in Hetzner datacenter #10. We also received Twitter posts from Hetzner customers whose servers were running in DC #10. This suggested a problem with a central network component, most probably a router or distribution switch."
Post Mortem - City Cloud
"In a few minutes two nodes of two different replicating pairs experienced network failures. Still not a problem due to Gluster redundancy but clearly a sign of something not being right. While in discussions with Gluster to identify the cause one more node experiences network failure. This time in one of the pairs that already has a node offline. This causes all data located on that pair to become unavailable. "
Fog Creek System Status: May 5-6 Network Maintenance Post-Mortem
"During the process of rearchitecting our switch fabric's spanning tree (moving from a more control-centric per-vlan spanning tree to a faster-failover rapid spanning tree, ironically to keep downtime to a minimum), we suddenly lost access to our equipment."
Google App Engine Issues With Datastore OverQuota Errors Beginning August 5th, 2014 - Google Groups
SUMMARY: On Tuesday 5 August and Wednesday 6 August 2014, some billed applications incorrectly received quota exceeded errors for a small number of requests. We sincerely apologize if your application was affected.
DETAILED DESCRIPTION OF IMPACT: Between Tuesday 5 August 11:39 and Wednesday 6 August 19:05 US/Pacific, some applications incorrectly received quota exceeded errors. The incident predominantly affected Datastore API calls. On Tuesday 5 August, 0.2% of applications using the Datastore received some incorrect quota exceeded errors. On Wednesday 6 August, 0.8% of applications using the Datastore received some incorrect quota exceeded errors. On Tuesday 5 August, 0.001% of Datastore API calls failed with quota exceeded for affected applications. On Wednesday 6 August, 0.0005% of Datastore API calls failed for affected applications.
ROOT CAUSE: The root cause of this incident was a transient failure of the component that handles quota checking. The component has been corrected.
REMEDIATION AND PREVENTION: The incident was resolved when the issue that caused the transient errors went away. To prevent a recurrence of similar incidents, we have enabled additional logging in the affected components so that we can more quickly diagnose and resolve similar issues.
The Upload Outage of July 29, 2014 « Strava Engineering
"Although the range of signed integers goes from -2147483648 to 2147483647, only the positive portion of that range is available for auto-incrementing keys. At 15:10, the upper limit was hit and insertions into the table started failing."
BBC Online Outage on Saturday 19th July 2014
"At 9.30 on Saturday morning (19th July 2014) the load on the database went through the roof, meaning that many requests for metadata to the application servers started to fail. The immediate impact of this depended on how each product uses that data. In many cases the metadata is cached at the product level, and can continue to serve content while attempting to revalidate. In some cases (mostly older applications), the metadata is used directly, and so those products started to fail." "At almost the same time we had a second problem."