Google Compute Engine and Predictable Performance

I raised my eyebrows at one statement Google is making about Google Compute Engine:

Deploy your applications on an infrastructure that provides consistent performance. Benefit from a system designed from the ground up to provide strong isolation of users’ actions. Use our consistently fast and dependable core technologies, such as our persistent block device, to store and host your data.

While many talk about how one IaaS solution will give you better performance than another, one of the more bothersome issues in clouds is whether or not an instance will give you consistent performance. This is especially true with I/O.

A lot of this performance consistency problem is due to the “noisy neighbor” issue. IaaS solutions typically have some kind of multi-tenant support: multiple isolated containers (VM instances, zones, etc.) on each physical server. The underlying kernel/hypervisor is responsible for cutting each tenant off at the proper times to make sure the raw resources are shared correctly (according to whatever policy is appropriate).
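To make "shared correctly according to a policy" concrete, here is a minimal sketch of one possible policy, weighted max-min fairness. It is not how any particular hypervisor schedules guests; the function name, the tenant names, and the numbers are all made up for illustration.

```python
def weighted_fair_share(capacity, demands, weights):
    """Split `capacity` among tenants using weighted max-min fairness.

    A tenant that wants less than its weighted share gets exactly what it
    asks for; the leftover is redistributed among the still-hungry tenants.
    """
    alloc = {tenant: 0.0 for tenant in demands}
    active = {tenant for tenant in demands if demands[tenant] > 0}
    remaining = float(capacity)

    while active and remaining > 1e-12:
        total_weight = sum(weights[t] for t in active)
        # Tenants whose full demand fits inside their current weighted share.
        satisfied = {
            t for t in active
            if demands[t] <= remaining * weights[t] / total_weight
        }
        if not satisfied:
            # Everyone left wants more than their share: cut them off there.
            for t in active:
                alloc[t] = remaining * weights[t] / total_weight
            break
        for t in satisfied:
            alloc[t] = float(demands[t])
            remaining -= demands[t]
        active -= satisfied

    return alloc


# Hypothetical host with 100 units of CPU and one modest, two greedy tenants.
shares = weighted_fair_share(
    capacity=100,
    demands={"quiet": 10, "greedy_a": 80, "greedy_b": 80},
    weights={"quiet": 1, "greedy_a": 1, "greedy_b": 1},
)
```

In this toy model the quiet tenant always gets what it asked for no matter how much the greedy ones demand. The hard part in practice is enforcing something like this cheaply for I/O and networking, not writing down the policy.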

AWS, while nailing many things, has struggled with this. I’ve heard from many users that they’re running performance tests on every EC2 instance they create in order to see if the neighbor situation looks good. This only gets you so far, of course: a particularly greedy neighbor could be provisioned on the same physical node at a later time.
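A check like that might look roughly like the sketch below: time the same workload several times on a fresh instance and reject the instance if the run-to-run variation is too high. The function names and the 10% threshold are assumptions for illustration, not anything these users described to me.

```python
import statistics
import time


def time_runs(workload, runs=10):
    """Run `workload` repeatedly and return the wall-clock time of each run."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        workload()
        samples.append(time.perf_counter() - start)
    return samples


def coefficient_of_variation(samples):
    """Standard deviation divided by the mean: a unitless measure of jitter."""
    mean = statistics.mean(samples)
    if mean == 0:
        return 0.0
    return statistics.stdev(samples) / mean


def looks_consistent(samples, max_cv=0.10):
    """Accept the instance only if timings vary by at most max_cv (assumed 10%)."""
    return coefficient_of_variation(samples) <= max_cv
```

Usage would be something like `looks_consistent(time_runs(my_io_benchmark))`, where `my_io_benchmark` is whatever workload you care about. It only tells you about the neighbors you have right now.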

Taking the concept further, I’ve been in a few conversations where the suggestion is to play “whack-a-mole” and constantly monitor the relative performance, steal time, etc., and move things around whenever it’s necessary. (That sounds like a great CS paper, but stepping back… that’s just kind of weird and crazy to me if this is the best we can do.)
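For reference, the steal time part of that monitoring is easy to sketch on a Linux guest. The aggregate `cpu` line of `/proc/stat` lists cumulative jiffies in the order user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice, so the steal percentage over an interval comes from two samples. The helper names below are my own, and what threshold should trigger a move is left open.

```python
def parse_cpu_line(line):
    """Return (steal, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in fields[1:]]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice
    steal = values[7] if len(values) > 7 else 0
    # guest and guest_nice are already counted inside user and nice,
    # so only the first eight fields go into the total.
    total = sum(values[:8])
    return steal, total


def steal_percent(before, after):
    """Percentage of CPU time stolen by the hypervisor between two samples."""
    steal_before, total_before = parse_cpu_line(before)
    steal_after, total_after = parse_cpu_line(after)
    elapsed = total_after - total_before
    if elapsed <= 0:
        return 0.0
    return 100.0 * (steal_after - steal_before) / elapsed


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat (Linux only)."""
    with open(path) as stat_file:
        return stat_file.readline()
```

You would call `read_cpu_line()` twice, a few seconds apart, and pass both lines to `steal_percent`. A persistently high value suggests the hypervisor is giving your CPU time to someone else.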

The best approach on most clouds (except Joyent, which claims to have a better situation) is therefore to use the biggest instances, if you can afford them. These will take up either half or all of the typical ~64-70GB of RAM in the servers underlying the VM: no neighbors, no problems. Other kinds of “neighbors” are still an issue, though, such as when you’re using a centralized, network-based disk.

So how serious is Google in the opening quote above? What different technology is being used on GCE?

A Google employee (who does not work on the GCE team but who I assume is fairly reporting from the Google I/O conference) tweeted the following:

Google compute is based on KVM Linux VMs. Storage: local ephemeral, network block, google storage #io12

KVM.

Years ago, we investigated various techniques we could use in the Nimbus IaaS stack to guarantee that guests only used a given percentage of CPU and network bandwidth while also allowing colocated guests to enjoy their own quota. Pure CPU workloads fared well against “hostile” CPU-based workloads. But once you introduced networking, the situation was very bad.

The key to these investigations is introducing pathologically hostile neighbors and seeing what you can actually guarantee to other guests, including all of the overhead that goes into accounting and enforcement.

That was on Xen, and it’s not something the Xen community was ignoring; it’s just a hard problem. Since then, I’ve seen the techniques and the Xen guest schedulers improve.

But I haven’t seen much attention to this in KVM (though I admit I haven’t had the focus on this area that I had in the past).

So we have this situation:

AWS uses Xen.
AWS and Xen historically have issues with noisy neighbors.
Google uses KVM, not historically known for strong resource isolation.
Google is claiming consistent performance as a strong selling point.

Do they have their own branch, a new technique? Are they actually running SmartOS zones + KVM? I’m really curious what is happening here. Surely they’ve seen this has been an issue for people on AWS for years and would not make such a bold claim without testing the hell out of it, right?

Another thing they’re claiming is a “consistently fast and dependable” network block device. Given the a priori failure mode problems of these solutions, I’m doubly curious.

UPDATE: This talk from Joe Beda has some new information (slide 14: Linux cgroups). I also heard via @lusis that they worked with Red Hat on this.

UPDATE: comment from Joe Beda:

“We are obviously worried about cascading failures and failure modes in general. Our industry, as a whole, has more work to do here. This is an enormously difficult problem and I’m not going to start throwing rocks.

That being said, I can tell you that our architecture for our block store is fundamentally different from what we can guess others are doing and, I think, provides benefits in these situations. We can take advantage of some of the great storage infrastructure at Google (BigTable, Colossus) and build on that. Our datacenters are really built to fit these software systems well.”