Core Concepts: Virtual Cluster Appliances

(Originally posted on a shared blog that is now offline)

In part two of the series, I want to flash forward to the present to discuss some recent challenges we are addressing for users of the Workspace Service. There will be plenty of time to talk about more fundamental things like "why are people even interested in using VMs for grid computing?" (for now, you could have a look at this paper for some general arguments).

To the fun stuff. Virtual Cluster Appliances are VMs that automatically and securely work together in new contexts. A virtual appliance is commonly defined as a VM that can be downloaded and used for a specific purpose with minimum intervention after booting locally. We could say then that a virtual cluster appliance is a set of VMs that can be deployed and used for a specific purpose with minimum (most usefully ZERO) intervention after booting.

Say you would like to start a VM-based cluster and treat it like a new, dynamically-added site on your grid. This will probably have several specialized nodes to do cross-network/organizational work and any number of "work" nodes to run jobs.

A very simple example cluster might have a head node VM image with GRAM, Torque, and GridFTP, plus N compute nodes running Torque's agent, all started from copies of the same VM image.

Suppose you have already obtained or created some VM template images, you have access to hypervisors (and time reserved on them), and all of the middleware is in place to deploy everything securely and correctly. (As we will discuss in the series, all of these things might be safe to compartmentalize conceptually, but they are not trivial to get "right".)

But even with all of this in place, how do you start these disparate VMs as a working cluster at a new grid site? Some issues:

- The network addresses assigned to the VMs will in all likelihood be different on different deployment sites, so any configuration based on IPs or hostnames will not work. How does the shared filesystem know which hosts to authorize? How does the resource manager (job scheduler) know where to dispatch jobs? What about X509 certificates that need hostnames embedded in their subjects? Keep adding questions in this vein as the cluster topology becomes more complex.
- You could conceivably try to allow the same image bits to work anywhere by using DNS facades and IP tunnels on the "outside" of the VMs. But this runs into a number of problems, the main ones being performance and the fact that in many applications the local network addresses that the VM "sees" are placed high in the stack of many messages. Using NAT or ALG technologies would be cumbersome, would require a lot of per-application special casing, or would not work at all in many cases (for example when message payloads are signed or encrypted).
- There will often be site-provided services a) that the VM will need to know how to contact, b) that the VM will need credentials in order to contact, and c) that the VM will itself need to authenticate once contacted.
- There are many cases where a particular onboard policy or configuration is not related to the network or the current deployment site, but it is still very helpful to be able to start different clusters with different information specified at deployment time.
  - For example, trusted VM template images could be made available to a large number of users who "personalize" them at deployment, such as by only allowing the correct entities to submit work. This could be described as a VM being "adapted to an organization."
  - Another example: with contextualization technology, the very roles the VM plays can be picked at deployment time, and the same VM image (equal bits) can differentiate into different things after booting based on the instructions provided by the client.

The main idea behind the core contextualization technology is not hard to get. When deploying a workspace that is involved in contextualization, a piece of metadata is included that specifies labels which we call roles. A role is either provided or required. Specific roles are opaque to the contextualization infrastructure which only treats them as unique keys. The only place they are meaningful is inside the VM and when the provided metadata is constructed.

Let's use Torque as an example. The two roles could be named, for example, "torquemaster" and "torqueslave". You deploy one workspace that provides the "torquemaster" role and 100 that provide the "torqueslave" role. Requirements mirror this: each of the slaves requires knowledge of the torquemaster role, and the master in turn requires knowledge of the slaves.
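To make the provides/requires idea concrete, here is a minimal sketch in Python of what the role metadata for this Torque cluster could look like. This is not the actual Workspace Service metadata format: the WorkspaceRoles class, the workspace names, and the unmet_requirements helper are all made up for illustration. The only thing taken from above is that roles are opaque labels that a workspace either provides or requires.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkspaceRoles:
    """Hypothetical per-workspace metadata: roles are just opaque string keys."""
    name: str
    provides: frozenset
    requires: frozenset


def torque_cluster(n_slaves):
    """Build metadata for one master and n_slaves slaves (illustrative names)."""
    master = WorkspaceRoles(
        name="master",
        provides=frozenset({"torquemaster"}),
        requires=frozenset({"torqueslave"}),
    )
    slaves = [
        WorkspaceRoles(
            name=f"slave-{i}",
            provides=frozenset({"torqueslave"}),
            requires=frozenset({"torquemaster"}),
        )
        for i in range(n_slaves)
    ]
    return [master] + slaves


def unmet_requirements(workspaces):
    """Map workspace name -> required roles that no workspace in the set provides."""
    provided = set()
    for ws in workspaces:
        provided |= ws.provides
    return {
        ws.name: ws.requires - provided
        for ws in workspaces
        if ws.requires - provided
    }
```

Calling torque_cluster(100) gives 101 entries whose requirements are all satisfiable within the set. Drop the master from the list and every slave shows up in unmet_requirements, which is the kind of check a client might want to run before deploying anything.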

As you initially deploy different VMs, the metadata about their required and provided roles is added to a secure context. This context serves as a "blackboard" for incoming information that fills in the "answers" to these requirements.

As information becomes available from the deployment infrastructure (e.g. network addresses), from the VMs themselves (e.g. the individual SSHd public host keys generated at boot, which are needed in the Torque case), or from the grid client (e.g. grid-mapfile contents), it is added to the context. Each VM then retrieves from the context service the information specifically tailored for it.
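A toy in-memory version of the "blackboard" might look like the sketch below. The real context service is a remote, secured service, and none of these class or method names come from it; Context, register, publish, and retrieve are assumptions made only to show the flow. Each VM registers the roles it provides and requires, facts arrive from different sources over time, and each VM gets back a view limited to the roles it asked for.

```python
class Context:
    """Hypothetical in-memory stand-in for the secure context ("blackboard")."""

    def __init__(self):
        self._provides = {}  # vm_id -> set of role names this VM provides
        self._requires = {}  # vm_id -> set of role names this VM requires
        self._facts = {}     # vm_id -> dict of facts known about this VM so far

    def register(self, vm_id, provides=(), requires=()):
        """Record a VM's role metadata at deployment time."""
        self._provides[vm_id] = set(provides)
        self._requires[vm_id] = set(requires)
        self._facts[vm_id] = {}

    def publish(self, vm_id, **facts):
        """Add facts about a VM (from the infrastructure, the VM, or the client)."""
        self._facts[vm_id].update(facts)

    def retrieve(self, vm_id):
        """Return {required role: [facts of every VM providing that role]}."""
        view = {}
        for role in sorted(self._requires[vm_id]):
            view[role] = [
                dict(self._facts[other])
                for other in sorted(self._provides)
                if role in self._provides[other]
            ]
        return view


ctx = Context()
ctx.register("master", provides=["torquemaster"], requires=["torqueslave"])
ctx.register("slave-0", provides=["torqueslave"], requires=["torquemaster"])
ctx.register("slave-1", provides=["torqueslave"], requires=["torquemaster"])

# Facts trickle in from different sources (addresses and key are placeholders).
ctx.publish("master", ip="10.0.0.1")
ctx.publish("slave-0", ip="10.0.0.2")
ctx.publish("slave-1", ip="10.0.0.3")
ctx.publish("master", ssh_host_key="ssh-rsa PLACEHOLDER")

master_view = ctx.retrieve("master")
slave_view = ctx.retrieve("slave-0")
```

Here master_view lists the two slaves under "torqueslave", while slave_view contains only the master's address and host key under "torquemaster". The ordering and timing questions (what happens if a VM asks before all the answers are in) are deliberately ignored in this sketch.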

For example, the VM destined to be the Torque server contacts the context service and retrieves its list of Torque clients. The contextualization script on the VM looks for a local script called 'torqueclient' and invokes it with a known parameter list. In this manner, you can enable new roles on your VMs by adding a script, without needing to bother with any of the context service interaction. This is also why you can use one VM image to play several different roles: you can house config scripts for many roles on the VM, but only the subset named in the deployment metadata specification is 'activated'.
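The dispatch step on the VM could be sketched roughly as follows. This is an illustration, not the real on-VM context script. The parameter order in PARAM_ORDER, the one-call-per-peer behavior, and the function names are all assumptions; the text above only says that a local script named for the role is invoked with a known parameter list. The view argument is assumed to be a dict mapping each retrieved role name to a list of per-peer fact dicts.

```python
import os
import subprocess

# Hypothetical "known parameter list": the order in which a peer's facts are
# passed to a role script. The real ordering is not specified in this post.
PARAM_ORDER = ("ip", "hostname", "ssh_host_key")


def plan_invocations(script_dir, view):
    """Build one command line per (role script, peer) pair.

    Roles with no matching script in script_dir are skipped, so an image can
    carry scripts for many roles and only use the ones it receives data for.
    """
    commands = []
    for role, peers in sorted(view.items()):
        script = os.path.join(script_dir, role)
        if not os.path.isfile(script):
            continue  # no local script for this role on this image
        for peer in peers:
            args = [str(peer.get(key, "")) for key in PARAM_ORDER]
            commands.append([script] + args)
    return commands


def run_invocations(commands):
    """Run each planned role script, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

Splitting planning from execution keeps the interesting part (which scripts get called with which arguments) easy to inspect without running anything. The role scripts themselves never need to know that a context service exists.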

I left out a lot of the mechanics and ordering here, which will be covered in subsequent entries. Right off the bat you may be asking yourself "how does this all happen safely without trusted keys set up out of band?" The answer is that it doesn't: that trust is set up before the call to the context service. You may also be asking "isn't it bad to require the VM to be rigged with a context service address to contact ahead of time?" The answer is that the address isn't actually rigged into the image but supplied at boot.

The crux of the solution for these issues is that a) one can bootstrap VM deployments with enough information to set up secure channels and b) one can do this bootstrapping at deployment time thus not interfering with the very powerful mechanism of using secret-less, policy-less VM templates. We'll look at these topics in future entries.

The current implementation is being used to start 100-node clusters deployed by the STAR community on Amazon's Elastic Compute Cloud (EC2). The on-VM context script is a standalone Python file that is slim on dependencies, so it should suit all but the barest VMs. As Kate mentions in the recent TP1.3 release announcement, we are working on including this technology in an upcoming release!