DRBD, LVM, GNBD, and Xen for a free and reliable SAN

January 26, 2007

At home, I wanted a reliable disk solution for backups and also wanted a big, blank and resizable storage system for virtual machines. I knew I wanted to be able to get at the shared disk remotely from other nodes and wanted to be able to replace broken hardware quickly if something failed. I also didn’t want to spend a lot of time reconfiguring OSs and software in the case of a total system failure.

I have two cheap computers and so I put some big disks in them and mirrored the disks over the network. Instead of using one file server node and RAID1, this is something like a “whole system RAID”. If anything at all breaks in either computer, hosted services can keep running and data is unharmed except for whatever was unsynced in RAM.

To accomplish the disk mirror I used DRBD. DRBD is a special block device designed for highly available clusters: it mirrors activity directly at the block device level across the network to another disk, much like a RAID1 configuration over the network. It lets you build something like the shared storage devices on a SAN, but without any special hardware. This provides the basic reliability layer.

diagram: two hosts mirrored with DRBD over crossover cable

Linux Logical Volume Management (LVM) is a popular tool that lets you flexibly manage disk space. Instead of just partitioning the disk, using LVM lets us do on-the-fly logical partition resizing, snapshots (including hosting snapshot+diffs), and adding more physical disks into the volume group as needs grow (you can even resize a logical partition across multiple underlying disks). Each logical partition is formatted with a filesystem of its own. Using LVM avoided some future headaches I think.
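To make the resizing and snapshot points a little more concrete, here is a small Python sketch that only builds the LVM command lines (it does not run them). The volume group name "vg0", the volume names and the sizes are made up for illustration and are not from my actual setup; lvcreate and lvextend are the standard LVM tools.

```python
import shlex

def lv_create(vg, name, size):
    """Command to create a new logical volume of the given size."""
    return ["lvcreate", "-L", size, "-n", name, vg]

def lv_snapshot(vg, origin, snap_name, size):
    """Command to snapshot an existing volume; size is the space reserved for changes."""
    return ["lvcreate", "-s", "-L", size, "-n", snap_name, f"/dev/{vg}/{origin}"]

def lv_grow(vg, name, extra):
    """Command to grow a logical volume by `extra` (e.g. "5G")."""
    return ["lvextend", "-L", f"+{extra}", f"/dev/{vg}/{name}"]

# Hypothetical names and sizes, printed for review rather than executed.
for cmd in (
    lv_create("vg0", "vmimage001", "10G"),
    lv_snapshot("vg0", "vmimage001", "vmimage001-snap", "1G"),
    lv_grow("vg0", "vmimage001", "5G"),
):
    print(shlex.join(cmd))
```

Note that growing the logical volume does not grow the filesystem inside it; that is a separate step with the filesystem's own tool (resize2fs for ext2/ext3, for example).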

That is how the disk is set up. Now, how to access it remotely? You could run a shared filesystem of course, exporting via an NFS server on host A (or B). Instead, having heard good things about Global Network Block Device (GNBD) on the Xen mailing lists, I chose to export the logical block devices (from LVM) directly over the network with GNBD. Another node makes a GNBD import and the block device appears as a local block device there, ready to mount into the file hierarchy. This is like iSCSI, but it is a snap to set up and use.

And if that other node is a Xen domain 0, that block device is very handily ready to be used as a VM image, just as if it was a raw partition on that node.
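As a rough sketch of the export/import step, the following Python only assembles the two GNBD command lines and the device path that shows up on the importing side. The host name "hosta" and the volume group "vg0" are assumptions for illustration; the export name matches the /dev/gnbd/vmimage001 device used in the Xen example. The GNBD server daemon has to be running on the exporting node before the export will work.

```python
import shlex

def gnbd_export_cmd(device, export_name):
    """Command run on the storage node to export one block device by name."""
    return ["gnbd_export", "-d", device, "-e", export_name]

def gnbd_import_cmd(server):
    """Command run on the importing node; imports the exports offered by `server`."""
    return ["gnbd_import", "-i", server]

def imported_device_path(export_name):
    """Where the imported device appears on the importing node."""
    return "/dev/gnbd/" + export_name

# "vg0" and "hosta" are hypothetical names.
print(shlex.join(gnbd_export_cmd("/dev/vg0/vmimage001", "vmimage001")))
print(shlex.join(gnbd_import_cmd("hosta")))
print(imported_device_path("vmimage001"))
```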

diagram: one of the nodes of the disk array exporting an LVM partition over GNBD to a Xen dom0

Here’s an example Xen configuration using the imported block device:

disk = ['phy:/dev/gnbd/vmimage001,sda1,w']

The guest VM needs no awareness of all these tools. It just sees its sda1 and mounts it like anything else Xen presents to it as one of its “hardware” partitions.
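For context, here is roughly what a whole guest config file around that disk line might look like. Xen config files are plain Python syntax, so this is valid as written. Only the disk line comes from my setup; the guest name, memory size and kernel path are placeholders you would replace with your own.

```python
# Sketch of a Xen guest config; everything except `disk` is a placeholder.
name = "vm001"                       # hypothetical guest name
memory = 256                         # MB of RAM for the guest (assumed value)
kernel = "/boot/vmlinuz-2.6-xen"     # hypothetical path to a Xen guest kernel
vif = [""]                           # one network interface with default settings

# The GNBD-imported block device, presented to the guest as sda1, writable.
disk = ["phy:/dev/gnbd/vmimage001,sda1,w"]

# The guest mounts the device it was handed as its root filesystem.
root = "/dev/sda1 ro"
```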

Instead of just using the file store for backups and VMs that are used intermittently, I’m also running persistent services like websites, the incremental-backup server and a media server in VMs stored there.

First, this allows for basic backups of the LAN services without any backup software. That’s nice to have, although I really prefer a combination of incremental backups and RAID1. (Here we also avoid a Russell’s paradox situation with the backup server.)

Second, keeping time-consuming-to-configure services in a VM allows me to replace hardware quickly, including whole computers in the event of a total failure: the only software I’ll ever need to reinstall is {Linux, DRBD, LVM, GNBD} for a file server node and {Linux, Xen} for a VMM node.

As long as net latency is really low (here it is sub-ms) it doesn’t really matter that the disk is remote for any of my uses. The VMs always respond very well.

(I should mention: you could of course take GNBD out of the picture and run the VMs directly on hosts A and B if Xen were installed there.)

Another bonus: using GNBD, you can live-migrate the VM to any node that can do a GNBD import. This is nice to have. I only live migrate manually, though. Both DRBD and GNBD have some features that allow for seamless failover but I don’t really need any of this at home.

To learn more about that, check out this paper on the new DRBD (it is interesting): http://www.drbd.org/fileadmin/drbd/publications/drbd8_wpnr.pdf

Thinking about high availability in this kind of setup for a minute, a possible and simple-to-execute arrangement for services that need to be up at all times would be to take two DRBD-mirrored nodes, run VMs on one or both of them, and have the physical nodes heartbeat each other. This is a simpler approach than a centralized file server with block device export: here we just have two peer VMMs that are “watching out” for each other.

You’d have two master/slave arrangements (primary/secondary in DRBD terms). In the normal operating case, one VMM has partition A as DRBD primary and partition B as secondary, and the other VMM has partition A as secondary and partition B as primary. VMs run from a partition that is the DRBD primary on their node.

Let’s say you split four services into four VMs and put two VMs on each physical node. One physical node’s disk fails entirely and a monitor process notices. The heartbeat script makes sure the OK node is now the DRBD primary for both partitions A and B. Then it boots the two VMs previously hosted on the failed node on the OK node, re-allocating RAM for the time being to accommodate all four VMs.
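The decision the heartbeat script has to make is small enough to sketch. This is only an illustration of the logic, not a script I run: the node names, VM names and the layout table are invented, and the returned actions are plain tuples. In a real script each "promote" would map to something like drbdadm primary for that resource and each "boot" to an xm create for that guest, and the RAM re-allocation would need handling as well.

```python
# Hypothetical layout: each partition has a primary and a secondary node,
# and hosts the VMs that normally run on its primary.
LAYOUT = {
    "A": {"primary": "node1", "secondary": "node2", "vms": ["web", "backup"]},
    "B": {"primary": "node2", "secondary": "node1", "vms": ["media", "mail"]},
}

def failover_plan(layout, failed_node):
    """List the actions the surviving node takes when `failed_node` goes down.

    For every partition whose primary was the failed node, promote the
    secondary and boot that partition's VMs there.
    """
    actions = []
    for part, info in sorted(layout.items()):
        if info["primary"] != failed_node:
            continue  # this partition is already primary on the surviving node
        survivor = info["secondary"]
        actions.append(("promote", survivor, part))
        for vm in info["vms"]:
            actions.append(("boot", survivor, vm))
    return actions

for action in failover_plan(LAYOUT, "node1"):
    print(action)
```

Failing back once the disk is replaced would be the same idea in reverse: resync DRBD, then move the two VMs back and hand the primary role for their partition to the repaired node.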

diagram: 2 VMs migrate to the OK node

The applications in those VMs recover just as if they went through a normal hard system reset (their network addresses can stay the same since both physical nodes are on the same LAN). Once the administrator gets the alert email and puts a new disk in, another script is ready to resync DRBD and then migrate two of the VMs back to their normal place.

This seems like something to consider for a highly hammered and important head node (like a Globus GRAM node for example). All it takes is another node, commodity hardware and open source software!