HA Xen Cluster with DRBD, LVM and heartbeat

We have implemented a 2-node HA Xen cluster, consisting of two physical machines (hosts) that each run several virtual servers (guests) for our company's internal services (mail, web applications, development, etc.)

When one host goes down unexpectedly, the other host physically kills it (STONITH: power-down or reset) and then takes over all the guests the failed host was running.

When we want to shut down a host machine for maintenance (to replace a fan, add a disk or memory, etc.), we just type the usual shutdown command, and the guests are automatically live-migrated to the other host. Since the guest servers keep running throughout the migration, apart from a pause of less than a second, users never even notice the event.

Used Software

We use Debian because we use Debian for most of our servers. You can probably do the same thing with Red Hat, SuSE or something else.

Actually we considered using Ubuntu feisty and gutsy, since they have newer components, but their Xen-ified kernels (2.6.19 and 2.6.22) were too unstable to pass our tests.

Only Xen 3.1.2 and the official Xen kernel (2.6.18-xen) were stable enough to live through a 72+ hour stress test. We had to port an etch kernel patch to make one of our NICs work, and disabled CONFIG_BRIDGE_NETFILTER in the kernel configuration for stability.

DRBD 8.x is needed for the Primary/Primary configuration that live migration requires. It is not in the etch distribution, so we installed it from source.
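For reference, a Primary/Primary-capable resource definition in DRBD 8.x looks roughly like this. This is a sketch, not our actual configuration: the resource name, disks, port and split-brain policies are illustrative placeholders (only the device names and the 10.1.1.x addresses match the rest of this article).

```
# /etc/drbd.conf (fragment) - sketch of a dual-Primary resource
resource drbd1 {
    protocol C;                       # synchronous replication
    net {
        allow-two-primaries;          # required for Xen live migration
        after-sb-0pri discard-zero-changes;   # example split-brain policies
        after-sb-1pri discard-secondary;
    }
    on xen1 {
        device    /dev/drbd1;
        disk      /dev/sda2;
        address   10.1.1.1:7789;
        meta-disk internal;
    }
    on xen2 {
        device    /dev/drbd1;
        disk      /dev/sda2;
        address   10.1.1.2:7789;
        meta-disk internal;
    }
}
```

Note that allow-two-primaries merely permits both nodes to be Primary; it is still up to the cluster software (our script, below) to make sure only the migration window ever uses that state.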

Heartbeat is 1.x, not 2.x, because we are not in the habit of playing around with XML configuration files. We also needed to patch heartbeat to overcome the mysterious 41-second death at 5 minutes of uptime.

Hardware Configuration

Two almost identical machines with:

Note that the machines must work with the 2.6.18 kernel. We had to abandon a motherboard based on the G33 chipset.

The STONITH device is based on the AVR-USB PowerSwitch design, built on an 8-pin ATtiny45, and controls a photo MOS relay hooked up to the power or reset switch header on the motherboard.

Disk Configuration

[We call the two host machines xen1 and xen2.]

Each host has 2 disks, partitioned into 10GB of md RAID1 for the host OS's root partition, with the rest used for DRBD+LVM hosting the guests.

      /dev/md0       2 sets of DRBD+LVM
sda   sda1 (10GB)    sda2 (240GB) --> drbd1 --> vg1 (xen1 domUs)
sdb   sdb1 (10GB)    sdb2 (240GB) --> drbd2 --> vg2 (xen2 domUs)

sda2 and sdb2 of both hosts are mirrored by DRBD and accessed as /dev/drbd1 and /dev/drbd2 respectively. drbd1 and drbd2 in turn form two LVM volume groups, vg1 and vg2 respectively, and guest disks are allocated from them.
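Setting up this stack boils down to commands like the following. This is only a sketch of the layering (run as root on the appropriate host); partition creation, sizes and the guest LV names are illustrative assumptions, not a literal transcript of our installation.

```shell
# Mirror the root partitions with md RAID1 (on each host at install time).
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1

# Initialize DRBD metadata and bring both resources up (on both hosts).
drbdadm create-md drbd1
drbdadm create-md drbd2
drbdadm up drbd1
drbdadm up drbd2

# On xen1: force drbd1 Primary for the initial sync, then stack LVM on it.
drbdadm -- --overwrite-data-of-peer primary drbd1
pvcreate /dev/drbd1
vgcreate vg1 /dev/drbd1

# Allocate disks for a guest from the volume group (names are examples).
lvcreate -L 10G -n mail-disk vg1
lvcreate -L 1G  -n mail-swap vg1
```

The same steps apply to drbd2/vg2, with the initial Primary on xen2.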

Normally, drbd1 is Primary (readable/writable) on xen1 and Secondary (standby) on xen2, so it is safe to modify LVM metadata on xen1. Guests that normally run on xen1 are allocated in vg1 (on drbd1). Likewise, drbd2 holds vg2 and the guests that normally run on xen2.

When one host is down, the other host runs both DRBD devices, both volume groups, and all guests, under heartbeat's control.

Network Configuration

Each host has two GbE NICs. eth0 of both hosts is connected to the LAN, and the eth1 interfaces are connected to each other via a crossover cable.

DRBD, heartbeat and xend (live migration) run over eth1 (the crossover cable), to isolate DRBD traffic (for performance), to avoid a single point of failure (a direct cable is far less likely to fail than a switch), and to improve security.
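On Debian, the crossover link is just a static stanza in /etc/network/interfaces. A sketch (the 10.1.1.x addresses match the resource script below; the netmask is an illustrative assumption):

```
# /etc/network/interfaces (fragment) - eth1 is the direct crossover link
auto eth1
iface eth1 inet static
    address 10.1.1.1        # 10.1.1.2 on xen2
    netmask 255.255.255.0
```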

Heartbeat Resource Script

When heartbeat (at least the 1.x series) hands over a resource (Xen guests in our case) from one node to another, it simply stops the resource on one node, then starts it on the other.

To utilize Xen's live migration feature, we wrote a script that turns the start/stop requests from heartbeat into migrations whenever possible.

While live migration is taking place, DRBD and LVM must be active on both nodes (Primary/Primary). This, too, is something heartbeat cannot handle, so our script performs the DRBD and LVM activation and deactivation itself.

/etc/ha.d/resource.d/xendom

#!/bin/bash
set -e
SELF=10.1.1.1
PEER=10.1.1.2
if [ $(hostname) = xen2 ]; then
    SELF=10.1.1.2
    PEER=10.1.1.1
fi
SSH_OPTS="-o ConnectTimeout=15"
configfile=$1
command=$2
function usage {
    echo "Usage: $0 CFG start|stop|status"
    exit 1
}
if [ ! -r "$configfile" ]; then
    usage
fi
. $configfile
function is_alive {
    #xm list | grep -q "^$1 "
    xm list $1 >/dev/null 2>&1
}
function safe_to_migrate {
    case "$(drbdadm cstate $DRBD)" in
    Connected|SyncSource|SyncTarget)
        return 0
        ;;
    *)
        # Log before returning; an echo after "return" would never run.
        echo "$DRBD is disconnected, NOT safe to migrate"
        return 1
        ;;
    esac
}
function prepare_migration {
    echo "Preparing for migration:"
    ssh $SSH_OPTS $PEER "drbdadm primary $DRBD && vgscan && vgchange -a y $LVM";
    if [ "$EXTRA_DRBD" ]; then
        ssh $SSH_OPTS $PEER "drbdadm primary $EXTRA_DRBD"
    fi
}
function dom_names {
    ls $CFGDIR | egrep '^[0-9a-z]+$'
}
function start_disk {
    echo "Starting volumes:"
    drbdadm primary $DRBD
    vgscan
    vgchange -a y $LVM
    if [ "$EXTRA_DRBD" ]; then
        drbdadm primary $EXTRA_DRBD
    fi
}
function stop_disk {
    echo "Stopping volumes:"
    vgchange -a n $LVM || true
    drbdadm secondary $DRBD || true
    if [ "$EXTRA_DRBD" ]; then
        drbdadm secondary $EXTRA_DRBD || true
    fi
}
function update_mac_cache {
    arp -d $name >/dev/null 2>&1 || true
    ping -c1 -w1 $name >/dev/null 2>&1 || true
}
function start_domains {
    start_disk
    local name
    for name in $(dom_names); do
        echo -n "Starting $name: "
        if is_alive $name; then
            echo "already running."
        else
            if safe_to_migrate &&
                    ssh $SSH_OPTS $PEER "xm migrate --live $name $SELF"; then
                update_mac_cache
                echo "migrated back."
            else
                xm create -q $CFGDIR/$name
                echo "created."
                sleep 2
            fi
        fi
    done
    if safe_to_migrate; then
        ssh $SSH_OPTS $PEER "vgchange -a n $LVM; drbdadm secondary $DRBD" || true
        if [ "$EXTRA_DRBD" ]; then
            ssh $SSH_OPTS $PEER "drbdadm secondary $EXTRA_DRBD" || true
        fi
    fi
    touch $LOCKFILE
}
function stop_domains {
    rm -f $LOCKFILE
    local migration
    if safe_to_migrate && prepare_migration; then
        migration="OK"
    else
        migration="NG"
    fi
    local name
    for name in $(dom_names); do
        echo -n "Stopping $name: "
        if ! is_alive $name; then
            echo "not running."
        else
            if [ $migration = "OK" ] && xm migrate --live $name $PEER; then
                update_mac_cache
                echo "migrated."
            else
                xm shutdown $name
                echo "shutting down..."
            fi
        fi
    done
    echo -n "Waiting for shutdown to complete..."
    local n=0
    while [ $n -lt 60 ]; do
        alive=0
        for name in $(dom_names); do
            if is_alive $name; then
                alive=1
            fi
        done
        if [ $alive = 0 ]; then
            echo "ok"
            break
        fi
        echo -n "."
        sleep 1
        n=$(expr $n + 1)
    done
    for name in $(dom_names); do
        if is_alive $name; then
            echo "Destroying $name"
            xm destroy $name
        fi
    done
    stop_disk
}
function print_status {
    if [ -f $LOCKFILE ]; then
        echo "OK"
    else
        echo "Stopped"
        exit 1
    fi
}
case $command in
start)
    start_domains
    ;;
stop)
    stop_domains
    ;;
status)
    print_status
    ;;
*)
    usage
    ;;
esac

/etc/xen/xen1.cfg

CFGDIR=/etc/xen/xen1
LOCKFILE=/var/lock/xen1domains
DRBD=drbd1
LVM=vg1

/etc/xen/xen2.cfg

CFGDIR=/etc/xen/xen2
LOCKFILE=/var/lock/xen2domains
DRBD=drbd2
LVM=vg2
EXTRA_DRBD=drbd3

/etc/ha.d/haresources

xen1 xendom::/etc/xen/xen1.cfg
xen2 xendom::/etc/xen/xen2.cfg
xen2 MailTo::root
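The companion /etc/ha.d/ha.cf is ordinary heartbeat 1.x configuration. Roughly like this; the timing values here are illustrative assumptions rather than our exact settings, and the STONITH plugin configuration for our homemade device is omitted:

```
# /etc/ha.d/ha.cf (sketch; timings are example values)
keepalive 1
deadtime 10
warntime 5
initdead 60
udpport 694
bcast eth1              # heartbeat over the crossover link
auto_failback on        # move guests back when their home node returns
node xen1
node xen2
```

With auto_failback on, rebooting one host drains its guests to the peer on shutdown and pulls them back (again by live migration, via the script above) once it rejoins.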

Domain configuration files in /etc/xen/xen1 and /etc/xen/xen2 are handled by this script. The filename of each configuration file must be the same as the hostname of the guest.
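Such a guest configuration file is an ordinary Xen domU config whose disks are LVs from the host's volume group. An example sketch; the guest name "mail" and all values are made up for illustration:

```
# /etc/xen/xen1/mail -- filename equals the guest's hostname
kernel  = '/boot/vmlinuz-2.6.18-xen'
ramdisk = '/boot/initrd.img-2.6.18-xen'
memory  = 512
name    = 'mail'
vif     = [ 'bridge=xenbr0' ]
disk    = [ 'phy:/dev/vg1/mail-disk,xvda1,w',
            'phy:/dev/vg1/mail-swap,xvda2,w' ]
root    = '/dev/xvda1 ro'
```

Because the backing LV lives on a dual-Primary DRBD device, the same file works unchanged on whichever host the guest migrates to.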

Miscellaneous things to achieve stability

Some of the following measures are really effective; others maybe not.

Takeshi Sone <ts1 at himeya.com>, Himeya Soft, Inc.