HighlyAvailableAoETarget

In this tutorial we will set up a highly available server providing ATA-over-Ethernet (AoE) targets to AoE initiators. Should a server become unavailable, services provided by our cluster will continue to be available to client systems.

Our highly available system will resemble the following:

AoE server1: node1.home.local IP address: 10.10.1.251
AoE server2: node2.home.local IP address: 10.10.1.252

To begin, set up two Ubuntu 9.04 (Jaunty Jackalope) systems. In this guide, the servers will be set up in a virtual environment using KVM-84. Using a virtual environment will allow us to add additional disk devices and NICs as needed.

The following partition scheme will be used for the Operating System installation:

/dev/vda1 -- 10 GB / (primary, jfs, Bootable flag: on)
/dev/vda5 -- 1 GB swap (logical)

After installing a minimal Ubuntu system on both servers, we will install the packages required to configure a bonded network interface and, in turn, assign static IP addresses to bond0 on node1 and node2. Using a bonded interface prevents the client-accessible network from becoming a single point of failure.

Install ifenslave

apt-get -y install ifenslave

Append the following to /etc/modprobe.d/aliases.conf:

alias bond0 bonding
options bond0 mode=0 miimon=100 downdelay=200 updelay=200 max_bonds=2

Modify our network configuration and assign eth0 and eth1 as slaves of bond0.

Example /etc/network/interfaces:

# The loopback network interface
auto lo
iface lo inet loopback

# The interfaces that will be bonded
auto eth0
iface eth0 inet manual

auto eth1
iface eth1 inet manual

# The initiator-accessible network interface
auto bond0
iface bond0 inet static
        address 10.10.1.251
        netmask 255.255.255.0
        broadcast 10.10.1.255
        network 10.10.1.0
        gateway 10.10.1.1
        up /sbin/ifenslave bond0 eth0
        up /sbin/ifenslave bond0 eth1

We do not need to define eth0 or eth1 in /etc/network/interfaces as they will be brought up when the bond comes up. I have included them for documentation purposes.

Please note: AoE does not use TCP/IP for communication; it uses raw Ethernet frames to carry ATA commands and data. We assign an IP address only so that we can administer the nodes over the public interface.

Review the current status of the bonded interface.

cat /proc/net/bonding/bond0 
Example output:
Ethernet Channel Bonding Driver: v3.3.0 (June 10, 2008)

Bonding Mode: load balancing (round-robin)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 200
Down Delay (ms): 200

Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 54:52:00:6d:f7:4d

Slave Interface: eth1
MII Status: up
Link Failure Count: 0
Permanent HW addr: 54:52:00:11:36:cf

Please note: A bonded network interface supports multiple modes. In this example eth0 and eth1 are in a round-robin configuration.
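The status check above can also be scripted. The following is a minimal sketch, assuming the status file follows the layout of the example output; the helper name bond_down_slaves is hypothetical and not part of any package.

```shell
# Hypothetical helper: print every slave interface whose MII status is not "up".
# $1 = path to a bonding status file, e.g. /proc/net/bonding/bond0
bond_down_slaves() {
    awk '
        /^Slave Interface:/ { slave = $3 }
        # The first "MII Status" line describes the bond itself, so it is
        # skipped until a slave section has been seen.
        /^MII Status:/ && slave != "" && $3 != "up" { print slave }
    ' "$1"
}
```

Running bond_down_slaves /proc/net/bonding/bond0 prints nothing when every slave link is up.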

Shut down both servers and add the additional devices. We will add disks to hold the DRBD meta data and the data that is mirrored between the two servers. We will also add an isolated network over which the two servers communicate and transfer the DRBD data.

The following partition scheme will be used for the DRBD data:

/dev/vdb1 -- 1 GB unmounted (primary) DRBD meta data
/dev/vdc1 -- 1 GB unmounted (primary) DRBD device used for AoE configuration files
/dev/vdd1 -- 10 GB unmounted (primary) DRBD device used as the AoE target

Sample output from fdisk -l:

Disk /dev/vda: 10.7 GB, 10737418240 bytes
255 heads, 63 sectors/track, 1305 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x000d570a

   Device Boot      Start         End      Blocks   Id  System
/dev/vda1   *           1        1244     9992398+  83  Linux
/dev/vda2            1245        1305      489982+   5  Extended
/dev/vda5            1245        1305      489951   82  Linux swap / Solaris

Disk /dev/vdb: 1073 MB, 1073741824 bytes
16 heads, 63 sectors/track, 2080 cylinders
Units = cylinders of 1008 * 512 = 516096 bytes
Disk identifier: 0xba6f1cad

   Device Boot      Start         End      Blocks   Id  System
/dev/vdb1               1        2080     1048288+  83  Linux

Disk /dev/vdc: 1073 MB, 1073741824 bytes
16 heads, 63 sectors/track, 2080 cylinders
Units = cylinders of 1008 * 512 = 516096 bytes
Disk identifier: 0xdbde4889

   Device Boot      Start         End      Blocks   Id  System
/dev/vdc1               1        2080     1048288+  83  Linux

Disk /dev/vdd: 10.7 GB, 10737418240 bytes
16 heads, 63 sectors/track, 20805 cylinders
Units = cylinders of 1008 * 512 = 516096 bytes
Disk identifier: 0xf505afa1

   Device Boot      Start         End      Blocks   Id  System
/dev/vdd1               1       20805    10485688+  83  Linux

The isolated network between the two servers will be:

AoE server1: node1-private IP address: 10.10.2.251
AoE server2: node2-private IP address: 10.10.2.252

We will again bond these two interfaces. If our server is to be highly available, we should eliminate all single points of failure.

Append the following to /etc/modprobe.d/aliases.conf:

alias bond1 bonding
options bond1 mode=0 miimon=100 downdelay=200 updelay=200

Example /etc/network/interfaces:

# The loopback network interface
auto lo
iface lo inet loopback

# The interfaces that will be bonded
auto eth0
iface eth0 inet manual

auto eth1
iface eth1 inet manual

auto eth2
iface eth2 inet manual

auto eth3
iface eth3 inet manual

# The initiator-accessible network interface
auto bond0
iface bond0 inet static
        address 10.10.1.251
        netmask 255.255.255.0
        broadcast 10.10.1.255
        network 10.10.1.0
        gateway 10.10.1.1
        up /sbin/ifenslave bond0 eth0
        up /sbin/ifenslave bond0 eth1

# The isolated network interface
auto bond1
iface bond1 inet static
        address 10.10.2.251
        netmask 255.255.255.0
        broadcast 10.10.2.255
        network 10.10.2.0
        up /sbin/ifenslave bond1 eth2
        up /sbin/ifenslave bond1 eth3

Ensure that /etc/hosts on both nodes contains the names and IP addresses of the two servers.

Example /etc/hosts:

127.0.0.1       localhost
10.10.1.251     node1.home.local    node1
10.10.1.252     node2.home.local    node2
10.10.2.251     node1-private
10.10.2.252     node2-private
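To confirm that both nodes carry the same entries, a small check can be run on each server. This is only a sketch; the function name missing_hosts is an assumption made for illustration.

```shell
# Hypothetical helper: print each required name that is absent from a hosts file.
# $1 = hosts file; remaining arguments = names that must appear in it
missing_hosts() {
    file=$1
    shift
    for name in "$@"; do
        # Search every non-comment line; field 1 is the IP address,
        # the remaining fields are host names.
        awk -v n="$name" '
            !/^#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }
        ' "$file" || echo "$name"
    done
}
```

For example, missing_hosts /etc/hosts node1 node2 node1-private node2-private prints nothing when all four names are defined.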

Install NTP to ensure both servers have the same time.

apt-get -y install ntp

You can verify the time is in sync with the date command.

At this point, you can either modprobe the second bond, or restart both servers.

Install drbd and heartbeat.

apt-get -y install drbd8-utils heartbeat

As we will be using heartbeat with drbd, we need to change the ownership and permissions of several DRBD-related files on both servers.

chgrp haclient /sbin/drbdsetup
chmod o-x /sbin/drbdsetup
chmod u+s /sbin/drbdsetup
chgrp haclient /sbin/drbdmeta
chmod o-x /sbin/drbdmeta
chmod u+s /sbin/drbdmeta

Using /etc/drbd.conf as an example, create your resource configuration. We will define two resources.

The drbd device that will contain our AoE configuration files
The drbd device that will become our AoE target

Example /etc/drbd.conf:

resource aoe.config {
        protocol C;
 
        handlers {
        pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
        pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
        local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
        outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";      
        }

        startup {
        degr-wfc-timeout 120;
        }

        disk {
        on-io-error detach;
        }

        net {
        cram-hmac-alg sha1;
        shared-secret "password";
        after-sb-0pri disconnect;
        after-sb-1pri disconnect;
        after-sb-2pri disconnect;
        rr-conflict disconnect;
        }

        syncer {
        rate 100M;
        verify-alg sha1;
        al-extents 257;
        }

        on node1 {
        device  /dev/drbd0;
        disk    /dev/vdc1;
        address 10.10.2.251:7788;
        meta-disk /dev/vdb1[0];
        }

        on node2 {
        device  /dev/drbd0;
        disk    /dev/vdc1;
        address 10.10.2.252:7788;
        meta-disk /dev/vdb1[0];
        }
}

resource aoe.target.0 {
        protocol C;
 
        handlers {
        pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
        pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
        local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
        outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";      
        }

        startup {
        degr-wfc-timeout 120;
        }

        disk {
        on-io-error detach;
        }

        net {
        cram-hmac-alg sha1;
        shared-secret "password";
        after-sb-0pri disconnect;
        after-sb-1pri disconnect;
        after-sb-2pri disconnect;
        rr-conflict disconnect;
        }

        syncer {
        rate 100M;
        verify-alg sha1;
        al-extents 257;
        }

        on node1 {
        device  /dev/drbd1;
        disk    /dev/vdd1;
        address 10.10.2.251:7789;
        meta-disk /dev/vdb1[1];
        }

        on node2 {
        device  /dev/drbd1;
        disk    /dev/vdd1;
        address 10.10.2.252:7789;
        meta-disk /dev/vdb1[1];
        }
}

Duplicate the DRBD configuration to the other server.

scp /etc/drbd.conf root@10.10.1.252:/etc/

Initialize the meta-data disk on both servers.

[node1]drbdadm create-md aoe.config
[node1]drbdadm create-md aoe.target.0
[node2]drbdadm create-md aoe.config
[node2]drbdadm create-md aoe.target.0

We could have initialized the meta-data disk for both resources with:

[node1]drbdadm create-md all
[node2]drbdadm create-md all

If a reboot was not performed post-installation of drbd, the module for DRBD will not be loaded.

Start the drbd service (which will load the module).

[node1]/etc/init.d/drbd start
[node2]/etc/init.d/drbd start

Decide which server will act as a primary for the DRBD device that will contain the AoE configuration files and initiate the first full sync between the two servers.

We will execute the following on node1:

drbdadm -- --overwrite-data-of-peer primary aoe.config

Review the current status of DRBD.

cat /proc/drbd 
Example output:
GIT-hash: 9ba8b93e24d842f0dd3fb1f9b90e8348ddb95829 build by ivoks@ubuntu, 2009-01-17 07:49:56
 0: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r---
    ns:761980 nr:0 dw:0 dr:769856 al:0 bm:46 lo:10 pe:228 ua:256 ap:0 ep:1 wo:b oos:293604
        [=============>......] sync'ed: 72.3% (293604/1048292)K
        finish: 0:00:13 speed: 21,984 (19,860) K/sec
 1: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r---
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:10485692

I prefer to wait for the initial sync to complete before proceeding; however, waiting is not a requirement.
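If you do choose to wait, the check can be automated. The sketch below assumes /proc/drbd has the layout shown in the example output; the function names drbd_disk_state and wait_for_sync are hypothetical.

```shell
# Print the disk states (the ds: field) of DRBD minor $1,
# e.g. "UpToDate/Inconsistent". $2 is the status file (default: /proc/drbd).
drbd_disk_state() {
    awk -v minor="$1" '
        $1 == (minor ":") {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^ds:/) print substr($i, 4)
        }
    ' "${2:-/proc/drbd}"
}

# Block until both sides of minor $1 report UpToDate, polling every 5 seconds.
wait_for_sync() {
    until [ "$(drbd_disk_state "$1" "$2")" = "UpToDate/UpToDate" ]; do
        sleep 5
    done
}
```

For example, wait_for_sync 0 returns once /dev/drbd0 is fully synchronized on both nodes.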

Once completed, format /dev/drbd0 and mount it.

[node1]mkfs.jfs /dev/drbd0
[node1]mkdir -p /srv/data
[node1]mount /dev/drbd0 /srv/data

To ensure replication is working correctly, create data on node1 and then switch node2 to be primary.

dd if=/dev/zero of=/srv/data/test.zeros bs=1M count=100

Switch to node2 and make it the Primary DRBD device:

On node1:
[node1]umount /srv/data
[node1]drbdadm secondary aoe.config
On node2:
[node2]mkdir -p /srv/data
[node2]drbdadm primary aoe.config
[node2]mount /dev/drbd0 /srv/data

You should now see the 100MB file in /srv/data on node2. We will now delete this file and make node1 the primary DRBD server to ensure replication is working in both directions.

Switch to node1 and make it the Primary DRBD device:

On node2:
[node2]rm /srv/data/test.zeros
[node2]umount /srv/data
[node2]drbdadm secondary aoe.config
On node1:
[node1]drbdadm primary aoe.config
[node1]mount /dev/drbd0 /srv/data

Running ls /srv/data on node1 will verify that the file has been removed and that synchronization occurred successfully in both directions.
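A file of zeros only shows that the file arrived on the peer. As an optional variation (a sketch, not part of the procedure above), a random file with a stored checksum also confirms that the contents match. The helper names and the test.random file name are assumptions.

```shell
# Write a 10 MB random test file and its MD5 checksum into directory $1.
make_test_file() {
    dd if=/dev/urandom of="$1/test.random" bs=1M count=10 2>/dev/null
    (cd "$1" && md5sum test.random > test.random.md5)
}

# Verify the test file in directory $1 against the stored checksum.
# Returns 0 when the contents match, non-zero otherwise.
verify_test_file() {
    (cd "$1" && md5sum -c test.random.md5 > /dev/null)
}
```

Run make_test_file /srv/data on the primary, switch roles as described above, and then run verify_test_file /srv/data on the new primary.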

Decide which server will act as a primary for the DRBD device that will be the AoE target and initiate the first full sync between the two servers. We will execute the following on node1:

drbdadm -- --overwrite-data-of-peer primary aoe.target.0

We could have initiated the full sync for both resources with:

drbdadm -- --overwrite-data-of-peer primary all

Next we will install the AoE target package. The plan is to have heartbeat control the service instead of init, so we will prevent the AoE target from starting with the normal init routines. We will then place the AoE target configuration files on the DRBD device so that each server has the information available when it is the primary DRBD node.

Install the AoE target package on node1 and node2.

[node1]sudo apt-get -y install vblade
[node2]sudo apt-get -y install vblade

The init scripts for vblade as of 4/22/2009 need to be modified.

The init script attempts to create a PID file in a directory that does not exist
The init script attempts to place a null value in the PID file.

Working /etc/init.d/vblade:
#!/bin/bash

### BEGIN INIT INFO
# Provides:             vblade
# Required-Start:       $network $local_fs $remote_fs
# Required-Stop:
# Default-Start:        2 3 4 5
# Default-Stop:         0 1 6
# Short-Description:    virtual AoE blade emulator
### END INIT INFO

set -e

# /etc/init.d/vblade start and stop the vblade daemon

test -x /usr/sbin/vblade || exit 0

. /lib/lsb/init-functions

RETVAL=0
prog=vblade

# Create the PID directory if it does not exist (must run after $prog is set).
test -d /var/run/$prog || mkdir -p /var/run/$prog

start_vblade() {
   ALLOWMACS=""
   [ -n "$5" ] && ALLOWMACS="-m $5"
   ID="$1-e$2.$3"
   PID_FILE=/var/run/$prog/${ID}.pid
   if [ -f $PID_FILE ]; then
        log_daemon_msg "The PID for $ID exists."
        return 0
   fi
   $prog $ALLOWMACS $2 $3 $1 $4 >> /var/log/$prog.log 2>&1 &
   pid=$!
   echo $pid > $PID_FILE
   echo -n $"$4 (e$2.$3@$1) [pid $pid]"
   [ "$RETVAL" = 0 ] && log_end_msg 0 || log_end_msg 1
   echo
}

start() {
   log_daemon_msg "Starting vblade daemons" "vblade"
   sed /^#/d /etc/$prog.conf | sed /^$/d | while read line
   do
         start_vblade $line
   done
}

stop() {
   log_daemon_msg "Stopping vblade daemons" "vblade"
   log_progress_msg "vblade"
   if ! ls /var/run/$prog/*.pid > /dev/null 2>&1; then
        log_daemon_msg "No vblade to stop"
   else
      for pidfile in /var/run/$prog/*.pid
         do
         # Ignore failures from stale PID files so "set -e" does not abort the loop.
         kill -9 `cat $pidfile` || true
         rm -f $pidfile
       done
   fi
  echo
}


case "$1" in
        start)
                start
                ;;
        stop)
                stop
                ;;
        restart|force-reload)
                stop
                start
                ;;
        reload)
                stop
                start
                ;;
        *)
        echo $"Usage: $0 {start|stop|restart|reload|force-reload}"
        RETVAL=1
esac

exit $RETVAL

Temporarily stop vblade

/etc/init.d/vblade stop

Remove vblade from the init scripts.

update-rc.d -f vblade remove

Relocate vblade configuration to /srv/data/aoe:

[node1]mkdir -p /srv/data/aoe
[node1]mv /etc/vblade.conf /srv/data/aoe
[node1]ln -s /srv/data/aoe/vblade.conf /etc/vblade.conf
[node2]rm /etc/vblade.conf
[node2]ln -s /srv/data/aoe/vblade.conf /etc/vblade.conf

Define our AoE target.

vblade reads its AoE target definitions from /etc/vblade.conf.

Example /etc/vblade.conf:

# network_device shelf slot file/disk/partition mac[,mac[,mac]]
bond0 0 1 /dev/drbd1

The above example:

Will use bond0 as our network interface
Defines our AoE target as shelf 0 slot 1
Defines the device associated with the AoE target
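To see how each configuration line maps to an AoE device name, the file can be read the same way the init script reads it. This is a minimal sketch; the function name list_aoe_targets is hypothetical.

```shell
# Hypothetical helper: summarize the targets defined in a vblade.conf-style file.
# $1 = configuration file (default: /etc/vblade.conf)
list_aoe_targets() {
    # Skip comments and blank lines, as the init script does.
    sed -e '/^#/d' -e '/^$/d' "${1:-/etc/vblade.conf}" |
    while read -r netif shelf slot path macs; do
        # Initiators see the target as e<shelf>.<slot>.
        echo "e${shelf}.${slot} on ${netif} -> ${path}"
    done
}
```

With the example file above, the function prints one line describing e0.1, which is the name the initiator will see under /dev/etherd.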

Last but not least, configure heartbeat to fail over AoE if a node fails.

On node1, define the cluster within /etc/heartbeat/ha.cf.

Example /etc/heartbeat/ha.cf:

logfacility     local0
keepalive 2
deadtime 30
warntime 10
initdead 120
bcast bond0
bcast bond1
node node1
node node2

On node1, define the authentication mechanism the cluster will use within /etc/heartbeat/authkeys.

Example /etc/heartbeat/authkeys:

auth 3
3 md5 password

Change the permissions of /etc/heartbeat/authkeys.

chmod 600 /etc/heartbeat/authkeys
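The literal key "password" is fine for a lab but easy to guess. As a hedged alternative, the sketch below generates a random key and prints a file in the same two-line format. The function name make_authkeys is hypothetical, and the choice of key index 1 with sha1 is an assumption; the md5 method from the example works the same way.

```shell
# Hypothetical helper: print authkeys content with a random 32-character key.
make_authkeys() {
    # Hash 512 random bytes to obtain a printable hexadecimal key.
    key=$(dd if=/dev/urandom bs=512 count=1 2>/dev/null | md5sum | cut -d' ' -f1)
    printf 'auth 1\n1 sha1 %s\n' "$key"
}
```

A possible invocation is (umask 077; make_authkeys > /etc/heartbeat/authkeys), which creates the file with restrictive permissions from the start. The same file must then be copied to the other node.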

On node1, define the resources that will run on the cluster within /etc/heartbeat/haresources. We will define the master node for each resource, the DRBD resources, the file system used, and the service to start. No virtual IP address is defined because AoE does not use IP.

Example /etc/heartbeat/haresources:

node1 drbddisk::aoe.config Filesystem::/dev/drbd0::/srv/data::jfs
node1 drbddisk::aoe.target.0 vblade

Copy the cluster configuration files from node1 to node2.

[node1]scp /etc/heartbeat/ha.cf root@10.10.1.252:/etc/heartbeat/
[node1]scp /etc/heartbeat/authkeys root@10.10.1.252:/etc/heartbeat/
[node1]scp /etc/heartbeat/haresources root@10.10.1.252:/etc/heartbeat/

At this point you can either:

Unmount /srv/data, make node1 secondary for drbd, and start heartbeat
Reboot both servers
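The first option can be sketched as a short sequence run on node1. The run wrapper and the function name hand_over_to_heartbeat are hypothetical; by default the sketch only prints the commands, and setting DRY_RUN=0 executes them. It assumes node1 is currently primary for both resources, as in the steps above.

```shell
# Print commands instead of running them unless DRY_RUN=0 is set.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Release the DRBD resources on this node, then let heartbeat take control.
hand_over_to_heartbeat() {
    run umount /srv/data
    run drbdadm secondary aoe.config
    run drbdadm secondary aoe.target.0
    run /etc/init.d/heartbeat start
}
```

After heartbeat has started on node1, start it on node2 as well so that both members of the cluster are running.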

To test connectivity to our new AoE target, configure an additional system to be an initiator.

I will use Ubuntu 9.04 (Jaunty Jackalope) for this as well.

Install the AoE userland tools.

apt-get -y install aoetools

The default configuration does not automatically start AoE communication. Modify /etc/default/aoetools to indicate which interface AoE should communicate on.

sed -i 's/INTERFACES="none"/INTERFACES="eth0"/' /etc/default/aoetools

Start the aoetools init script. This will cause a discovery to start.

Review the status of AoE discovery.

aoe-stat
Example output:
      e0.1        10.737GB   eth0 up
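Discovery can take a moment, so it may help to test for the target in a script. The following sketch reads aoe-stat output on standard input; the function name aoe_target_up is hypothetical, and it assumes the device name is the first column and the state is the last column, as in the example output.

```shell
# Succeed if device $1 (e.g. e0.1) is listed with state "up".
# Reads aoe-stat style output from standard input.
aoe_target_up() {
    awk -v dev="$1" '
        $1 == dev && $NF == "up" { found = 1 }
        END { exit !found }
    '
}
```

For example: aoe-stat | aoe_target_up e0.1 && echo "e0.1 is available".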

Once the available LUNs are discovered, as expected, we have a new disk.

Sample fdisk -l output:

Disk /dev/vda: 10.4 GB, 10485760000 bytes
255 heads, 63 sectors/track, 1274 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x000478c5

   Device Boot      Start         End      Blocks   Id  System
/dev/vda1   *           1        1214     9751423+  83  Linux
/dev/vda2            1215        1274      481950    5  Extended
/dev/vda5            1215        1274      481918+  82  Linux swap / Solaris

Disk /dev/etherd/e0.1: 10.7 GB, 10737345024 bytes
255 heads, 63 sectors/track, 1305 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

Disk /dev/etherd/e0.1 doesn't contain a valid partition table

Create a partition and file-system on our new AoE device.

fdisk /dev/etherd/e0.1
Command (m for help):                                                   <--- n
Command action
   e   extended
   p   primary partition (1-4)                                          <--- p
Partition number (1-4):                                                 <--- 1
First cylinder (1-10239, default 1):                                    <--- enter
Last cylinder, +cylinders or +size{K,M,G} (1-10239, default 10239):     <--- enter
Command (m for help):                                                   <--- w
mkfs.jfs -q /dev/etherd/e0.1p1

Create a mount point for the new file-system.

mkdir -p /mnt/aoe

We can either use /etc/fstab to mount our file-system at boot time, or we can add it to /etc/default/aoetools. To keep all the disks in one location, we will place an entry in /etc/fstab. Modern distros prefer to use the disk's UUID for mounting in fstab, but referring to the device the "old school" way still works.

The aoetools init script runs after /etc/fstab is parsed and file systems are mounted. We can either reorder the init scripts so that AoE discovery happens before /etc/fstab is parsed, or we can use the "old school" entry in /etc/fstab. The aoetools init script will parse /etc/fstab and mount file-systems with entries for /dev/etherd.

Add our new AoE disk to /etc/fstab.

printf "/dev/etherd/e0.1p1\t/mnt/aoe\tjfs\tnoatime\t0\t0\n" >> /etc/fstab
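Appending with >> adds a duplicate line each time the command is repeated. A small guard avoids that. This is a sketch only; the function name add_fstab_entry is hypothetical, and the jfs type and noatime option are taken from the entry above.

```shell
# Append a jfs entry for device $2 at mount point $3 to fstab file $1,
# unless an entry for that device already exists.
add_fstab_entry() {
    awk -v d="$2" '$1 == d { found = 1 } END { exit !found }' "$1" 2>/dev/null ||
        printf '%s\t%s\tjfs\tnoatime\t0\t0\n' "$2" "$3" >> "$1"
}
```

For example: add_fstab_entry /etc/fstab /dev/etherd/e0.1p1 /mnt/aoe. Running it a second time leaves the file unchanged.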

Mount our new AoE block device.

mount /mnt/aoe

Create data on the initiator node and test failover of the target. I prefer using a movie or a sound file, as this helps show latency. Once you have the test data available, play the movie or the MP3 and tell node1 that it is no longer a member of the cluster. This can be done by simply shutting down heartbeat.

[node1]/etc/init.d/heartbeat stop

Once you have tested the latency of the data transfer when node1 fails, start heartbeat on node1. This will in turn move the resources back to node1.

[node1]/etc/init.d/heartbeat start

An alternative test would be to fail over the nodes while writing data.
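One way to approximate such a write test on the initiator is to append a timestamp at a fixed interval and then look for the longest pause. This is a rough sketch; both function names are hypothetical, and the file path in the usage line is only an example.

```shell
# Append $2 timestamps (seconds since the epoch) to file $1,
# flushing to disk and sleeping $3 seconds after each write.
write_timestamps() {
    i=0
    while [ "$i" -lt "$2" ]; do
        date +%s >> "$1" && sync
        i=$((i + 1))
        sleep "$3"
    done
}

# Print the largest gap, in seconds, between consecutive timestamps in file $1.
largest_gap() {
    awk '
        NR > 1 && $1 - prev > max { max = $1 - prev }
        { prev = $1 }
        END { print max + 0 }
    ' "$1"
}
```

Start write_timestamps /mnt/aoe/failover.log 120 1 on the initiator, stop heartbeat on node1 while it runs, and afterwards run largest_gap /mnt/aoe/failover.log to estimate how long writes stalled during the failover.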
