Configuring Disk-Based SBD in an Existing High Availability Cluster
- WHAT?
How to use the CRM Shell to configure disk-based SBD in a High Availability cluster that is already installed and running.
- WHY?
To be supported, all SUSE Linux Enterprise High Availability clusters must have STONITH (node fencing) configured. SBD provides a node fencing mechanism without using an external power-off device.
- EFFORT
Configuring disk-based SBD in an existing cluster only takes a few minutes and does not require any downtime for cluster resources.
- GOAL
Protect the cluster from data corruption by fencing failed nodes.
- REQUIREMENTS
An existing SUSE Linux Enterprise High Availability cluster
Shared storage accessible from all cluster nodes
A hardware watchdog device on all cluster nodes
If the SBD service is already running, see Changing the Configuration of SBD.
1 What is STONITH? #
In a split-brain scenario, cluster nodes are divided into two or more groups (or partitions) that do not know about each other. This might be because of a hardware or software failure, or a failed network connection, for example. A split-brain scenario can be resolved by fencing (resetting or powering off) one or more of the nodes. Node fencing prevents a failed node from accessing shared resources and prevents cluster resources from running on a node with an uncertain status. This helps protect the cluster from data corruption.
SUSE Linux Enterprise High Availability uses STONITH as the node fencing mechanism. To be supported, all SUSE Linux Enterprise High Availability clusters must have at least one STONITH device. For critical workloads, we recommend using two or three STONITH devices. A STONITH device can be either a physical device (a power switch) or a software mechanism (SBD in combination with a watchdog).
1.1 Components #
- pacemaker-fenced
The pacemaker-fenced daemon runs on every node in the High Availability cluster. It accepts fencing requests from pacemaker-controld. It can also check the status of the fencing device.
- STONITH resource agent
The interface between the cluster and the fencing device. Every supported fencing device can be controlled by a specific STONITH resource agent.
- STONITH device
The device that resets or powers off a node when requested by the cluster. The STONITH device you use depends on your budget and hardware.
1.2 STONITH devices #
- Physical devices
Power Distribution Units (PDU) are devices with multiple power outlets that can provide remote load monitoring and power recycling.
Uninterruptible Power Supplies (UPS) provide emergency power to connected equipment in the event of a power failure.
Blade power control devices can be used for fencing if the cluster nodes are running on a set of blades. This device must be capable of managing single-blade computers.
Lights-out devices are network-connected devices that allow remote management and monitoring of servers.
- Software mechanisms
Disk-based SBD fences nodes by exchanging messages via shared block storage. It works together with a watchdog on each node to ensure that misbehaving nodes are really stopped.
Diskless SBD fences nodes by using only the watchdog, without a shared storage device. Unlike other STONITH mechanisms, diskless SBD does not need a STONITH resource agent.
fence_kdump checks if a node is performing a kernel dump. If so, the cluster acts as if the node was fenced. This avoids fencing a node that is already down but doing a dump. This resource agent must be used together with a physical STONITH device. It cannot be used with SBD.
1.3 For more information #
For more information about fencing and STONITH, see https://clusterlabs.org/projects/pacemaker/doc/3.0/Pacemaker_Explained/html/fencing.html.
For a full list of supported STONITH devices, run the crm ra list stonith command.
For details about a specific STONITH device, run the crm ra info STONITH_DEVICE command.
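For example, to see the available agents and inspect the SBD fencing agent that the setup in this guide creates (a hedged example; the agents available depend on the packages installed on your system):
> crm ra list stonith
> crm ra info stonith:fence_sbd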
2 What is SBD? #
SBD (STONITH Block Device) provides a node fencing mechanism without using an external power-off device. The software component (the SBD daemon) works together with a watchdog device to ensure that misbehaving nodes are fenced. SBD can be used in disk-based mode with shared block storage, or in diskless mode using only the watchdog.
Disk-based SBD uses shared block storage to exchange fencing messages between the nodes. It can be used with one to three devices. One device is appropriate for simple cluster setups, but two or three devices are recommended for more complex setups or critical workloads.
Diskless SBD fences nodes by using only the watchdog, without relying on a shared storage device. A node is fenced if it loses quorum, if any monitored daemon is lost and cannot be recovered, or if Pacemaker determines that the node requires fencing.
2.1 Components #
- SBD daemon
The SBD daemon starts on each node before the rest of the cluster stack and stops in the reverse order. This ensures that cluster resources are never active without SBD supervision.
- SBD device (disk-based SBD)
A small logical unit (or a small partition on a logical unit) is formatted for use with SBD. A message layout is created on the device with slots for up to 255 nodes.
- Messages (disk-based SBD)
The message layout on the SBD device is used to send fencing messages to nodes. The SBD daemon on each node monitors the message slot and immediately complies with any requests. To avoid becoming disconnected from fencing messages, the SBD daemon also fences the node if it loses its connection to the SBD device.
- Watchdog
SBD needs a watchdog on each node to ensure that misbehaving nodes are really stopped. SBD “feeds” the watchdog by regularly writing a service pulse to it. If SBD stops feeding the watchdog, the hardware enforces a system restart. This protects against failures of the SBD process itself, such as becoming stuck on an I/O error.
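For orientation, the SBD daemon is configured through the /etc/sysconfig/sbd file, which the setup script described later in this guide fills in for you. The following is a minimal sketch of a disk-based configuration with illustrative values only; do not copy it verbatim:
SBD_DEVICE="/dev/disk/by-id/DEVICE_ID"    (one to three devices, separated by ";")
SBD_WATCHDOG_DEV="/dev/watchdog"          (the watchdog device fed by the SBD daemon)
SBD_WATCHDOG_TIMEOUT="5"                  (seconds before the watchdog reboots the node)
SBD_PACEMAKER="yes"                       (integrate with Pacemaker)
SBD_STARTMODE="always"                    (start SBD even if the node was previously fenced)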
2.2 Limitations and recommendations #
- Disk-based SBD
The shared storage can be Fibre Channel (FC), Fibre Channel over Ethernet (FCoE), or iSCSI.
The shared storage must not use host-based RAID, LVM, Cluster MD, or DRBD.
Using storage-based RAID and multipathing is recommended for increased reliability.
If a shared storage device has different /dev/sdX names on different nodes, SBD communication will fail. To avoid this, always use stable device names, such as /dev/disk/by-id/DEVICE_ID (see the example after this list).
An SBD device can be shared between different clusters, up to a limit of 255 nodes.
When using more than one SBD device, all devices must have the same configuration.
- Diskless SBD
Diskless SBD cannot handle a split-brain scenario for a two-node cluster. This configuration should only be used for clusters with more than two nodes, or in combination with QDevice to help handle split-brain scenarios.
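To find stable device names for your shared storage, list the persistent aliases on each node and confirm that they point to the same device. Each entry links a persistent name to a kernel device name such as /dev/sdb, which can differ between nodes and between reboots:
> ls -l /dev/disk/by-id/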
2.3 For more information #
For more information, see the sbd man page or run the crm sbd help command.
3 Setting up the SBD watchdog #
SBD needs a watchdog on each node to ensure that misbehaving nodes are really stopped. SBD “feeds” the watchdog by regularly writing a service pulse to it. If SBD stops feeding the watchdog, the hardware enforces a system restart. This protects against failures of the SBD process itself, such as becoming stuck on an I/O error.
Hardware-specific watchdog drivers are available as kernel modules. However, sometimes the wrong watchdog module loads automatically. Use this procedure to make sure the correct module is loaded.
softdog limitations
If no hardware watchdog is available, crmsh automatically configures the software watchdog (softdog) when configuring SBD. This watchdog can be used for testing purposes, but is not recommended for production environments.
The softdog driver assumes that at least one CPU is still running, so if all CPUs are stuck, softdog cannot reboot the system. Hardware watchdogs work even if all CPUs are stuck.
Perform this procedure on all nodes in the cluster:
1. List the drivers that are installed with your kernel version:
   > rpm -ql kernel-VERSION | grep watchdog
   To help you find the correct driver for your hardware, see Table 1, “Commonly used watchdog drivers”. However, this is not a complete list and might not be accurate for your specific system. Check your system's hardware configuration if possible, or ask your hardware or system vendor for details about system-specific watchdog configuration.
2. Check whether any watchdog modules are already loaded in the kernel:
   > lsmod | egrep "(wdt|dog)"
   If the correct watchdog module is already loaded, you can skip to Step 7.
3. If the wrong watchdog module is loaded, you can unload it with the following command:
   > sudo rmmod WRONG_MODULE
4. Enable the watchdog module that matches your hardware:
   > sudo bash -c "echo WATCHDOG_MODULE > /etc/modules-load.d/watchdog.conf"
   If you run this command as the root user, you can omit bash -c and the quotes (""):
   # echo WATCHDOG_MODULE > /etc/modules-load.d/watchdog.conf
5. Reload the kernel modules:
   > sudo systemctl restart systemd-modules-load
6. Check whether the watchdog module is loaded correctly:
   > lsmod | egrep "(wdt|dog)"
7. Verify that at least one watchdog device is available:
   > sudo sbd query-watchdog
   If no watchdog device is available, you might need to use a different driver.
8. Verify that the watchdog device works:
   > sudo sbd -w /dev/WATCHDOG_DEVICE test-watchdog
   If the test is successful, the node reboots.
SBD must be the only software that accesses the watchdog timer. Some hardware vendors ship systems management software that uses the watchdog for system resets (for example, the HP ASR daemon). If this is the case, disable the additional software.
Table 1: Commonly used watchdog drivers
| Hardware | Driver |
|---|---|
| HP | hpwdt |
| Dell, Lenovo (Intel TCO) | iTCO_wdt |
| Fujitsu | ipmi_watchdog |
| LPAR on IBM Power | pseries-wdt |
| VM on IBM z/VM | vmwatchdog |
| VM on VMware vSphere | wdat_wdt |
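As a worked example, the following sequence switches a node from softdog to the Intel TCO watchdog listed in Table 1. The module name is only an illustration; substitute the driver that matches your hardware:
> sudo rmmod softdog
> sudo bash -c "echo iTCO_wdt > /etc/modules-load.d/watchdog.conf"
> sudo systemctl restart systemd-modules-load
> lsmod | egrep "(wdt|dog)"
> sudo sbd query-watchdog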
4 Setting up disk-based SBD #
Disk-based SBD fences nodes by exchanging messages via shared block storage. It works together with a watchdog on each node to ensure that misbehaving nodes are really stopped. You can configure up to three SBD devices.
This procedure explains how to configure SBD after the cluster is already installed and running, not during the initial cluster setup.
In this procedure, the script checks whether it is safe to restart the cluster services automatically. If any non-STONITH resources are running, the script warns you to restart the cluster services manually. This allows you to put the cluster into maintenance mode first to avoid resource downtime. However, be aware that the resources will not have cluster protection while in maintenance mode.
Make sure any device you want to use for SBD does not hold any important data. Configuring a device for use with SBD overwrites the existing data.
An existing High Availability cluster is already running.
The SBD service is not running.
Shared storage is configured and accessible on all nodes.
The path to the shared storage device is consistent across all nodes. Use stable device names such as /dev/disk/by-id/DEVICE_ID.
All nodes have a watchdog device, and the correct watchdog kernel module is loaded.
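Before you start, you can quickly confirm these requirements: that the cluster is running, that the SBD service is not yet active, that the shared device is visible under a stable name, and that a watchdog module is loaded. This is a hedged sketch; substitute your own device ID:
> sudo crm status
> systemctl is-active sbd
> ls -l /dev/disk/by-id/DEVICE_ID
> lsmod | egrep "(wdt|dog)"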
Perform this procedure on only one cluster node:
Log in either as the root user or as a user with sudo privileges.
Run the SBD stage of the cluster setup script, using the option --sbd-device (or -s) to specify the shared storage device:
> sudo crm cluster init sbd --sbd-device /dev/disk/by-id/DEVICE_ID
Additional options #
You can use --sbd-device (or -s) multiple times to configure up to three SBD devices. Each SBD device must use a different shared storage device.
If multiple watchdogs are available, you can use the option --watchdog (or -w) to choose which watchdog to use. Specify either the device name (for example, /dev/watchdog1) or the driver name (for example, iTCO_wdt).
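For example, a setup with three SBD devices that also selects the watchdog explicitly might look like this (the device IDs and the watchdog name are placeholders for your own values):
> sudo crm cluster init sbd -s /dev/disk/by-id/DEVICE_ID1 -s /dev/disk/by-id/DEVICE_ID2 -s /dev/disk/by-id/DEVICE_ID3 -w /dev/watchdog1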
The script initializes SBD on the shared storage device, creates a stonith:fence_sbd cluster resource, and updates the SBD configuration file and timeout settings. The script also checks whether it is safe to restart the cluster services automatically. If any non-STONITH resources are running, the script warns you to restart the cluster services manually.
If you need to restart the cluster services manually, follow these steps to avoid resource downtime:
Put the cluster into maintenance mode:
> sudo crm maintenance on
In this state, the cluster stops monitoring all resources. This allows the services managed by the resources to keep running while the cluster restarts. However, be aware that the resources will not have cluster protection while in maintenance mode.
Restart the cluster services on all nodes:
> sudo crm cluster restart --all
Check the status of the cluster:
> sudo crm status
The nodes will have the status UNCLEAN (offline), but will soon change to Online.
When the nodes are back online, put the cluster back into normal operation:
> sudo crm maintenance off
Check the SBD configuration:
> sudo crm sbd configure show
The output of this command shows the SBD device's metadata, the enabled settings in the /etc/sysconfig/sbd file, and the SBD-related cluster settings.
Check the status of SBD:
> sudo crm sbd status
The output of this command shows the type of SBD configured, information about the SBD watchdog, and the statuses of the SBD service, disk, and cluster resource.
5 Testing SBD and node fencing #
Verify that SBD works as expected by performing one or more of the following tests:
5.1 Checking SBD communication #
Check whether the SBD device can send and receive messages between the nodes.
This procedure uses example nodes called alice and bob.
On either node, list the node slots and their current messages from the SBD device:
> sudo sbd -d /dev/disk/by-id/DEVICE_ID list
0 alice clear
1 bob clear
On bob, send a test message to alice:
> sudo sbd -d /dev/disk/by-id/DEVICE_ID message alice test
On alice, check /var/log/messages for the message from bob:
> sudo cat /var/log/messages | grep "test"
[...] Received command test from bob on disk /dev/disk/by-id/DEVICE_ID
This confirms that SBD is running and ready to receive messages.
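After the test, you can return the slot to the clear state and list the slots again to confirm (a hedged example; see the sbd man page for the full set of message types):
> sudo sbd -d /dev/disk/by-id/DEVICE_ID message alice clear
> sudo sbd -d /dev/disk/by-id/DEVICE_ID list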
5.2 Testing cluster failures #
The crm cluster crash_test command simulates cluster failures and reports the results. To test SBD and node fencing, you can run one or more of the tests --fence-node, --kill-sbd, and --split-brain-iptables.
The command supports the following checks:
- --fence-node NODE
Fences a specific node passed from the command line.
- --kill-sbd / --kill-corosync / --kill-pacemakerd
Kills the daemons for SBD, Corosync, or Pacemaker. After running one of these tests, you can find a report in the directory /var/lib/crmsh/crash_test/. The report includes a test case description, action logging, and an explanation of possible results.
- --split-brain-iptables
Simulates a split-brain scenario by blocking the Corosync port, and checks whether one node can be fenced as expected. You must install iptables before you can run this test.
For more information, run the crm cluster crash_test --help command.
This example uses nodes called alice and bob, and tests fencing bob. To watch bob change status during the test, you can log in to Hawk and monitor the cluster status, or run crm status from another node.
admin@alice> sudo crm cluster crash_test --fence-node bob
==============================================
Testcase:       Fence node bob
Fence action:   reboot
Fence timeout:  95

!!! WARNING WARNING WARNING !!!
THIS CASE MAY LEAD TO NODE BE FENCED.
TYPE Yes TO CONTINUE, OTHER INPUTS WILL CANCEL THIS CASE
[Yes/No](No): Yes
INFO: Trying to fence node "bob"
INFO: Waiting 95s for node "bob" reboot...
INFO: Node "bob" will be fenced by "alice"!
INFO: Node "bob" was fenced by "alice" at DATE TIME
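To exercise the watchdog path as well, you can run the --kill-sbd test described above and then review the report it writes (a hedged example; the exact report file names depend on your crmsh version):
> sudo crm cluster crash_test --kill-sbd
> ls /var/lib/crmsh/crash_test/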
6 Legal Notice #
Copyright © 2006–2025 SUSE LLC and contributors. All rights reserved.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or (at your option) version 1.3; with the Invariant Section being this copyright notice and license. A copy of the license version 1.2 is included in the section entitled “GNU Free Documentation License”.
For SUSE trademarks, see https://www.suse.com/company/legal/. All other third-party trademarks are the property of their respective owners. Trademark symbols (®, ™ etc.) denote trademarks of SUSE and its affiliates. Asterisks (*) denote third-party trademarks.
All information found in this book has been compiled with utmost attention to detail. However, this does not guarantee complete accuracy. Neither SUSE LLC, its affiliates, the authors, nor the translators shall be held liable for possible errors or the consequences thereof.