Configuring Diskless SBD in an Existing High Availability Cluster
SUSE Linux Enterprise High Availability 16.0

Publication Date: 21 Jan 2026
WHAT?

SBD provides a node fencing mechanism without using an external power-off device. Node fencing protects the cluster from data corruption by resetting failed nodes.

WHY?

To be supported, all SUSE Linux Enterprise High Availability clusters must have node fencing configured.

EFFORT

Configuring diskless SBD in an existing cluster only takes a few minutes and does not require any downtime for cluster resources.

GOAL

SBD can be configured during the initial cluster setup or later in a running cluster. This article explains how to configure SBD in a High Availability cluster that is already installed and running.

REQUIREMENTS
  • An existing SUSE Linux Enterprise High Availability cluster.

  • A hardware watchdog device on all cluster nodes.

To configure disk-based SBD instead, see Configuring Disk-Based SBD in an Existing High Availability Cluster.

If the SBD service is already running, see Changing the Configuration of SBD.

1 What is node fencing?

In a split-brain scenario, cluster nodes are divided into two or more groups (or partitions) that do not know about each other. This might be because of a hardware or software failure, or a failed network connection, for example. A split-brain scenario can be resolved by fencing (resetting or powering off) one or more of the nodes. Node fencing prevents a failed node from accessing shared resources and prevents cluster resources from running on a node with an uncertain status. This helps protect the cluster from data corruption.

To be supported, all SUSE Linux Enterprise High Availability clusters must have at least one node fencing device configured. For critical workloads, we recommend using two or three fencing devices. A fencing device can be either a physical device (a power switch) or a software mechanism (SBD in combination with a watchdog).
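
For example, to check whether fencing is already enabled in a running cluster, you can query the stonith-enabled cluster property. One way to do this is with the low-level crm_attribute tool:

> sudo crm_attribute --type crm_config --name stonith-enabled --query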

1.1 Components

pacemaker-fenced

The pacemaker-fenced daemon runs on every node in the High Availability cluster. It accepts fencing requests from pacemaker-controld. It can also check the status of the fencing device.

Fence agent

Each type of fencing device can be controlled by a specific fence agent, a stonith-class resource agent that acts as an interface between the cluster and the fencing device. Starting or stopping a fencing resource means registering or deregistering the fencing device with the pacemaker-fenced daemon and does not perform any operation on the device itself. Monitoring a fencing resource means logging in to the device to verify that it works.

Fencing device

The fencing device is the actual physical device that resets or powers off a node when requested by the cluster via the fence agent. The device you use depends on your budget and hardware.

1.2 Fencing devices

Physical devices
  • Power Distribution Units (PDU) are devices with multiple power outlets that can provide remote load monitoring and power recycling.

  • Uninterruptible Power Supplies (UPS) provide emergency power to connected equipment in the event of a power failure.

  • Blade power control devices can be used for fencing if the cluster nodes are running on a set of blades. This device must be capable of managing single-blade computers.

  • Lights-out devices are network-connected devices that allow remote management and monitoring of servers.

Software mechanisms
  • Disk-based SBD fences nodes by exchanging messages via shared block storage. It works together with a watchdog on each node to ensure that misbehaving nodes are really stopped.

  • Diskless SBD fences nodes by using only the watchdog, without a shared storage device. Unlike other node fencing mechanisms, diskless SBD does not need a fence agent.

  • The fence_kdump agent checks if a node is performing a kernel dump (kdump). If a kdump is in progress, the cluster acts as if the node was fenced, because the node will reboot after the kdump is complete. If a kdump is not in progress, the next fencing device fences the node. This fence agent must be used together with a physical fencing device. It cannot be used with SBD.

1.3 For more information

For more information, see https://clusterlabs.org/projects/pacemaker/doc/3.0/Pacemaker_Explained/html/fencing.html.

For a full list of available fence agents, run the crm ra list stonith command.

For details about a specific fence agent, run the crm ra info stonith:fence_AGENT command.
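
For example, to list all available fence agents and then display the details of the fence_kdump agent described above:

> crm ra list stonith
> crm ra info stonith:fence_kdump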

2 What is SBD?

SBD (STONITH Block Device or Storage-Based Death) provides a node fencing mechanism without using an external power-off device. The software component (the SBD daemon) works together with a watchdog device to ensure that misbehaving nodes are fenced. SBD can be used in disk-based mode with shared block storage, or in diskless mode using only the watchdog.

Diskless SBD fences nodes by using only the watchdog, without a shared storage device. A node is fenced if it loses quorum, if any monitored daemon is lost and cannot be recovered, or if Pacemaker determines that the node requires fencing.

2.1 Components

SBD daemon

The SBD daemon starts on each node before the rest of the cluster stack and stops in the reverse order. This ensures that cluster resources are never active without SBD supervision.

Watchdog

SBD needs a watchdog on each node to ensure that misbehaving nodes are really stopped. SBD feeds the watchdog by regularly writing a service pulse to it. If SBD stops feeding the watchdog, the hardware enforces a system restart. This protects against failures of the SBD process itself, such as becoming stuck on an I/O error.

2.2 Limitations and recommendations

Diskless SBD
  • Diskless SBD cannot handle a split-brain scenario for a two-node cluster. This configuration should only be used for clusters with more than two nodes, or in combination with QDevice to help handle split-brain scenarios.
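
For example, if you need diskless SBD in a two-node cluster, you can add a QDevice with the qdevice stage of the cluster setup script. The following command is only a sketch; QNETD_SERVER is a placeholder for the host that runs the QNetd service:

> sudo crm cluster init qdevice --qnetd-hostname=QNETD_SERVER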

2.3 For more information

For more information, see the sbd man page or run the crm sbd help command.

3 Setting up the SBD watchdog

SBD needs a watchdog on each node to ensure that misbehaving nodes are really stopped. SBD feeds the watchdog by regularly writing a service pulse to it. If SBD stops feeding the watchdog, the hardware enforces a system restart. This protects against failures of the SBD process itself, such as becoming stuck on an I/O error.

Hardware-specific watchdog drivers are available as kernel modules. However, sometimes the wrong watchdog module loads automatically. Use this procedure to make sure the correct module is loaded.

Important: softdog limitations

If no hardware watchdog is available, crmsh automatically configures the software watchdog (softdog) when configuring SBD. This watchdog can be used for testing purposes, but is not recommended for production environments.

The softdog driver assumes that at least one CPU is still running, so if all CPUs are stuck, softdog cannot reboot the system. Hardware watchdogs work even if all CPUs are stuck.

Perform this procedure on all nodes in the cluster:

  1. List the drivers that are installed with your kernel version:

    > rpm -ql kernel-VERSION | grep watchdog

    To help you find the correct driver for your hardware, see Table 1, “Commonly used watchdog drivers”. However, this is not a complete list and might not be accurate for your specific system. Check your system's hardware configuration if possible, or ask your hardware or system vendor for details about system-specific watchdog configuration.

  2. Check whether any watchdog modules are already loaded in the kernel:

    > lsmod | egrep "(wdt|dog)"

    If the correct watchdog module is already loaded, you can skip to Step 7.

  3. If the wrong watchdog module is loaded, you can unload it with the following command:

    > sudo rmmod WRONG_MODULE
  4. Enable the watchdog module that matches your hardware:

    > sudo bash -c "echo WATCHDOG_MODULE > /etc/modules-load.d/watchdog.conf"
    Tip

    If you run this command as the root user, you can omit bash -c and the quotes (""):

    # echo WATCHDOG_MODULE > /etc/modules-load.d/watchdog.conf
  5. Reload the kernel modules:

    > sudo systemctl restart systemd-modules-load
  6. Check whether the watchdog module is loaded correctly:

    > lsmod | egrep "(wdt|dog)"
  7. Verify that at least one watchdog device is available:

    > sudo sbd query-watchdog

    If no watchdog device is available, you might need to use a different driver.

  8. Verify that the watchdog device works:

    > sudo sbd -w /dev/WATCHDOG_DEVICE test-watchdog

    If the test is successful, the node reboots.

Important: Accessing the watchdog timer

SBD must be the only software that accesses the watchdog timer. Some hardware vendors ship systems management software that uses the watchdog for system resets (for example, the HP ASR daemon). If this is the case, disable the additional software.

Table 1: Commonly used watchdog drivers

  Hardware                    Driver
  HP                          hpwdt
  Dell, Lenovo (Intel TCO)    iTCO_wdt
  Fujitsu                     ipmi_watchdog
  LPAR on IBM Power           pseries-wdt
  VM on IBM z/VM              vmwatchdog
  VM on VMware vSphere        wdat_wdt
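
As a concrete illustration of steps 4 to 6, assume the system uses the Intel TCO watchdog. The module name iTCO_wdt is only an example; replace it with the driver that matches your hardware:

> sudo bash -c "echo iTCO_wdt > /etc/modules-load.d/watchdog.conf"
> sudo systemctl restart systemd-modules-load
> lsmod | egrep "(wdt|dog)"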

4 Setting up diskless SBD

Diskless SBD fences nodes by using only the watchdog, without a shared storage device. However, diskless SBD cannot handle a split-brain scenario for a two-node cluster. This configuration should only be used for clusters with more than two nodes, or in combination with QDevice to help handle split-brain scenarios.

This procedure explains how to configure SBD after the cluster is already installed and running, not during the initial cluster setup.

Important: Cluster restart required

In this procedure, the setup script has to restart the cluster services before it can modify the stonith-watchdog-timeout. Therefore, if any resources are running, you must put the cluster into maintenance mode before running the script. This allows the services managed by the resources to keep running while the cluster restarts. However, be aware that the resources will not have cluster protection while in maintenance mode.

Requirements
  • An existing High Availability cluster is already running.

  • SBD is not configured yet.

  • All nodes have a watchdog device, and the correct watchdog kernel module is loaded.

Perform this procedure on only one cluster node:

  1. Log in either as the root user or as a user with sudo privileges.

  2. Check whether any resources are running:

    > sudo crm status
  3. If any resources are running, put the cluster into maintenance mode:

    > sudo crm maintenance on

    In this state, the cluster stops monitoring all resources. This allows the services managed by the resources to keep running while the cluster restarts. However, be aware that the resources will not have cluster protection while in maintenance mode.

  4. Run the SBD stage of the cluster setup script, using the option --enable-sbd (or -S) to specify diskless SBD:

    > sudo crm cluster init sbd --enable-sbd 
    Additional options
    • If multiple watchdogs are available, you can use the option --watchdog (or -w) to choose which watchdog to use. Specify either the device name (for example, /dev/watchdog1) or the driver name (for example, iTCO_wdt). See the example after this procedure.

    The script updates the SBD configuration file and restarts the cluster services, then updates additional timeout settings. Unlike other node fencing mechanisms, diskless SBD does not need a fence agent.

  5. If the cluster is still in maintenance mode, put it back into normal operation:

    > sudo crm maintenance off
  6. Check the SBD configuration:

    > sudo crm sbd configure show

    The output of this command shows the enabled settings in the /etc/sysconfig/sbd file and the SBD-related cluster settings.

  7. Check the status of SBD:

    > sudo crm sbd status

    The output of this command shows the type of SBD configured, information about the SBD watchdog, and the status of the SBD service.
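
If multiple watchdogs are available, the command in step 4 can also specify which one to use with the --watchdog option. For example, assuming the driver iTCO_wdt should be used (the driver name is only an illustration):

> sudo crm cluster init sbd --enable-sbd --watchdog iTCO_wdt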

5 Testing SBD and node fencing

The crm cluster crash_test command simulates cluster failures and reports the results. To test SBD and node fencing, you can run one or more of the tests --fence-node, --kill-sbd and --split-brain-iptables.

The command supports the following checks:

--fence-node NODE

Fences a specific node passed from the command line.

--kill-sbd / --kill-corosync / --kill-pacemakerd

Kills the daemons for SBD, Corosync, or Pacemaker. After running one of these tests, you can find a report in the directory /var/lib/crmsh/crash_test/. The report includes a test case description, action logging, and an explanation of possible results.

--split-brain-iptables

Simulates a split-brain scenario by blocking the Corosync port, and checks whether one node can be fenced as expected. You must install iptables before you can run this test.

For more information, run the crm cluster crash_test --help command.
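
For example, to kill the SBD daemon and then look for the resulting report:

> sudo crm cluster crash_test --kill-sbd
> ls /var/lib/crmsh/crash_test/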

This example uses nodes called alice and bob, and tests fencing bob. To watch bob change status during the test, you can log in to Hawk and navigate to Status › Nodes, or run crm status from another node.

Example 1: Manually triggering node fencing
admin@alice> sudo crm cluster crash_test --fence-node bob

==============================================
Testcase:          Fence node bob
Fence action:      reboot
Fence timeout:     95

!!! WARNING WARNING WARNING !!!
THIS CASE MAY LEAD TO NODE BE FENCED.
TYPE Yes TO CONTINUE, OTHER INPUTS WILL CANCEL THIS CASE [Yes/No](No): Yes
INFO: Trying to fence node "bob"
INFO: Waiting 95s for node "bob" reboot...
INFO: Node "bob" will be fenced by "alice"!
INFO: Node "bob" was fenced by "alice" at DATE TIME