
SUSE Linux Enterprise High Availability 16.0

Introduction to SUSE Linux Enterprise High Availability

Publication Date: 24 Oct 2025
WHAT?

This article explains the capabilities, architecture, core concepts and benefits of SUSE Linux Enterprise High Availability.

WHY?

Learn about SUSE Linux Enterprise High Availability to help you decide whether to use it, and to prepare before you set up a cluster for the first time.

EFFORT

Approximately 20 minutes of reading time.

1 What is SUSE Linux Enterprise High Availability?

SUSE® Linux Enterprise High Availability is an integrated suite of open source clustering technologies. A high-availability cluster is a group of servers (nodes) that work together to ensure the highest possible availability of data, applications and services. If one node fails, the resources that were running on it move to another node with little or no downtime. You can also manually move resources between nodes for load balancing, or to perform maintenance tasks with minimal downtime. Cluster resources can include Web sites, e-mail servers, databases, file systems, virtual machines and any other server-based applications or services that must be available to users at all times.

1.1 Product availability

SUSE Linux Enterprise High Availability is available with the following products:

SUSE Linux Enterprise Server

High Availability is available as an extension. This requires an additional registration code.

SUSE Linux Enterprise Server for SAP applications

High Availability is included as a module. No additional registration code is required.

1.2 Capabilities and benefits

SUSE Linux Enterprise High Availability eliminates single points of failure to improve the availability and manageability of critical resources. This helps you maintain business continuity, protect data integrity, and reduce unplanned downtime for critical workloads.

Multiple clustering options

SUSE Linux Enterprise High Availability clusters can be configured in different ways:

  • Local clusters: a single cluster in one location (for example, all nodes are located in one data center). Network latency is minimal. Storage is typically accessed synchronously by all nodes.

  • Metro clusters (stretched clusters): a single cluster that can stretch over multiple buildings or data centers, with all sites connected by Fibre Channel. Network latency is usually low. Storage is frequently replicated using mirroring or synchronous replication.

  • Hybrid clusters: virtual servers can be clustered together with physical servers. This improves service availability and resource usage.

Important: No support for mixed architectures

Clusters with mixed architectures are not supported. All nodes in the same cluster must have the same processor platform: AMD64/Intel 64, IBM Z, or POWER.

Flexible and scalable resource management

SUSE Linux Enterprise High Availability supports clusters of up to 32 nodes. Resources can automatically move to another node if the current node fails, or they can be moved manually to troubleshoot hardware or balance the workload. Resources can also be configured to move back to repaired nodes at a specific time. The cluster can stop and start resources based on configurable rules.
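
For example, resources can be moved with the crmsh command-line tool (described under "User-friendly administration tools" below). The following is a minimal sketch; the resource name websiteA and the node name node2 are hypothetical:

    # Move the resource "websiteA" to the node "node2"
    # (this creates a temporary location constraint)
    crm resource move websiteA node2
    # Remove that constraint again so the cluster is free to place the resource
    crm resource clear websiteA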

Wide range of resource agents

The cluster manages resources via resource agents (RAs). SUSE Linux Enterprise High Availability supports many different resource agents that are designed to manage specific types of applications or services, including Apache, IPv4 and IPv6 addresses, NFS and many more. It also ships with resource agents for third-party applications such as IBM WebSphere Application Server.
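
To browse the available agents, crmsh can list resource agent classes and show the metadata of an individual agent. A minimal sketch (the exact listings depend on the installed packages):

    # List the available resource agent classes and providers
    crm ra classes
    # List the OCF resource agents from the "heartbeat" provider
    crm ra list ocf heartbeat
    # Show the parameters and operations of the Apache resource agent
    crm ra info ocf:heartbeat:apache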

Storage and data replication

SUSE Linux Enterprise High Availability supports Fibre Channel or iSCSI storage area networks (SANs), allowing you to dynamically assign and reassign server storage as needed. It also comes with GFS2 and the cluster Logical Volume Manager (Cluster LVM). For data replication, use DRBD* to mirror a resource's data from the active node to a standby node.
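
As an illustration, a file system on shared storage can itself be managed as a cluster resource through the Filesystem resource agent. This is only a sketch; the device path, mount point and file system type are hypothetical and depend on your setup:

    # Manage a file system on shared storage as a cluster resource
    crm configure primitive web-fs ocf:heartbeat:Filesystem \
        params device="/dev/disk/by-id/example-part1" directory="/srv/www" fstype="xfs" \
        op monitor interval=20s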

Support for virtualized environments

SUSE Linux Enterprise High Availability supports the mixed clustering of both physical and virtual Linux servers. The cluster can recognize and manage resources running in virtual servers and in physical servers, and can also manage KVM virtual machines as resources.

Disaster recovery

SUSE Linux Enterprise High Availability ships with Relax-and-Recover (ReaR), a disaster recovery framework that helps to back up and restore systems.

User-friendly administration tools

SUSE Linux Enterprise High Availability includes tools for configuration and administration:

  • The CRM Shell (crmsh) is a command-line interface for installing and setting up High Availability clusters, configuring resources, and performing monitoring and administration tasks (a short example follows this list).

  • Hawk is a Web-based graphical interface for monitoring and administration of High Availability clusters. It can be accessed using a Web browser from any Linux or non-Linux machine that can connect to the cluster nodes.
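
A very short crmsh session might look like the following sketch. Hawk is typically reached on port 7630 of any cluster node; the node name in the URL is a placeholder:

    # Show the current cluster status, including nodes and resources
    crm status
    # Display the complete cluster configuration
    crm configure show
    # Hawk is then available in a Web browser at a URL such as:
    #   https://node1.example.com:7630/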

1.3 How does High Availability work?

The following figure shows a three-node cluster. Each of the nodes has a Web server installed and hosts two Web sites. All the data, graphics and Web page content for each Web site are stored on a shared disk subsystem connected to each of the nodes.

This diagram shows three cluster nodes: Web server 1, Web server 2, and Web server 3. Each Web server has two Web sites on it. Each Web server is also connected to a Fibre Channel switch, which is then connected to shared storage.
Figure 1: Three-node cluster

During normal cluster operation, each node is in constant communication with the other nodes in the cluster and periodically checks resources to detect failure.

The following figure shows how the cluster moves resources when Web server 1 fails due to hardware or software problems.

This diagram shows three cluster nodes: Web server 1, Web server 2, and Web server 3. Web server 1 is crossed out. Web server 2 and Web server 3 each have three Web sites. Web site A and Web site B are highlighted in orange to show that they moved away from the crossed-out Web server 1.
Figure 2: Three-node cluster after one node fails

Web site A moves to Web server 2 and Web site B moves to Web server 3. The Web sites continue to be available and are evenly distributed between the remaining cluster nodes.

When Web server 1 failed, the High Availability software performed the following tasks:

  1. Detected a failure and verified that Web server 1 was really down.

  2. Remounted on Web servers 2 and 3 the shared data directories that were previously mounted on Web server 1.

  3. Restarted on Web servers 2 and 3 the applications that were running on Web server 1.

  4. Transferred certificates and IP addresses to Web servers 2 and 3.

In this example, failover happened quickly and users regained access to the Web sites within seconds, usually without needing to log in again.

When Web server 1 returns to a normal operating state, Web site A and Web site B can fail back (move back) to Web server 1 automatically. Failing back incurs some downtime, so you can instead configure resources to fail back only at a specified time, when the interruption to service is minimal.

2 Core concepts

This section explains the core concepts of SUSE Linux Enterprise High Availability.

2.1 Clusters and nodes

A high-availability cluster is a group of servers that work together to ensure the availability of applications or services. Servers that are configured as members of the cluster are called nodes. If one node fails, the resources running on it can move to another node in the cluster. SUSE Linux Enterprise High Availability supports clusters with up to 32 nodes. However, clusters usually fall into one of two categories: two-node clusters, or clusters with an odd number of nodes (typically three or five).

2.2 Communication channels

Internal cluster communication is handled by Corosync. The Corosync Cluster Engine is a group communication system that provides messaging, membership and quorum information about the cluster. In SUSE Linux Enterprise High Availability 16, Corosync uses kronosnet (knet) as the default transport protocol.

We highly recommend configuring at least two communication channels for the cluster. The preferred method is to use network device bonding. If you cannot use a network bond, you can set up a redundant communication channel in Corosync (also known as a second ring).
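
After Corosync is configured, the state of the communication links can be checked on any node. A minimal sketch:

    # Show the status of the local node's Corosync links (rings)
    corosync-cfgtool -s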

2.3 Resource management

In a High Availability cluster, the applications and services that need to be highly available are called resources. Cluster resources can include Web sites, e-mail servers, databases, file systems, virtual machines and any other server-based applications or services you want to make available to users at all times. You can start, stop, monitor and move resources as needed. You can also specify whether specific resources should run together on the same node, or start and stop in sequential order. If a cluster node fails, the resources running on it fail over (move) to another node instead of being lost.

In SUSE Linux Enterprise High Availability, the cluster resource manager (CRM) is Pacemaker, which manages and coordinates all cluster services. Pacemaker uses resource agents (RAs) to start, stop and monitor resources. A resource agent abstracts the resource it manages and presents its status to the cluster. SUSE Linux Enterprise High Availability supports many different resource agents that are designed to manage specific types of applications or services.
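
For example, a virtual IP address can be defined as a resource that the cluster starts, monitors and, if necessary, moves to another node. A minimal sketch, with a hypothetical IP address:

    # Define a virtual IP address as a cluster resource and monitor it every 10 seconds
    crm configure primitive web-ip ocf:heartbeat:IPaddr2 \
        params ip=192.168.1.100 cidr_netmask=24 \
        op monitor interval=10s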

2.4 Node fencing

In a split-brain scenario, cluster nodes are divided into two or more groups (or partitions) that do not know about each other. This might be because of a hardware or software failure, or a failed network connection, for example. A split-brain scenario can be resolved by fencing (resetting or powering off) one or more of the nodes. Node fencing prevents a failed node from accessing shared resources and prevents cluster resources from running on a node with an uncertain status. This helps protect the cluster from data corruption.

SUSE Linux Enterprise High Availability uses STONITH as the node fencing mechanism. To be supported, all SUSE Linux Enterprise High Availability clusters must have at least one STONITH device. For critical workloads, we recommend using two or three STONITH devices. A STONITH device can be either a physical device (a power switch) or a software mechanism (SBD in combination with a watchdog).
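
As an illustration of how a STONITH device appears in the cluster configuration, the following sketch defines an SBD-based fencing resource and ensures that fencing is enabled. It assumes SBD has already been set up on all nodes:

    # Define an SBD-based STONITH resource
    crm configure primitive stonith-sbd stonith:external/sbd
    # Make sure node fencing is enabled cluster-wide (this is the default)
    crm configure property stonith-enabled=true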

2.5 Quorum calculation

When communication fails between one or more nodes and the rest of the cluster (a split-brain scenario), a cluster partition occurs. The nodes can only communicate with other nodes in the same partition and are unaware of the separated nodes. A cluster partition has quorum (or is quorate) if it has the majority of nodes (or votes). This is determined by quorum calculation. Quorum must be calculated so the non-quorate nodes can be fenced.

Corosync calculates quorum based on the following formula:

N ≥ ⌊C/2⌋ + 1

N = minimum number of operational nodes
C = number of cluster nodes (⌊C/2⌋ means C/2 rounded down)

For example, a five-node cluster needs a minimum of three operational nodes to maintain quorum.
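
The quorum state of the partition a node belongs to can be checked with the corosync-quorumtool utility. A minimal sketch:

    # Show quorum information for the local node, including expected and total votes
    corosync-quorumtool -s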

Clusters with an even number of nodes, especially two-node clusters, might have equal numbers of nodes in each partition and therefore no majority. To avoid this situation, you can configure the cluster to use QDevice in combination with QNetd. QNetd is an arbitrator running outside the cluster. It communicates with the QDevice daemon running on the cluster nodes to provide a configurable number of votes for quorum calculation. This lets a cluster sustain more node failures than the standard quorum rules allow.
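
On SUSE Linux Enterprise High Availability, QDevice can be added to an existing cluster with the crmsh bootstrap scripts. The following is a sketch only; the QNetd host name is a placeholder, and the exact options may differ between versions:

    # Add QDevice to the cluster, pointing it to an already running QNetd server
    crm cluster init qdevice --qnetd-hostname=qnetd-server.example.com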

2.6 Storage and data replication

High Availability clusters might include a shared disk subsystem connected via Fibre Channel or iSCSI. If a node fails, another node in the cluster automatically mounts the shared disk directories that were previously mounted on the failed node. This gives users continuous access to the directories on the shared disk subsystem. Content stored on the shared disk might include data, applications and services.

The following figure shows what a typical Fibre Channel cluster configuration might look like. The green lines depict connections to an Ethernet power switch, which can reboot a node if a ping request fails.

This diagram shows shared storage connected to a Fibre Channel switch. The switch is then connected to six servers. Each of the six servers is also connected to a network hub, which is then connected to an Ethernet power switch.
Figure 3: Typical Fibre Channel cluster configuration

The following figure shows what a typical iSCSI cluster configuration might look like.

This diagram shows shared storage connected to an Ethernet switch. The switch is then connected to six servers. Each of the six servers is also connected to a network hub, which is then connected to an Ethernet power switch and backed by a network backbone.
Figure 4: Typical iSCSI cluster configuration

Although most clusters include a shared disk subsystem, you can also create a cluster without a shared disk subsystem. The following figure shows what a cluster without a shared disk subsystem might look like.

This diagram shows an Ethernet power switch connected to a network hub. The hub is then connected to six servers.
Figure 5: Typical cluster configuration without shared storage

3 Architecture overview

This section explains the architecture of a SUSE Linux Enterprise High Availability cluster and how the different components interoperate.

3.1 Architecture layers

SUSE Linux Enterprise High Availability has a layered architecture. The following figure shows the different layers and their associated components.

This diagram shows the layers of components on two cluster nodes. Some components are local to the node, and some communicate across nodes, such as Corosync and the CIB.
Figure 6: Architecture of a SUSE Linux Enterprise High Availability cluster
Membership and messaging (Corosync)

The Corosync Cluster Engine is a group communication system that provides messaging, membership and quorum information about the cluster.

Cluster resource manager (Pacemaker)

Pacemaker is the cluster resource manager that reacts to events occurring in the cluster. Events might be nodes that join or leave the cluster, failure of resources, or scheduled activities such as maintenance, for example. The pacemakerd daemon launches and monitors all other related daemons.

The following components are also part of the Pacemaker layer:

  • Cluster Information Database (CIB)

    On every node, Pacemaker maintains the cluster information database (CIB). This is an XML representation of the cluster configuration (including cluster options, nodes, resources, constraints and the relationships to each other). The CIB also reflects the current cluster status. The CIB manager (pacemaker-based) keeps the CIB synchronized across the cluster and handles reading and writing cluster configuration and status.

  • Designated Coordinator (DC)

    The pacemaker-controld daemon is the cluster controller, which coordinates all actions. This daemon has an instance on each cluster node, but only one instance is elected to act as the DC. The DC is elected when the cluster services start, or if the current DC fails or leaves the cluster. The DC decides whether a cluster-wide change must be performed, such as fencing a node or moving resources.

  • Scheduler

    The scheduler runs on every node as the pacemaker-schedulerd daemon, but is only active on the DC. When a cluster transition is needed, the scheduler calculates the expected next state of the cluster and determines what actions need to be scheduled to achieve that state.

Local executor

The local executor is located between the Pacemaker layer and the resources layer on each node. The pacemaker-execd daemon allows Pacemaker to start, stop and monitor resources.

Resources and resource agents

In a High Availability cluster, the services that need to be highly available are called resources. Resource agents (RAs) are scripts that start, stop and monitor cluster resources.
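
These layers can be observed on a running cluster: Pacemaker's view of the nodes and resources is shown by crm_mon, and the CIB itself can be dumped as XML. A minimal sketch:

    # Show a one-shot snapshot of the cluster status as seen by Pacemaker
    crm_mon -1
    # Dump the current CIB (cluster configuration and status) as XML
    cibadmin --query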

3.2 Process flow

Many actions performed in the cluster cause a cluster-wide change, for example, adding or removing a cluster resource or changing resource constraints. The following example explains what happens in the cluster when you perform such an action:

Example 1: Cluster process when you add a new resource
  1. You use the CRM Shell or Hawk to add a new cluster resource. You can do this from any node in the cluster. Adding the resource modifies the CIB.

  2. The CIB change is replicated to all cluster nodes.

  3. Based on the information in the CIB, pacemaker-schedulerd calculates the ideal state of the cluster and how it should be achieved. It feeds a list of instructions to the DC.

  4. The DC sends commands via Corosync, which are received by the pacemaker-controld instances on the other nodes.

  5. Each node uses its local executor (pacemaker-execd) to perform resource modifications. The pacemaker-execd daemon is not cluster-aware and interacts directly with resource agents.

  6. All nodes report the results of their operations back to the DC.

  7. If fencing is required, pacemaker-fenced calls the fencing agent to fence the node.

  8. After the DC concludes that all necessary operations have been successfully performed, the cluster returns to an idle state and waits for further events.

  9. If any operation was not carried out as planned, pacemaker-schedulerd is invoked again with the new information recorded in the CIB.
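
The following minimal sketch walks through this flow with a dummy test resource. It can be run from any cluster node; the resource name is hypothetical:

    # Add a new (dummy) test resource; this modifies the CIB
    crm configure primitive test-dummy ocf:pacemaker:Dummy op monitor interval=30s
    # The change is replicated to all nodes; check where the resource was started
    # and which node is currently the DC
    crm status
    # Stop and remove the test resource again when done
    crm resource stop test-dummy
    crm configure delete test-dummy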

4 Installation options

This section describes the different options for installing a SUSE Linux Enterprise High Availability cluster and includes links to the available installation guides.

The following quick start guides explain how to set up a minimal cluster with default settings:

Installing a Basic Two-Node High Availability Cluster

This quick start guide describes how to set up a basic two-node High Availability cluster with QDevice, diskless SBD and a software watchdog. QDevice is required for this setup so that diskless SBD can handle split-brain scenarios for the two-node cluster.

Installing a Basic Three-Node High Availability Cluster

This quick start guide describes how to set up a basic three-node High Availability cluster with diskless SBD and a software watchdog. Three nodes are required for this setup so that diskless SBD can handle split-brain scenarios without the help of QDevice.

5 For more information

For more information about SUSE Linux Enterprise High Availability, see the following resources:

https://clusterlabs.org/

The upstream project for many of the components in SUSE Linux Enterprise High Availability.

https://www.clusterlabs.org/pacemaker/doc/

Documentation for Pacemaker. For SUSE Linux Enterprise High Availability 16, see the documentation for Pacemaker 3.

HA glossary

active/active, active/passive

How resources run on the nodes. Active/passive means that resources only run on the active node, but can move to the passive node if the active node fails. Active/active means that all nodes are active at once, and resources can run on (and move to) any node in the cluster.

arbitrator

An arbitrator is a machine running outside the cluster to provide an additional instance for cluster calculations. For example, QNetd provides a vote to help QDevice participate in quorum decisions.

CIB (cluster information base)

An XML representation of the whole cluster configuration and status (cluster options, nodes, resources, constraints and the relationships to each other). The CIB manager (pacemaker-based) keeps the CIB synchronized across the cluster and handles requests to modify it.

clone

A clone is an identical copy of an existing node, used to make deploying multiple nodes simpler.

In the context of a cluster resource, a clone is a resource that can be active on multiple nodes. Any resource can be cloned if its resource agent supports it.

cluster

A high-availability cluster is a group of servers (physical or virtual) designed primarily to secure the highest possible availability of data, applications and services. Not to be confused with a high-performance cluster, which shares the application load to achieve faster results.

Cluster logical volume manager (Cluster LVM)

The term Cluster LVM indicates that LVM is being used in a cluster environment. This requires configuration adjustments to protect the LVM metadata on shared storage.

cluster partition

A cluster partition occurs when communication fails between one or more nodes and the rest of the cluster. The nodes are split into partitions but are still active. They can only communicate with nodes in the same partition and are unaware of the separated nodes. This is known as a split brain scenario.

cluster stack

The ensemble of software technologies and components that make up a cluster.

colocation constraint

A type of resource constraint that specifies which resources can or cannot run together on a node.

concurrency violation

A resource that should be running on only one node in the cluster is running on several nodes.

Corosync

Corosync provides reliable messaging, membership and quorum information about the cluster. This is handled by the Corosync Cluster Engine, a group communication system.

CRM (cluster resource manager)

The management entity responsible for coordinating all non-local interactions in a High Availability cluster. SUSE Linux Enterprise High Availability uses Pacemaker as the CRM. It interacts with several components: local executors on its own node and on the other nodes, non-local CRMs, administrative commands, the fencing functionality, and the membership layer.

crmsh (CRM Shell)

The command-line utility crmsh manages the cluster, nodes and resources.

Csync2

A synchronization tool for replicating configuration files across all nodes in the cluster.

DC (designated coordinator)

The pacemaker-controld daemon is the cluster controller, which coordinates all actions. This daemon has an instance on each cluster node, but only one instance is elected to act as the DC. The DC is elected when the cluster services start, or if the current DC fails or leaves the cluster. The DC decides whether a cluster-wide change must be performed, such as fencing a node or moving resources.

disaster

An unexpected interruption of critical infrastructure caused by nature, humans, hardware failure, or software bugs.

disaster recovery

The process by which a function is restored to the normal, steady state after a disaster.

Disaster Recovery Plan

A strategy to recover from a disaster with the minimum impact on IT infrastructure.

DLM (Distributed Lock Manager)

DLM coordinates access to shared resources in a cluster, for example, by managing file locking in clustered file systems to increase performance and availability.

DRBD

DRBD® is a block device designed for building High Availability clusters. It replicates data on a primary device to secondary devices in a way that ensures all copies of the data remain identical.

existing cluster

The term existing cluster is used to refer to any cluster that consists of at least one node. An existing cluster has a basic Corosync configuration that defines the communication channels, but does not necessarily have resource configuration yet.

failover

Occurs when a resource or node fails on one machine and the affected resources move to another node.

failover domain

A named subset of cluster nodes that are eligible to run a resource if a node fails.

fencing

Prevents access to a shared resource by isolated or failing cluster members. There are two classes of fencing: resource-level fencing and node-level fencing. Resource-level fencing ensures exclusive access to a resource. Node-level fencing prevents a failed node from accessing shared resources and prevents resources from running on a node with an uncertain status. This is usually done by resetting or powering off the node.

GFS2

Global File System 2 (GFS2) is a shared disk file system for Linux computer clusters. GFS2 allows all nodes to have direct concurrent access to the same shared block storage. GFS2 has no disconnected operating mode, and no client or server roles. All nodes in a GFS2 cluster function as peers. GFS2 supports up to 32 cluster nodes. Using GFS2 in a cluster requires hardware to allow access to the shared storage, and a lock manager to control access to the storage.

group

Resource groups contain multiple resources that need to be located together, started sequentially and stopped in the reverse order.

Hawk (HA Web Konsole)

A user-friendly Web-based interface for monitoring and administering a High Availability cluster from Linux or non-Linux machines. Hawk can be accessed from any machine that can connect to the cluster nodes, using a graphical Web browser.

heuristics

QDevice supports using a set of commands (heuristics) that run locally on start-up of cluster services, cluster membership change, successful connection to the QNetd server, or optionally at regular times. The result is used in calculations to determine which partition should have quorum.

knet (kronosnet)

A network abstraction layer supporting redundancy, security, fault tolerance, and fast fail-over of network links. In SUSE Linux Enterprise High Availability 16, knet is the default transport protocol for the Corosync communication channels.

local cluster

A single cluster in one location (for example, all nodes are located in one data center). Network latency is minimal. Storage is typically accessed synchronously by all nodes.

local executor

The local executor is located between Pacemaker and the resources on each node. Through the pacemaker-execd daemon, Pacemaker can start, stop and monitor resources.

location

In the context of a whole cluster, location can refer to the physical location of nodes (for example, all nodes might be located in the same data center). In the context of a location constraint, location refers to the nodes on which a resource can or cannot run.

location constraint

A type of resource constraint that defines the nodes on which a resource can or cannot run.

meta attributes (resource options)

Parameters that tell the CRM (cluster resource manager) how to treat a specific resource. For example, you might define a resource's priority or target role.

metro cluster

A single cluster that can stretch over multiple buildings or data centers, with all sites connected by Fibre Channel. Network latency is usually low. Storage is frequently replicated using mirroring or synchronous replication.

network device bonding

Network device bonding combines two or more network interfaces into a single bonded device to increase bandwidth and/or provide redundancy. When using Corosync, the bonded device is not managed by the cluster software. Therefore, the bonded device must be configured on every cluster node that might need to access it.

node

Any server (physical or virtual) that is a member of a cluster.

order constraint

A type of resource constraint that defines the sequence of actions.

Pacemaker

Pacemaker is the CRM (cluster resource manager) in SUSE Linux Enterprise High Availability, or the brain that reacts to events occurring in the cluster. Events might be nodes that join or leave the cluster, failure of resources, or scheduled activities such as maintenance, for example. The pacemakerd daemon launches and monitors all other related daemons.

parameters (instance attributes)

Parameters determine which instance of a service the resource controls.

primitive

A primitive resource is the most basic type of cluster resource.

promotable clone

Promotable clones are a special type of clone resource that can be promoted. Active instances of these resources are divided into two states: promoted and unpromoted (also known as active and passive or primary and secondary).

QDevice

QDevice and QNetd participate in quorum decisions. The corosync-qdevice daemon runs on each cluster node and communicates with QNetd to provide a configurable number of votes, allowing a cluster to sustain more node failures than the standard quorum rules allow.

QNetd

QNetd is an arbitrator that runs outside the cluster. The corosync-qnetd daemon provides a vote to the corosync-qdevice daemon on each node to help it participate in quorum decisions.

quorum

A cluster partition is defined to have quorum (be quorate) if it has the majority of nodes (or votes). Quorum distinguishes exactly one partition. This is part of the algorithm to prevent several disconnected partitions or nodes (split brain) from proceeding and causing data and service corruption. Quorum is a prerequisite for fencing, which then ensures that quorum is unique.

RA (resource agent)

A script acting as a proxy to manage a resource (for example, to start, stop or monitor a resource). SUSE Linux Enterprise High Availability supports different kinds of resource agents.

ReaR (Relax and Recover)

An administrator tool set for creating disaster recovery images.

resource

Any type of service or application that is known to Pacemaker, for example, an IP address, a file system, or a database. The term resource is also used for DRBD, where it names a set of block devices that use a common connection for replication.

resource constraint

Resource constraints specify which cluster nodes resources can run on, what order resources load in, and what other resources a specific resource is dependent on.

See also colocation constraint, location constraint and order constraint.

resource set

As an alternative format for defining location, colocation or order constraints, you can use resource sets, where primitives are grouped together in one set. When creating a constraint, you can specify multiple resources for the constraint to apply to.

resource template

To help create many resources with similar configurations, you can define a resource template. After being defined, it can be referenced in primitives or in certain types of constraints. If a template is referenced in a primitive, the primitive inherits all operations, instance attributes (parameters), meta attributes and utilization attributes defined in the template.

SBD (STONITH Block Device)

SBD provides a node fencing mechanism through the exchange of messages via shared block storage. Alternatively, it can be used in diskless mode. In either case, it needs a hardware or software watchdog on each node to ensure that misbehaving nodes are really stopped.

scheduler

The scheduler is implemented as pacemaker-schedulerd. When a cluster transition is needed, pacemaker-schedulerd calculates the expected next state of the cluster and determines what actions need to be scheduled to achieve the next state.

split brain

A scenario in which the cluster nodes are divided into two or more groups that do not know about each other (either through a software or hardware failure). STONITH prevents a split-brain scenario from badly affecting the entire cluster. Also known as a partitioned cluster scenario.

The term split brain is also used in DRBD but means that the nodes contain different data.

SPOF (single point of failure)

Any component of a cluster that, if it fails, triggers the failure of the entire cluster.

STONITH

An acronym for shoot the other node in the head. It refers to the fencing mechanism that shuts down a misbehaving node to prevent it from causing trouble in a cluster. In a Pacemaker cluster, STONITH is managed by the fencing subsystem pacemaker-fenced.

switchover

The planned moving of resources to other nodes in a cluster. See also failover.

utilization

Tells the CRM what capacity a certain resource requires from a node.

watchdog

SBD (STONITH Block Device) needs a watchdog on each node to ensure that misbehaving nodes are really stopped. SBD feeds the watchdog by regularly writing a service pulse to it. If SBD stops feeding the watchdog, the hardware enforces a system restart. This protects against failures of the SBD process itself, such as becoming stuck on an I/O error.