35 Troubleshooting #
This chapter describes several issues that you may face when you operate a Ceph cluster.
35.1 Reporting Software Problems #
If you come across a problem when running SUSE Enterprise Storage 6
related to some of its components, such as Ceph or Object Gateway, report the
problem to SUSE Technical Support. The recommended way is with the
supportconfig utility.
Tip
Because supportconfig is modular software, make sure
that the supportutils-plugin-ses package is
installed.
tux > rpm -q supportutils-plugin-ses
If it is missing on the Ceph server, install it with
root # zypper ref && zypper in supportutils-plugin-ses
Although you can use supportconfig on the command line,
we recommend using the related YaST module. Find more information about
supportconfig in
https://documentation.suse.com/sles/15-SP1/html/SLES-all/cha-adm-support.html#sec-admsupport-supportconfig.
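For example, to collect the support data on the affected cluster node with the default options, run supportconfig as root. The resulting archive is typically placed under /var/log:
root # supportconfig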
35.2 Sending Large Objects with rados Fails with Full OSD #
rados is a command line utility to manage RADOS object
storage. For more information, see man 8 rados.
If you send a large object to a Ceph cluster with the
rados utility, such as
cephadm@adm > rados -p mypool put myobject /file/to/send
it can fill up all the related OSD space and seriously degrade the cluster performance.
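Before uploading large objects, it is worth checking how full the involved OSDs already are. A quick check, using standard Ceph commands rather than any special procedure, is:
cephadm@adm > ceph df
cephadm@adm > ceph osd df
Compare the reported usage with the cluster's nearfull and full ratios (0.85 and 0.95 by default) to see how much headroom remains.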
35.3 Corrupted XFS File system #
In rare circumstances, such as a kernel bug or broken or misconfigured hardware, the underlying file system (XFS) in which an OSD stores its data might become damaged and unmountable.
If you are sure there is no problem with your hardware and the system is configured properly, raise a bug against the XFS subsystem of the SUSE Linux Enterprise Server kernel and mark the particular OSD as down:
cephadm@adm > ceph osd down OSD_ID
Warning: Do Not Format or Otherwise Modify the Damaged Device
Even though using xfs_repair to fix the problem in the
file system may seem reasonable, do not use it, because the command modifies the
file system. The OSD may start afterward, but it may not function correctly.
Now zap the underlying disk and re-create the OSD by running:
cephadm@osd > ceph-volume lvm zap --data /dev/OSD_DISK_DEVICE
cephadm@osd > ceph-volume lvm prepare --bluestore --data /dev/OSD_DISK_DEVICE
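Depending on how the OSD was originally deployed, you may also need to activate the newly prepared OSD. A minimal sketch using ceph-volume, assuming the OSD was prepared on the same node as above:
cephadm@osd > ceph-volume lvm activate --all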
35.4 'Too Many PGs per OSD' Status Message #
If you receive a Too Many PGs per OSD message after
running ceph status, it means that the
mon_pg_warn_max_per_osd value (300 by default) was
exceeded. This value is compared against the cluster's ratio of PGs per OSD. Exceeding it
means that the cluster setup is not optimal.
As of the Nautilus release, the number of PGs can also be decreased after a pool has been created. To increase or decrease the number of PGs for an existing pool, run the following command:
cephadm@adm > ceph osd pool set POOL_NAME pg_num NUM_OF_PG
Note that this operation is resource intensive and you are encouraged to make such changes incrementally, especially when decreasing (merging) the number of PGs. We recommend instead enabling the PG autoscaler manager module to manage the number of PGs per pool. For details, see Section 20.4.12, “PG Auto-scaler”.
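If you opt for the autoscaler approach, a minimal sketch for enabling it on a pool looks as follows (POOL_NAME is a placeholder for your pool name):
cephadm@adm > ceph mgr module enable pg_autoscaler
cephadm@adm > ceph osd pool set POOL_NAME pg_autoscale_mode on
cephadm@adm > ceph osd pool autoscale-status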
35.5 'nn pg stuck inactive' Status Message #
If you receive a stuck inactive status message after
running ceph status, it means that Ceph does not know
where to replicate the stored data to fulfill the replication rules. It can
happen shortly after the initial Ceph setup and fix itself automatically.
In other cases, this may require manual intervention, such as bringing up a
broken OSD, or adding a new OSD to the cluster. In very rare cases, reducing
the replication level may help.
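To see which placement groups are affected before deciding on a fix, you can list the stuck PGs directly, for example:
cephadm@adm > ceph pg dump_stuck inactive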
If the placement groups are stuck perpetually, you need to check the output
of ceph osd tree. The output should look tree-structured,
similar to the example in Section 35.7, “OSD is Down”.
If the output of ceph osd tree is rather flat, as in the
following example,
cephadm@adm > ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0 root default
0 0 osd.0 up 1.00000 1.00000
1 0 osd.1 up 1.00000 1.00000
2 0 osd.2 up 1.00000 1.00000
you should check that the related CRUSH map has a tree structure. If it is also flat, or with no hosts as in the above example, it may mean that host name resolution is not working correctly across the cluster.
If the hierarchy is incorrect—for example the root contains hosts, but
the OSDs are at the top level and are not themselves assigned to
hosts—you will need to move the OSDs to the correct place in the
hierarchy. This can be done using the ceph osd crush move
and/or ceph osd crush set commands. For further details
see Section 20.5, “CRUSH Map Manipulation”.
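For example, to place osd.2 under an existing host bucket, a sketch assuming a host named doc-ceph3 is already present in the CRUSH map and that a weight of 1.0 is appropriate:
cephadm@adm > ceph osd crush set osd.2 1.0 root=default host=doc-ceph3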
35.6 OSD Weight is 0 #
When an OSD starts, it is assigned a weight. The higher the weight, the higher the chance that the cluster writes data to the OSD. The weight is either specified in the cluster's CRUSH Map, or calculated by the OSDs' start-up script.
In some cases, the calculated weight may be rounded down to zero. This means that the OSD is not scheduled to store data, and no data is written to it. The reason is usually that the disk is too small (smaller than 15 GB) and should be replaced with a bigger one.
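If you need to bring such an OSD into service before the disk can be replaced, you can assign a small non-zero CRUSH weight manually. This is a workaround sketch, and the weight value shown is only an example:
cephadm@adm > ceph osd crush reweight osd.OSD_ID 0.01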
35.7 OSD is Down #
An OSD daemon is either running, or stopped/down. There are three general reasons why an OSD is down:
Hard disk failure.
The OSD crashed.
The server crashed.
You can see the detailed status of OSDs by running
cephadm@adm > ceph osd tree
# id weight type name up/down reweight
-1 0.02998 root default
-2 0.009995 host doc-ceph1
0 0.009995 osd.0 up 1
-3 0.009995 host doc-ceph2
1 0.009995 osd.1 up 1
-4 0.009995 host doc-ceph3
2 0.009995 osd.2 down 1
The example listing shows that osd.2 is down. You may then
check whether the disk where the OSD is located is mounted:
root # lsblk -f
[...]
vdb
├─vdb1 /var/lib/ceph/osd/ceph-2
└─vdb2
You can track the reason why the OSD is down by inspecting its log file
/var/log/ceph/ceph-osd.2.log. After you find and fix
the reason why the OSD is not running, start it with
root # systemctl start ceph-osd@2.service
Do not forget to replace 2 with the actual number of your
stopped OSD.
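If the log file does not reveal the cause, the systemd status and journal of the OSD unit often contain the relevant error messages, for example:
root # systemctl status ceph-osd@2.service
root # journalctl -u ceph-osd@2.service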
35.8 Finding Slow OSDs #
When tuning the cluster performance, it is very important to identify slow storage/OSDs within the cluster. The reason is that if data is written to the slow(est) disk, the complete write operation slows down, because it always waits until it is finished on all the related disks.
It is not trivial to locate the storage bottleneck. You need to examine each and every OSD to find out the ones slowing down the write process. To do a benchmark on a single OSD, run:
ceph tell osd.OSD_ID_NUMBER bench
For example:
cephadm@adm > ceph tell osd.0 bench
{ "bytes_written": 1073741824,
"blocksize": 4194304,
"bytes_per_sec": "19377779.000000"}
Then you need to run this command on each OSD and compare the
bytes_per_sec value to get the slow(est) OSDs.
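A simple way to benchmark all OSDs in one go is a small shell loop over the output of ceph osd ls. This is only a sketch, not part of the documented procedure:
cephadm@adm > for ID in $(ceph osd ls); do
  echo -n "osd.$ID: "
  ceph tell osd.$ID bench | grep bytes_per_sec
done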
35.9 Fixing Clock Skew Warnings #
The time information in all cluster nodes must be synchronized. If a node's time is not fully synchronized, you may get clock skew warnings when checking the state of the cluster.
By default, DeepSea uses the Admin Node as the time server for other cluster nodes. Therefore, if the Admin Node is not virtualized, select one or more time servers or pools, and synchronize the local time against them. Verify that the time synchronization service is enabled on each system start-up. Find more information on setting up time synchronization in https://www.suse.com/documentation/sles-15/book_sle_admin/data/sec_ntp_yast.html.
If the Admin Node is a virtual machine, provide better time sources for the cluster nodes by overriding the default NTP client configuration:
Edit /srv/pillar/ceph/stack/global.yml on the Salt master node and add the following line:
time_server: CUSTOM_NTP_SERVER
To add multiple time servers, the format is as follows:
time_server:
  - CUSTOM_NTP_SERVER1
  - CUSTOM_NTP_SERVER2
  - CUSTOM_NTP_SERVER3
  [...]
Refresh the Salt pillar:
root@master # salt '*' saltutil.pillar_refresh
Verify the changed value:
root@master # salt '*' pillar.items
Apply the new setting:
root@master # salt '*' state.apply ceph.time
If the time skew still occurs on a node, follow these steps to fix it:
root # systemctl stop chronyd.service
root # systemctl stop ceph-mon.target
root # systemctl start chronyd.service
root # systemctl start ceph-mon.target
You can then check the time offset with chronyc
sourcestats.
The Ceph monitors need to have their clocks synchronized to within 0.05 seconds of each other. Refer to Section 33.4, “Time Synchronization of Nodes” for more information.
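You can also ask the monitors how much skew they currently measure, for example:
cephadm@adm > ceph time-sync-status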
35.10 Poor Cluster Performance Caused by Network Problems #
There are many reasons why cluster performance may degrade. One of them can be network problems. In such a case, you may notice the cluster having trouble reaching quorum, OSD and monitor nodes going offline, data transfers taking a long time, or a lot of reconnect attempts.
To check whether cluster performance is degraded by network problems,
inspect the Ceph log files under the /var/log/ceph
directory.
To fix network issues on the cluster, focus on the following points:
Basic network diagnostics. Try DeepSea's diagnostics tools runner net.ping to ping between cluster nodes to see whether individual interfaces can reach specific interfaces, and to report the average response time. Any specific response time much slower than the average will also be reported. For example:
root@master # salt-run net.ping
Succeeded: 8 addresses from 7 minions average rtt 0.15 ms
Try validating all interfaces with jumbo frames enabled:
root@master # salt-run net.jumbo_ping
Succeeded: 8 addresses from 7 minions average rtt 0.26 ms
Network performance benchmark. Try DeepSea's network performance runner net.iperf to test inter-node network bandwidth. On a given cluster node, a number of iperf processes (according to the number of CPU cores) are started as servers. The remaining cluster nodes will be used as clients to generate network traffic. The accumulated bandwidth of all per-node iperf processes is reported. This should reflect the maximum achievable network throughput on all cluster nodes. For example:
root@master # salt-run net.iperf cluster=ceph output=full
192.168.128.1: 8644.0 Mbits/sec
192.168.128.2: 10360.0 Mbits/sec
192.168.128.3: 9336.0 Mbits/sec
192.168.128.4: 9588.56 Mbits/sec
192.168.128.5: 10187.0 Mbits/sec
192.168.128.6: 10465.0 Mbits/sec
Check firewall settings on cluster nodes. Make sure they do not block ports or protocols required by Ceph operation (see the example after this list). See Section 33.9, “Firewall Settings for Ceph” for more information on firewall settings.
Check the networking hardware, such as network cards, cables, or switches, for proper operation.
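For example, to quickly verify on a single node that the firewall is not blocking Ceph traffic and that the Ceph daemons are listening, a sketch assuming firewalld (the default firewall on SUSE Linux Enterprise Server 15) is:
root # firewall-cmd --list-all
root # ss -tulpn | grep ceph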
Tip: Separate Network
To ensure fast and safe network communication between cluster nodes, set up a separate network used exclusively by the cluster OSD and monitor nodes.
35.11 /var Running Out of Space #
By default, the Salt master saves every minion's return for every job in its
job cache. The cache can then be used later to look up
results from previous jobs. The cache directory defaults to
/var/cache/salt/master/jobs/.
Each job return from every minion is saved in a single file. Over time this
directory can grow very large, depending on the number of published jobs and
the value of the keep_jobs option in the
/etc/salt/master file. keep_jobs sets
the number of hours (24 by default) to keep information about past minion
jobs.
keep_jobs: 24
Important: Do Not Set keep_jobs: 0
Setting keep_jobs to '0' will cause the job cache cleaner
to never run, possibly resulting in a full partition.
If you want to disable the job cache, set job_cache to
'False':
job_cache: False
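To see how much space the job cache currently occupies, a quick check on the Salt master is:
root@master # du -sh /var/cache/salt/master/jobs/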
Tip: Restoring a Partition That Is Full Because of the Job Cache
When the partition with job cache files gets full because of a wrong
keep_jobs setting, follow these steps to free disk space
and improve the job cache settings:
Stop the Salt master service:
root@master # systemctl stop salt-master
Change the Salt master configuration related to job cache by editing /etc/salt/master:
job_cache: False
keep_jobs: 1
Clear the Salt master job cache:
root # rm -rfv /var/cache/salt/master/jobs/*
Start the Salt master service:
root@master # systemctl start salt-master
35.12 OSD Panic Occurs when Media Error Happens during FileStore Directory Split #
When a directory split error occurs on a FileStore OSD, the
corresponding OSD is designed to terminate. This makes the problem easier to
detect and avoids introducing inconsistencies. If the OSD terminates
frequently, systemd may disable it permanently, depending on the systemd
configuration. After the OSD is disabled, it will be marked out and the
data migration process will start.
You should notice the OSD termination by regularly running the ceph
status command, and then perform the necessary steps to investigate the
cause. If a media error is the cause of the OSD termination, replace the OSD and
wait for all PGs to complete their backfill to the newly replaced OSD. Then
run a deep scrub for these PGs.
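A deep scrub of a specific placement group can be started as follows, where PG_ID is a placeholder for the affected placement group ID:
cephadm@adm > ceph pg deep-scrub PG_ID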
Important
Do not add new OSDs or change their weights until the deep scrub is complete.
For more details about scrubbing, refer to Section 33.2, “Adjusting Scrubbing”.