22 Troubleshooting #
This chapter describes several issues that you may face when you operate a Ceph cluster.
22.1 Reporting Software Problems #
If you come across a problem when running SUSE Enterprise Storage 5.5 that is related to one of its
components, such as Ceph or Object Gateway, report the problem to SUSE Technical
Support. The recommended way is to use the supportconfig
utility.
Tip
Because supportconfig is modular software, make sure
that the supportutils-plugin-ses package is
installed.
cephadm > rpm -q supportutils-plugin-ses
If it is missing on the Ceph server, install it with
root # zypper ref && zypper in supportutils-plugin-ses
Although you can use supportconfig on the command line,
we recommend using the related YaST module. Find more information about
supportconfig in
https://documentation.suse.com/sles/12-SP5/single-html/SLES-admin/#sec-admsupport-supportconfig.
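If you prefer the command line, a plain run of supportconfig as root is usually sufficient; it collects the data into an archive whose location is printed at the end of the run:
root # supportconfig
Attach the resulting archive to your support request.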
22.2 Sending Large Objects with rados Fails with Full OSD #
rados is a command line utility to manage RADOS object
storage. For more information, see man 8 rados.
If you send a large object to a Ceph cluster with the
rados utility, such as
cephadm > rados -p mypool put myobject /file/to/send
it can fill up all the related OSD space and seriously degrade the cluster performance.
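Before uploading a large object, it is therefore a good idea to check the free capacity of the pools and of the individual OSDs, for example:
cephadm > ceph df
cephadm > ceph osd df
If any OSD is close to full, do not send the object until capacity has been added or freed.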
22.3 Corrupted XFS File system #
In rare circumstances, such as a kernel bug or broken/misconfigured hardware, the underlying file system (XFS) in which an OSD stores its data might become damaged and unmountable.
If you are sure there is no problem with your hardware and the system is configured properly, raise a bug against the XFS subsystem of the SUSE Linux Enterprise Server kernel and mark the particular OSD as down:
cephadm > ceph osd down OSD_IDENTIFICATION
Warning: Do Not Format or Otherwise Modify the Damaged Device
Even though using xfs_repair to fix the problem in the
file system may seem reasonable, do not use it, as the command modifies the
file system. The OSD may start, but its functioning may be affected.
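Before zapping the disk, you may want to capture the related kernel messages for the bug report, for example:
root # dmesg -T | grep -i xfs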
Now zap the underlying disk and re-create the OSD by running:
cephadm > ceph-disk prepare --zap $OSD_DISK_DEVICE $OSD_JOURNAL_DEVICE
for example:
cephadm > ceph-disk prepare --zap /dev/sdb /dev/sdd2
22.4 'Too Many PGs per OSD' Status Message #
If you receive a Too Many PGs per OSD message after
running ceph status, it means that the
mon_pg_warn_max_per_osd value (300 by default) was
exceeded. This value is compared against the ratio of PGs per OSD. The warning
means that the cluster setup is not optimal.
The number of PGs cannot be reduced after the pool is created. Pools that do not yet contain any data can safely be deleted and then re-created with a lower number of PGs. Where pools already contain data, the only solution is to add OSDs to the cluster so that the ratio of PGs per OSD becomes lower.
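For example, an empty pool could be deleted and re-created with a lower PG count as follows. This is a sketch only: testpool and the PG count 64 are example values, and deleting pools requires the mon allow pool delete option to be enabled (see the ceph.conf example in Section 22.12).
cephadm > ceph osd pool delete testpool testpool --yes-i-really-really-mean-it
cephadm > ceph osd pool create testpool 64 64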
22.5 'nn pg stuck inactive' Status Message #
If you receive a stuck inactive status message after
running ceph status, it means that Ceph does not know
where to replicate the stored data to fulfill the replication rules. It can
happen shortly after the initial Ceph setup and fix itself automatically.
In other cases, this may require manual intervention, such as bringing up a
broken OSD, or adding a new OSD to the cluster. In very rare cases, reducing
the replication level may help.
If the placement groups are stuck perpetually, you need to check the output
of ceph osd tree. The output should look tree-structured,
similar to the example in Section 22.7, “OSD is Down”.
If the output of ceph osd tree is rather flat as in the
following example
cephadm > ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0 root default
0 0 osd.0 up 1.00000 1.00000
1 0 osd.1 up 1.00000 1.00000
2 0 osd.2 up 1.00000 1.00000
You should check that the related CRUSH map has a tree structure. If it is also flat, or with no hosts as in the above example, it may mean that host name resolution is not working correctly across the cluster.
If the hierarchy is incorrect—for example the root contains hosts, but
the OSDs are at the top level and are not themselves assigned to
hosts—you will need to move the OSDs to the correct place in the
hierarchy. This can be done using the ceph osd crush move
and/or ceph osd crush set commands. For further details
see Section 7.4, “CRUSH Map Manipulation”.
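For example, to place an OSD under an existing host bucket, you can set its CRUSH location and weight explicitly. This is a sketch only; osd.0, the weight 1.0, and the host name doc-ceph1 are example values that must match your cluster:
cephadm > ceph osd crush set osd.0 1.0 root=default host=doc-ceph1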
22.6 OSD Weight is 0 #
When an OSD starts, it is assigned a weight. The higher the weight, the higher the chance that the cluster writes data to the OSD. The weight is either specified in the cluster CRUSH Map, or calculated by the OSDs' start-up script.
In some cases, the calculated value for an OSD's weight may be rounded down to zero. This means that the OSD is not scheduled to store data, and no data is written to it. The reason is usually that the disk is too small (smaller than 15 GB) and should be replaced with a bigger one.
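If replacing the disk is not immediately possible, you can check the assigned weight with ceph osd tree and set a non-zero CRUSH weight manually as a temporary workaround. In the following sketch, osd.0 and the weight 0.01 are example values:
cephadm > ceph osd crush reweight osd.0 0.01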
22.7 OSD is Down #
An OSD daemon is either running, or stopped/down. There are three general reasons why an OSD is down:
Hard disk failure.
The OSD crashed.
The server crashed.
You can see the detailed status of OSDs by running
cephadm > ceph osd tree
# id weight type name up/down reweight
-1 0.02998 root default
-2 0.009995 host doc-ceph1
0 0.009995 osd.0 up 1
-3 0.009995 host doc-ceph2
1 0.009995 osd.1 up 1
-4 0.009995 host doc-ceph3
2 0.009995 osd.2 down 1
The example listing shows that osd.2 is down. You can then
check whether the disk where the OSD is located is mounted:
root # lsblk -f
[...]
vdb
├─vdb1 /var/lib/ceph/osd/ceph-2
└─vdb2
You can track the reason why the OSD is down by inspecting its log file
/var/log/ceph/ceph-osd.2.log. After you find and fix
the reason why the OSD is not running, start it with
root # systemctl start ceph-osd@2.service
Do not forget to replace 2 with the actual number of your
stopped OSD.
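If the OSD fails to start again, its systemd status and journal usually contain the reason as well, for example:
root # systemctl status ceph-osd@2.service
root # journalctl -u ceph-osd@2.service
As above, replace 2 with the number of the affected OSD.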
22.8 Finding Slow OSDs #
When tuning the cluster performance, it is very important to identify slow storage/OSDs within the cluster. If data is written to the slowest disk, the complete write operation slows down, because it always waits until it is finished on all the related disks.
It is not trivial to locate the storage bottleneck. You need to examine each and every OSD to find out the ones slowing down the write process. To do a benchmark on a single OSD, run:
cephadm > ceph tell osd.OSD_ID_NUMBER bench
For example:
cephadm > ceph tell osd.0 bench
{ "bytes_written": 1073741824,
"blocksize": 4194304,
"bytes_per_sec": "19377779.000000"}
Then you need to run this command on each OSD and compare the
bytes_per_sec value to get the slow(est) OSDs.
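One way to compare all OSDs is a simple shell loop over the IDs reported by ceph osd ls. This is a sketch only; run it on a node with an admin keyring, and note that each bench run writes test data and adds load to the cluster:
cephadm > for i in $(ceph osd ls); do echo "osd.$i:"; ceph tell osd.$i bench; done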
22.9 Fixing Clock Skew Warnings #
The time information in all cluster nodes must be synchronized. If a node's time is not fully synchronized, you may get clock skew warnings when checking the state of the cluster.
Time synchronization is managed with NTP (see http://en.wikipedia.org/wiki/Network_Time_Protocol). Set each node to synchronize its time with one or more NTP servers, preferably to the same group of NTP servers. If the time skew still occurs on a node, follow these steps to fix it:
root # systemctl stop ntpd.service
root # systemctl stop ceph-mon.target
root # systemctl start ntpd.service
root # systemctl start ceph-mon.target
You can then query the NTP peers and check the time offset with
sudo ntpq -p.
The Ceph monitors need to have their clocks synchronized to within 0.05 seconds of each other. Refer to Section 20.4, “Time Synchronization of Nodes” for more information.
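To see which monitors are affected and how large the offset is, you can query the cluster health and the monitors' time synchronization status, for example:
cephadm > ceph health detail
cephadm > ceph time-sync-status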
22.10 Poor Cluster Performance Caused by Network Problems #
There are many reasons why cluster performance may become weak. One of them can be network problems. In that case, you may notice the cluster failing to reach quorum, OSD and monitor nodes going offline, data transfers taking a long time, or a lot of reconnect attempts.
To check whether cluster performance is degraded by network problems,
inspect the Ceph log files under the /var/log/ceph
directory.
To fix network issues on the cluster, focus on the following points:
Basic network diagnostics. Try DeepSea's diagnostics runner
net.ping to ping between cluster nodes and see whether each interface can reach a specific interface, together with the average response time. Any specific response time much slower than the average is also reported. For example:
root@master # salt-run net.ping
Succeeded: 8 addresses from 7 minions average rtt 0.15 ms
Try validating all interfaces with jumbo frames enabled:
root@master # salt-run net.jumbo_ping
Succeeded: 8 addresses from 7 minions average rtt 0.26 ms
Network performance benchmark. Try DeepSea's network performance runner
net.iperf to test inter-node network bandwidth. On a given cluster node, a number of iperf processes (according to the number of CPU cores) are started as servers. The remaining cluster nodes are used as clients to generate network traffic. The accumulated bandwidth of all per-node iperf processes is reported. This should reflect the maximum achievable network throughput on all cluster nodes. For example:
root@master # salt-run net.iperf cluster=ceph output=full
192.168.128.1: 8644.0 Mbits/sec
192.168.128.2: 10360.0 Mbits/sec
192.168.128.3: 9336.0 Mbits/sec
192.168.128.4: 9588.56 Mbits/sec
192.168.128.5: 10187.0 Mbits/sec
192.168.128.6: 10465.0 Mbits/sec
Check firewall settings on cluster nodes. Make sure they do not block ports/protocols required by Ceph operation. See Section 20.10, “Firewall Settings for Ceph” for more information on firewall settings.
Check the networking hardware, such as network cards, cables, or switches, for proper operation.
Tip: Separate Network
To ensure fast and safe network communication between cluster nodes, set up a separate network used exclusively by the cluster OSD and monitor nodes.
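In addition to the checks above, interface-level error and drop counters can point to faulty cables or switch ports. For example, where eth0 is an example name for the interface used by the cluster network and driver support for extended statistics is assumed:
root # ip -s link show
root # ethtool -S eth0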
22.11 /var Running Out of Space #
By default, the Salt master saves every minion's return for every job in its
job cache. The cache can then be used later to look up
results for previous jobs. The cache directory defaults to
/var/cache/salt/master/jobs/.
Each job return from every minion is saved in a single file. Over time this
directory can grow very large, depending on the number of published jobs and
the value of the keep_jobs option in the
/etc/salt/master file. keep_jobs sets
the number of hours (24 by default) to keep information about past minion
jobs.
keep_jobs: 24
Important: Do Not Set keep_jobs: 0
Setting keep_jobs to '0' will cause the job cache cleaner
to never run, possibly resulting in a full partition.
If you want to disable the job cache, set job_cache to
'False':
job_cache: False
Tip: Partition Full because of Job Cache
When the partition with the job cache files gets full because of a wrong
keep_jobs setting, follow these steps to free disk space
and improve the job cache settings:
Stop the Salt master service:
root@master # systemctl stop salt-master
Change the Salt master configuration related to the job cache by editing
/etc/salt/master:
job_cache: False
keep_jobs: 1
Clear the Salt master job cache:
root # rm -rfv /var/cache/salt/master/jobs/*
Start the Salt master service:
root@master # systemctl start salt-master
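You can verify that the disk space was actually freed by checking the size of the job cache directory, for example:
root@master # du -sh /var/cache/salt/master/jobs/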
22.12 Too Many PGs Per OSD #
The TOO_MANY_PGS flag is raised when the number of PGs in
use is above the configurable threshold of mon_pg_warn_max_per_osd
PGs per OSD. If this threshold is exceeded, the cluster does not allow new pools
to be created, pool pg_num to be increased, or pool
replication to be increased.
SUSE Enterprise Storage 4 and 5.5 have two ways to solve this issue.
Procedure 22.1: Solution 1 #
Set the following in your
ceph.conf:
[global]
mon_max_pg_per_osd = 800  # depends on your number of PGs
osd max pg per osd hard ratio = 10  # the default is 2; we recommend setting it to at least 5
mon allow pool delete = true  # without it you cannot remove a pool
Restart all MONs and OSDs one by one.
Check the values on your MON and OSD, replacing ID with the respective daemon ID:
cephadm > ceph --admin-daemon /var/run/ceph/ceph-mon.ID.asok config get mon_max_pg_per_osd
cephadm > ceph --admin-daemon /var/run/ceph/ceph-osd.ID.asok config get osd_max_pg_per_osd_hard_ratio
Execute the following to determine your default
pg_num:
cephadm > rados lspools
cephadm > ceph osd pool get USER-EMAIL pg_num
With caution, execute the following commands to remove pools:
cephadm > ceph osd pool create USER-EMAIL.new 8
cephadm > rados cppool USER-EMAIL USER-EMAIL.new
cephadm > ceph osd pool delete USER-EMAIL USER-EMAIL --yes-i-really-really-mean-it
cephadm > ceph osd pool rename USER-EMAIL.new USER-EMAIL
cephadm > ceph osd pool application enable USER-EMAIL rgw
If this does not remove enough PGs per OSD and you are still receiving blocking requests, you may need to find another pool to remove.
Procedure 22.2: Solution 2 #
Create a new pool with the correct PG count:
cephadm > ceph osd pool create NEW-POOL PG-COUNT
Copy the contents of the old pool to the new pool:
cephadm > rados cppool OLD-POOL NEW-POOL
Remove the old pool:
cephadm > ceph osd pool delete OLD-POOL OLD-POOL --yes-i-really-really-mean-it
Rename the new pool:
cephadm > ceph osd pool rename NEW-POOL OLD-POOL
Restart the Object Gateway.
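After either procedure, you can verify the resulting PG count and the overall cluster health, for example:
cephadm > ceph osd pool get OLD-POOL pg_num
cephadm > ceph status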