35 Troubleshooting #
This chapter describes several issues that you may face when you operate a Ceph cluster.
35.1 Reporting Software Problems #
If you come across a problem when running SUSE Enterprise Storage 6
related to some of its components, such as Ceph or Object Gateway, report the
problem to SUSE Technical Support. The recommended way is with the
supportconfig utility.
Tip
Because supportconfig is modular software, make sure
that the supportutils-plugin-ses package is
installed.
tux > rpm -q supportutils-plugin-ses
If it is missing on the Ceph server, install it with
root # zypper ref && zypper in supportutils-plugin-ses
Although you can use supportconfig on the command line,
we recommend using the related YaST module. Find more information about
supportconfig in
https://documentation.suse.com/sles/15-SP1/html/SLES-all/cha-adm-support.html#sec-admsupport-supportconfig.
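For example, to collect the support data on the affected cluster node with the default options, run supportconfig as root. The resulting archive is typically placed under /var/log:
root # supportconfig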
35.2 Sending Large Objects with rados Fails with Full OSD #
rados is a command line utility to manage RADOS object
storage. For more information, see man 8 rados.
If you send a large object to a Ceph cluster with the
rados utility, such as
cephadm@adm > rados -p mypool put myobject /file/to/send
it can fill up all the related OSD space and seriously degrade the cluster performance.
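Before uploading large objects, it is worth checking how full the involved OSDs already are. A quick check, using standard Ceph commands rather than any special procedure, is:
cephadm@adm > ceph df
cephadm@adm > ceph osd df
Compare the reported usage with the cluster's nearfull and full ratios (0.85 and 0.95 by default) to see how much headroom remains.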
35.3 Corrupted XFS File system #
In rare circumstances, such as a kernel bug or broken or misconfigured hardware, the underlying file system (XFS) in which an OSD stores its data might become damaged and unmountable.
If you are sure there is no problem with your hardware and the system is configured properly, raise a bug against the XFS subsystem of the SUSE Linux Enterprise Server kernel and mark the particular OSD as down:
cephadm@adm > ceph osd down OSD_ID
Warning: Do Not Format or Otherwise Modify the Damaged Device
Even though using xfs_repair to fix the problem in the
file system may seem reasonable, do not use it, because the command modifies the
file system. The OSD may start afterward, but it may not function correctly.
Now zap the underlying disk and re-create the OSD by running:
cephadm@osd > ceph-volume lvm zap --data /dev/OSD_DISK_DEVICE
cephadm@osd > ceph-volume lvm prepare --bluestore --data /dev/OSD_DISK_DEVICE
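Depending on how the OSD was originally deployed, you may also need to activate the newly prepared OSD. A minimal sketch using ceph-volume, assuming the OSD was prepared on the same node as above:
cephadm@osd > ceph-volume lvm activate --all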
35.4 'Too Many PGs per OSD' Status Message #
If you receive a Too Many PGs per OSD message after
running ceph status, it means that the
mon_pg_warn_max_per_osd value (300 by default) was
exceeded. This value is compared against the cluster's ratio of PGs per OSD. Exceeding it
means that the cluster setup is not optimal.
As of the Nautilus release, the number of PGs can also be decreased after a pool has been created. To increase or decrease the number of PGs for an existing pool, run the following command:
cephadm@adm > ceph osd pool set POOL_NAME pg_num NUM_OF_PG
Note that this operation is resource intensive and you are encouraged to make such changes incrementally, especially when decreasing (merging) the number of PGs. We recommend instead enabling the PG autoscaler manager module to manage the number of PGs per pool. For details, see Section 20.4.12, “PG Auto-scaler”.
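If you opt for the autoscaler approach, a minimal sketch for enabling it on a pool looks as follows (POOL_NAME is a placeholder for your pool name):
cephadm@adm > ceph mgr module enable pg_autoscaler
cephadm@adm > ceph osd pool set POOL_NAME pg_autoscale_mode on
cephadm@adm > ceph osd pool autoscale-status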
35.5 'nn pg stuck inactive' Status Message #
If you receive a stuck inactive status message after
running ceph status, it means that Ceph does not know
where to replicate the stored data to fulfill the replication rules. It can
happen shortly after the initial Ceph setup and fix itself automatically.
In other cases, this may require manual intervention, such as bringing up a
broken OSD, or adding a new OSD to the cluster. In very rare cases, reducing
the replication level may help.
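To see which placement groups are affected before deciding on a fix, you can list the stuck PGs directly, for example:
cephadm@adm > ceph pg dump_stuck inactive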
If the placement groups are stuck perpetually, you need to check the output
of ceph osd tree. The output should look tree-structured,
similar to the example in Section 35.7, “OSD is Down”.
If the output of ceph osd tree is rather flat, as in the
following example,
cephadm@adm > ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0 root default
0 0 osd.0 up 1.00000 1.00000
1 0 osd.1 up 1.00000 1.00000
2 0 osd.2 up 1.00000 1.00000
you should check that the related CRUSH map has a tree structure. If it is also flat, or with no hosts as in the above example, it may mean that host name resolution is not working correctly across the cluster.
If the hierarchy is incorrect—for example the root contains hosts, but
the OSDs are at the top level and are not themselves assigned to
hosts—you will need to move the OSDs to the correct place in the
hierarchy. This can be done using the ceph osd crush move
and/or ceph osd crush set commands. For further details
see Section 20.5, “CRUSH Map Manipulation”.
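For example, to place osd.2 under an existing host bucket, a sketch assuming a host named doc-ceph3 is already present in the CRUSH map and that a weight of 1.0 is appropriate:
cephadm@adm > ceph osd crush set osd.2 1.0 root=default host=doc-ceph3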
35.6 OSD Weight is 0 #
When an OSD starts, it is assigned a weight. The higher the weight, the higher the chance that the cluster writes data to the OSD. The weight is either specified in the cluster's CRUSH Map, or calculated by the OSDs' start-up script.
In some cases, the calculated weight may be rounded down to zero. This means that the OSD is not scheduled to store data, and no data is written to it. The reason is usually that the disk is too small (smaller than 15 GB) and should be replaced with a bigger one.
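If you need to bring such an OSD into service before the disk can be replaced, you can assign a small non-zero CRUSH weight manually. This is a workaround sketch, and the weight value shown is only an example:
cephadm@adm > ceph osd crush reweight osd.OSD_ID 0.01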
35.7 OSD is Down #
An OSD daemon is either running, or stopped/down. There are three general reasons why an OSD is down:
Hard disk failure.
The OSD crashed.
The server crashed.
You can see the detailed status of OSDs by running
cephadm@adm > ceph osd tree
# id weight type name up/down reweight
-1 0.02998 root default
-2 0.009995 host doc-ceph1
0 0.009995 osd.0 up 1
-3 0.009995 host doc-ceph2
1 0.009995 osd.1 up 1
-4 0.009995 host doc-ceph3
2 0.009995 osd.2 down 1
The example listing shows that osd.2 is down. You may then
check whether the disk where the OSD is located is mounted:
root # lsblk -f
[...]
vdb
├─vdb1 /var/lib/ceph/osd/ceph-2
└─vdb2
You can track the reason why the OSD is down by inspecting its log file
/var/log/ceph/ceph-osd.2.log. After you find and fix
the reason why the OSD is not running, start it with
root # systemctl start ceph-osd@2.service
Do not forget to replace 2 with the actual number of your
stopped OSD.
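If the log file does not reveal the cause, the systemd status and journal of the OSD unit often contain the relevant error messages, for example:
root # systemctl status ceph-osd@2.service
root # journalctl -u ceph-osd@2.service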
35.8 Finding Slow OSDs #
When tuning the cluster performance, it is very important to identify slow storage/OSDs within the cluster. The reason is that if data is written to the slow(est) disk, the complete write operation slows down, because it always waits until it is finished on all the related disks.
It is not trivial to locate the storage bottleneck. You need to examine each and every OSD to find out the ones slowing down the write process. To do a benchmark on a single OSD, run:
ceph tell osd.OSD_ID_NUMBER bench
For example:
cephadm@adm > ceph tell osd.0 bench
{ "bytes_written": 1073741824,
"blocksize": 4194304,
"bytes_per_sec": "19377779.000000"}
Then you need to run this command on each OSD and compare the
bytes_per_sec value to get the slow(est) OSDs.
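A simple way to benchmark all OSDs in one go is a small shell loop over the output of ceph osd ls. This is only a sketch, not part of the documented procedure:
cephadm@adm > for ID in $(ceph osd ls); do
  echo -n "osd.$ID: "
  ceph tell osd.$ID bench | grep bytes_per_sec
done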
35.9 Fixing Clock Skew Warnings #
The time information in all cluster nodes must be synchronized. If a node's time is not fully synchronized, you may get clock skew warnings when checking the state of the cluster.
By default, DeepSea uses the Admin Node as the time server for other cluster nodes. Therefore, if the Admin Node is not virtualized, select one or more time servers or pools, and synchronize the local time against them. Verify that the time synchronization service is enabled on each system start-up. Find more information on setting up time synchronization in https://www.suse.com/documentation/sles-15/book_sle_admin/data/sec_ntp_yast.html.
If the Admin Node is a virtual machine, provide better time sources for the cluster nodes by overriding the default NTP client configuration:
Edit /srv/pillar/ceph/stack/global.yml on the Salt master node and add the following line:
time_server: CUSTOM_NTP_SERVER
To add multiple time servers, the format is as follows:
time_server:
  - CUSTOM_NTP_SERVER1
  - CUSTOM_NTP_SERVER2
  - CUSTOM_NTP_SERVER3
  [...]
Refresh the Salt pillar:
root@master # salt '*' saltutil.pillar_refresh
Verify the changed value:
root@master # salt '*' pillar.items
Apply the new setting:
root@master # salt '*' state.apply ceph.time
If the time skew still occurs on a node, follow these steps to fix it:
root # systemctl stop chronyd.service
root # systemctl stop ceph-mon.target
root # systemctl start chronyd.service
root # systemctl start ceph-mon.target
You can then check the time offset with chronyc
sourcestats.
The Ceph monitors need to have their clocks synchronized to within 0.05 seconds of each other. Refer to Section 33.4, “Time Synchronization of Nodes” for more information.
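You can also ask the monitors how much skew they currently measure, for example:
cephadm@adm > ceph time-sync-status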
35.10 Poor Cluster Performance Caused by Network Problems #
There are many reasons why cluster performance may degrade. One of them can be network problems. In such a case, you may notice the cluster having trouble reaching quorum, OSD and monitor nodes going offline, data transfers taking a long time, or a lot of reconnect attempts.
To check whether cluster performance is degraded by network problems,
inspect the Ceph log files under the /var/log/ceph
directory.
To fix network issues on the cluster, focus on the following points:
Basic network diagnostics. Try DeepSea's diagnostics tools runner net.ping to ping between cluster nodes to see whether individual interfaces can reach specific interfaces, and to report the average response time. Any specific response time much slower than the average will also be reported. For example:
root@master # salt-run net.ping
Succeeded: 8 addresses from 7 minions average rtt 0.15 ms
Try validating all interfaces with jumbo frames enabled:
root@master # salt-run net.jumbo_ping
Succeeded: 8 addresses from 7 minions average rtt 0.26 ms
Network performance benchmark. Try DeepSea's network performance runner net.iperf to test inter-node network bandwidth. On a given cluster node, a number of iperf processes (according to the number of CPU cores) are started as servers. The remaining cluster nodes will be used as clients to generate network traffic. The accumulated bandwidth of all per-node iperf processes is reported. This should reflect the maximum achievable network throughput on all cluster nodes. For example:
root@master # salt-run net.iperf cluster=ceph output=full
192.168.128.1: 8644.0 Mbits/sec
192.168.128.2: 10360.0 Mbits/sec
192.168.128.3: 9336.0 Mbits/sec
192.168.128.4: 9588.56 Mbits/sec
192.168.128.5: 10187.0 Mbits/sec
192.168.128.6: 10465.0 Mbits/sec
Check firewall settings on cluster nodes. Make sure they do not block ports or protocols required by Ceph operation (see the example after this list). See Section 33.9, “Firewall Settings for Ceph” for more information on firewall settings.
Check the networking hardware, such as network cards, cables, or switches, for proper operation.
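For example, to quickly verify on a single node that the firewall is not blocking Ceph traffic and that the Ceph daemons are listening, a sketch assuming firewalld (the default firewall on SUSE Linux Enterprise Server 15) is:
root # firewall-cmd --list-all
root # ss -tulpn | grep ceph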
Tip: Separate Network
To ensure fast and safe network communication between cluster nodes, set up a separate network used exclusively by the cluster OSD and monitor nodes.
35.11 /var Running Out of Space #
By default, the Salt master saves every minion's return for every job in its
job cache. The cache can then be used later to look up
results from previous jobs. The cache directory defaults to
/var/cache/salt/master/jobs/.
Each job return from every minion is saved in a single file. Over time this
directory can grow very large, depending on the number of published jobs and
the value of the keep_jobs option in the
/etc/salt/master file. keep_jobs sets
the number of hours (24 by default) to keep information about past minion
jobs.
keep_jobs: 24
Important: Do Not Set keep_jobs: 0
Setting keep_jobs to '0' will cause the job cache cleaner
to never run, possibly resulting in a full partition.
If you want to disable the job cache, set job_cache to
'False':
job_cache: False
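To see how much space the job cache currently occupies, a quick check on the Salt master is:
root@master # du -sh /var/cache/salt/master/jobs/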
Tip: Restoring a Partition That Is Full Because of the Job Cache
When the partition with job cache files gets full because of a wrong
keep_jobs setting, follow these steps to free disk space
and improve the job cache settings:
Stop the Salt master service:
root@master # systemctl stop salt-master
Change the Salt master configuration related to job cache by editing /etc/salt/master:
job_cache: False
keep_jobs: 1
Clear the Salt master job cache:
root # rm -rfv /var/cache/salt/master/jobs/*
Start the Salt master service:
root@master # systemctl start salt-master
35.12 OSD Panic Occurs when Media Error Happens during FileStore Directory Split #
When a directory split error occurs on a FileStore OSD, the
corresponding OSD is designed to terminate. This makes the problem easier to
detect and avoids introducing inconsistencies. If the OSD terminates
frequently, systemd may disable it permanently, depending on the systemd
configuration. After the OSD is disabled, it will be marked out and the
data migration process will start.
You should notice the OSD termination by regularly running the ceph
status command, and then perform the necessary steps to investigate the
cause. If a media error is the cause of the OSD termination, replace the OSD and
wait for all PGs to complete their backfill to the newly replaced OSD. Then
run a deep scrub for these PGs.
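A deep scrub of a specific placement group can be started as follows, where PG_ID is a placeholder for the affected placement group ID:
cephadm@adm > ceph pg deep-scrub PG_ID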
Important
Do not add new OSDs or change their weights until the deep scrub is complete.
For more details about scrubbing, refer to Section 33.2, “Adjusting Scrubbing”.