22 Troubleshooting #
This chapter describes several issues that you may face when you operate a Ceph cluster.
22.1 Reporting Software Problems #
If you come across a problem when running SUSE Enterprise Storage 5.5 that is related to one of its
components, such as Ceph or Object Gateway, report the problem to SUSE Technical
Support. The recommended way is to use the supportconfig
utility.
Tip
Because supportconfig is modular software, make sure
that the supportutils-plugin-ses package is
installed.
cephadm > rpm -q supportutils-plugin-ses
If it is missing on the Ceph server, install it with
root # zypper ref && zypper in supportutils-plugin-ses
Although you can use supportconfig on the command line,
we recommend using the related YaST module. Find more information about
supportconfig in
https://documentation.suse.com/sles/12-SP5/single-html/SLES-admin/#sec-admsupport-supportconfig.
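If you prefer the command line, a plain run of supportconfig as root is usually sufficient; it collects the data into an archive whose location is printed at the end of the run:
root # supportconfig
Attach the resulting archive to your support request.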
22.2 Sending Large Objects with rados Fails with Full OSD #
rados is a command line utility to manage RADOS object
storage. For more information, see man 8 rados.
If you send a large object to a Ceph cluster with the
rados utility, such as
cephadm > rados -p mypool put myobject /file/to/send
it can fill up all the related OSD space and seriously degrade the cluster performance.
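Before uploading a large object, it is therefore a good idea to check the free capacity of the pools and of the individual OSDs, for example:
cephadm > ceph df
cephadm > ceph osd df
If any OSD is close to full, do not send the object until capacity has been added or freed.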
22.3 Corrupted XFS File system #
In rare circumstances, such as a kernel bug or broken/misconfigured hardware, the underlying file system (XFS) in which an OSD stores its data might become damaged and unmountable.
If you are sure there is no problem with your hardware and the system is configured properly, raise a bug against the XFS subsystem of the SUSE Linux Enterprise Server kernel and mark the particular OSD as down:
cephadm > ceph osd down OSD_IDENTIFICATION
Warning: Do Not Format or Otherwise Modify the Damaged Device
Even though using xfs_repair to fix the problem in the
file system may seem reasonable, do not use it, as the command modifies the
file system. The OSD may start, but its functioning may be affected.
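Before zapping the disk, you may want to capture the related kernel messages for the bug report, for example:
root # dmesg -T | grep -i xfs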
Now zap the underlying disk and re-create the OSD by running:
cephadm > ceph-disk prepare --zap $OSD_DISK_DEVICE $OSD_JOURNAL_DEVICE
for example:
cephadm > ceph-disk prepare --zap /dev/sdb /dev/sdd2
22.4 'Too Many PGs per OSD' Status Message #
If you receive a Too Many PGs per OSD message after
running ceph status, it means that the
mon_pg_warn_max_per_osd value (300 by default) was
exceeded. This value is compared against the ratio of PGs per OSD. The warning
means that the cluster setup is not optimal.
The number of PGs cannot be reduced after the pool is created. Pools that do not yet contain any data can safely be deleted and then re-created with a lower number of PGs. Where pools already contain data, the only solution is to add OSDs to the cluster so that the ratio of PGs per OSD becomes lower.
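For example, an empty pool could be deleted and re-created with a lower PG count as follows. This is a sketch only: testpool and the PG count 64 are example values, and deleting pools requires the mon allow pool delete option to be enabled (see the ceph.conf example in Section 22.12).
cephadm > ceph osd pool delete testpool testpool --yes-i-really-really-mean-it
cephadm > ceph osd pool create testpool 64 64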
22.5 'nn pg stuck inactive' Status Message #
If you receive a stuck inactive status message after
running ceph status, it means that Ceph does not know
where to replicate the stored data to fulfill the replication rules. It can
happen shortly after the initial Ceph setup and fix itself automatically.
In other cases, this may require manual intervention, such as bringing up a
broken OSD, or adding a new OSD to the cluster. In very rare cases, reducing
the replication level may help.
If the placement groups are stuck perpetually, you need to check the output
of ceph osd tree. The output should look tree-structured,
similar to the example in Section 22.7, “OSD is Down”.
If the output of ceph osd tree is rather flat as in the
following example
cephadm > ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0 root default
0 0 osd.0 up 1.00000 1.00000
1 0 osd.1 up 1.00000 1.00000
2 0 osd.2 up 1.00000 1.00000
You should check that the related CRUSH map has a tree structure. If it is also flat, or with no hosts as in the above example, it may mean that host name resolution is not working correctly across the cluster.
If the hierarchy is incorrect—for example the root contains hosts, but
the OSDs are at the top level and are not themselves assigned to
hosts—you will need to move the OSDs to the correct place in the
hierarchy. This can be done using the ceph osd crush move
and/or ceph osd crush set commands. For further details
see Section 7.4, “CRUSH Map Manipulation”.
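For example, to place an OSD under an existing host bucket, you can set its CRUSH location and weight explicitly. This is a sketch only; osd.0, the weight 1.0, and the host name doc-ceph1 are example values that must match your cluster:
cephadm > ceph osd crush set osd.0 1.0 root=default host=doc-ceph1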
22.6 OSD Weight is 0 #
When an OSD starts, it is assigned a weight. The higher the weight, the higher the chance that the cluster writes data to the OSD. The weight is either specified in the cluster CRUSH Map, or calculated by the OSDs' start-up script.
In some cases, the calculated value for an OSD's weight may be rounded down to zero. This means that the OSD is not scheduled to store data, and no data is written to it. The reason is usually that the disk is too small (smaller than 15 GB) and should be replaced with a bigger one.
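If replacing the disk is not immediately possible, you can check the assigned weight with ceph osd tree and set a non-zero CRUSH weight manually as a temporary workaround. In the following sketch, osd.0 and the weight 0.01 are example values:
cephadm > ceph osd crush reweight osd.0 0.01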
22.7 OSD is Down #
An OSD daemon is either running, or stopped/down. There are three general reasons why an OSD is down:
Hard disk failure.
The OSD crashed.
The server crashed.
You can see the detailed status of OSDs by running
cephadm > ceph osd tree
# id weight type name up/down reweight
-1 0.02998 root default
-2 0.009995 host doc-ceph1
0 0.009995 osd.0 up 1
-3 0.009995 host doc-ceph2
1 0.009995 osd.1 up 1
-4 0.009995 host doc-ceph3
2 0.009995 osd.2 down 1
The example listing shows that osd.2 is down. You can then
check whether the disk where the OSD is located is mounted:
root # lsblk -f
[...]
vdb
├─vdb1 /var/lib/ceph/osd/ceph-2
└─vdb2
You can track the reason why the OSD is down by inspecting its log file
/var/log/ceph/ceph-osd.2.log. After you find and fix
the reason why the OSD is not running, start it with
root # systemctl start ceph-osd@2.service
Do not forget to replace 2 with the actual number of your
stopped OSD.
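If the OSD fails to start again, its systemd status and journal usually contain the reason as well, for example:
root # systemctl status ceph-osd@2.service
root # journalctl -u ceph-osd@2.service
As above, replace 2 with the number of the affected OSD.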
22.8 Finding Slow OSDs #
When tuning the cluster performance, it is very important to identify slow storage/OSDs within the cluster. If data is written to the slowest disk, the complete write operation slows down, because it always waits until it is finished on all the related disks.
It is not trivial to locate the storage bottleneck. You need to examine each and every OSD to find out the ones slowing down the write process. To do a benchmark on a single OSD, run:
cephadm > ceph tell osd.OSD_ID_NUMBER bench
For example:
cephadm > ceph tell osd.0 bench
{ "bytes_written": 1073741824,
"blocksize": 4194304,
"bytes_per_sec": "19377779.000000"}
Then you need to run this command on each OSD and compare the
bytes_per_sec value to get the slow(est) OSDs.
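One way to compare all OSDs is a simple shell loop over the IDs reported by ceph osd ls. This is a sketch only; run it on a node with an admin keyring, and note that each bench run writes test data and adds load to the cluster:
cephadm > for i in $(ceph osd ls); do echo "osd.$i:"; ceph tell osd.$i bench; done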
22.9 Fixing Clock Skew Warnings #
The time information in all cluster nodes must be synchronized. If a node's time is not fully synchronized, you may get clock skew warnings when checking the state of the cluster.
Time synchronization is managed with NTP (see http://en.wikipedia.org/wiki/Network_Time_Protocol). Set each node to synchronize its time with one or more NTP servers, preferably to the same group of NTP servers. If the time skew still occurs on a node, follow these steps to fix it:
root # systemctl stop ntpd.service
root # systemctl stop ceph-mon.target
root # systemctl start ntpd.service
root # systemctl start ceph-mon.target
You can then query the NTP peers and check the time offset with
sudo ntpq -p.
The Ceph monitors need to have their clocks synchronized to within 0.05 seconds of each other. Refer to Section 20.4, “Time Synchronization of Nodes” for more information.
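To see which monitors are affected and how large the offset is, you can query the cluster health and the monitors' time synchronization status, for example:
cephadm > ceph health detail
cephadm > ceph time-sync-status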
22.10 Poor Cluster Performance Caused by Network Problems #
There are many reasons why cluster performance may become weak. One of them can be network problems. In that case, you may notice the cluster failing to reach quorum, OSD and monitor nodes going offline, data transfers taking a long time, or a lot of reconnect attempts.
To check whether cluster performance is degraded by network problems,
inspect the Ceph log files under the /var/log/ceph
directory.
To fix network issues on the cluster, focus on the following points:
Basic network diagnostics. Try DeepSea's diagnostics runner
net.ping to ping between cluster nodes and see whether each interface can reach a specific interface, together with the average response time. Any specific response time much slower than the average is also reported. For example:
root@master # salt-run net.ping
Succeeded: 8 addresses from 7 minions average rtt 0.15 ms
Try validating all interfaces with jumbo frames enabled:
root@master # salt-run net.jumbo_ping
Succeeded: 8 addresses from 7 minions average rtt 0.26 ms
Network performance benchmark. Try DeepSea's network performance runner
net.iperf to test inter-node network bandwidth. On a given cluster node, a number of iperf processes (according to the number of CPU cores) are started as servers. The remaining cluster nodes are used as clients to generate network traffic. The accumulated bandwidth of all per-node iperf processes is reported. This should reflect the maximum achievable network throughput on all cluster nodes. For example:
root@master # salt-run net.iperf cluster=ceph output=full
192.168.128.1: 8644.0 Mbits/sec
192.168.128.2: 10360.0 Mbits/sec
192.168.128.3: 9336.0 Mbits/sec
192.168.128.4: 9588.56 Mbits/sec
192.168.128.5: 10187.0 Mbits/sec
192.168.128.6: 10465.0 Mbits/sec
Check firewall settings on cluster nodes. Make sure they do not block ports/protocols required by Ceph operation. See Section 20.10, “Firewall Settings for Ceph” for more information on firewall settings.
Check the networking hardware, such as network cards, cables, or switches, for proper operation.
Tip: Separate Network
To ensure fast and safe network communication between cluster nodes, set up a separate network used exclusively by the cluster OSD and monitor nodes.
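In addition to the checks above, interface-level error and drop counters can point to faulty cables or switch ports. For example, where eth0 is an example name for the interface used by the cluster network and driver support for extended statistics is assumed:
root # ip -s link show
root # ethtool -S eth0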
22.11 /var Running Out of Space #
By default, the Salt master saves every minion's return for every job in its
job cache. The cache can then be used later to look up
results for previous jobs. The cache directory defaults to
/var/cache/salt/master/jobs/.
Each job return from every minion is saved in a single file. Over time this
directory can grow very large, depending on the number of published jobs and
the value of the keep_jobs option in the
/etc/salt/master file. keep_jobs sets
the number of hours (24 by default) to keep information about past minion
jobs.
keep_jobs: 24
Important: Do Not Set keep_jobs: 0
Setting keep_jobs to '0' will cause the job cache cleaner
to never run, possibly resulting in a full partition.
If you want to disable the job cache, set job_cache to
'False':
job_cache: False
Tip: Partition Full because of Job Cache
When the partition with the job cache files gets full because of a wrong
keep_jobs setting, follow these steps to free disk space
and improve the job cache settings:
Stop the Salt master service:
root@master # systemctl stop salt-master
Change the Salt master configuration related to the job cache by editing
/etc/salt/master:
job_cache: False
keep_jobs: 1
Clear the Salt master job cache:
root # rm -rfv /var/cache/salt/master/jobs/*
Start the Salt master service:
root@master # systemctl start salt-master
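You can verify that the disk space was actually freed by checking the size of the job cache directory, for example:
root@master # du -sh /var/cache/salt/master/jobs/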
22.12 Too Many PGs Per OSD #
The TOO_MANY_PGS flag is raised when the number of PGs in
use is above the configurable threshold of mon_pg_warn_max_per_osd
PGs per OSD. If this threshold is exceeded, the cluster does not allow new pools
to be created, pool pg_num to be increased, or pool
replication to be increased.
SUSE Enterprise Storage 4 and 5.5 have two ways to solve this issue.
Procedure 22.1: Solution 1 #
Set the following in your
ceph.conf:
[global]
mon_max_pg_per_osd = 800  # depends on your number of PGs
osd max pg per osd hard ratio = 10  # the default is 2; we recommend setting it to at least 5
mon allow pool delete = true  # without it you cannot remove a pool
Restart all MONs and OSDs one by one.
Check the values on your MON and OSD, replacing ID with the respective daemon ID:
cephadm > ceph --admin-daemon /var/run/ceph/ceph-mon.ID.asok config get mon_max_pg_per_osd
cephadm > ceph --admin-daemon /var/run/ceph/ceph-osd.ID.asok config get osd_max_pg_per_osd_hard_ratio
Execute the following to determine your default
pg_num:
cephadm > rados lspools
cephadm > ceph osd pool get USER-EMAIL pg_num
With caution, execute the following commands to remove pools:
cephadm > ceph osd pool create USER-EMAIL.new 8
cephadm > rados cppool USER-EMAIL USER-EMAIL.new
cephadm > ceph osd pool delete USER-EMAIL USER-EMAIL --yes-i-really-really-mean-it
cephadm > ceph osd pool rename USER-EMAIL.new USER-EMAIL
cephadm > ceph osd pool application enable USER-EMAIL rgw
If this does not remove enough PGs per OSD and you are still receiving blocking requests, you may need to find another pool to remove.
Procedure 22.2: Solution 2 #
Create a new pool with the correct PG count:
cephadm > ceph osd pool create NEW-POOL PG-COUNT
Copy the contents of the old pool to the new pool:
cephadm > rados cppool OLD-POOL NEW-POOL
Remove the old pool:
cephadm > ceph osd pool delete OLD-POOL OLD-POOL --yes-i-really-really-mean-it
Rename the new pool:
cephadm > ceph osd pool rename NEW-POOL OLD-POOL
Restart the Object Gateway.
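After either procedure, you can verify the resulting PG count and the overall cluster health, for example:
cephadm > ceph osd pool get OLD-POOL pg_num
cephadm > ceph status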