12 Troubleshooting CephFS #
12.1 Slow or stuck operations #
If you are experiencing apparent hung operations, the first task is to identify where the problem is occurring: in the client, the MDS, or the network connecting them. Start by looking to see if either side has stuck operations and narrow it down from there.
12.2 Checking RADOS health #
If part of the CephFS metadata or data pools is unavailable and CephFS is not responding, it is probably because RADOS itself is unhealthy.
Check the cluster's status with the following command:
cephuser@adm > ceph status
Ceph will print the cluster status. Review Chapter 2, Troubleshooting logging and debugging, Chapter 6, Troubleshooting Ceph Monitors and Ceph Managers, Chapter 5, Troubleshooting placement groups (PGs), and Chapter 4, Troubleshooting OSDs for tips on what may be causing the issue.
12.3 MDS #
   If an operation is hung inside the MDS, it will eventually show up in
   ceph health, identifying “slow requests are blocked”. It
   may also identify clients as “failing to respond” or misbehaving in other
   ways. If the MDS identifies specific clients as misbehaving, we recommend
   investigating the root cause. Often it can be the result of the following:
  
- Overloading the system 
- Running an older (misbehaving) client 
- Underlying RADOS issues 
12.3.1 Identifying MDS slow requests #
You can list current operations via the admin socket by running the following command from the MDS host:
cephuser@adm > ceph daemon mds.NAME dump_ops_in_flight
Identify the stuck commands and examine why they are stuck. Usually the last event will have been an attempt to gather locks, or sending the operation off to the MDS log. If it is waiting on the OSDs, fix them. If operations are stuck on a specific inode, you probably have a client holding caps which prevent others from using it. This can be because the client is trying to flush out dirty data or because you have encountered a bug in CephFS’ distributed file lock code (the file “capabilities” [“caps”] system).
If it is a result of a bug in the capabilities code, restarting the MDS is likely to resolve the problem.
If there are no slow requests reported on the MDS, and it is not reporting that clients are misbehaving, either the client has a problem or its requests are not reaching the MDS.
12.4 Kernel mount debugging #
12.4.1 Slow requests #
Unfortunately, the kernel client does not support the admin socket, but it has similar (if limited) interfaces if your kernel has debugfs enabled. There will be a folder in /sys/kernel/debug/ceph/, and that folder contains a variety of files that produce useful output when you cat them. These files are described below; the most useful when debugging slow requests are probably the mdsc and osdc files.
   
- bdi: BDI info about the Ceph system (blocks dirtied, written, etc.)
- caps: Counts of file caps structures in-memory and used
- client_options: Dumps the options provided to the CephFS mount
- dentry_lru: Dumps the CephFS dentries currently in-memory
- mdsc: Dumps current requests to the MDS
- mdsmap: Dumps the current MDSMap epoch and MDSes
- mds_sessions: Dumps the current sessions to MDSes
- monc: Dumps the current maps from the monitor, and any subscriptions held
- monmap: Dumps the current monitor map epoch and monitors
- osdc: Dumps the current ops in-flight to OSDs (that is, file data IO)
- osdmap: Dumps the current OSDMap epoch, pools, and OSDs
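As a rough illustration of how the mdsc and osdc files can be collected in one pass, the following Python sketch walks every folder under the debugfs location and reads both files. The function name `dump_pending_requests` is hypothetical, the layout of one folder per mounted client is an assumption, and reading debugfs normally requires root privileges.

```python
from pathlib import Path

# Assumed default location; debugfs must be mounted and readable (usually root only).
DEBUGFS_CEPH = Path("/sys/kernel/debug/ceph")


def dump_pending_requests(base=DEBUGFS_CEPH, names=("mdsc", "osdc")):
    """Return {folder_name: {file_name: text}} for each folder found under base."""
    result = {}
    try:
        client_dirs = sorted(p for p in Path(base).iterdir() if p.is_dir())
    except OSError:
        # debugfs not mounted, no kernel CephFS mount, or insufficient privileges.
        return result
    for client_dir in client_dirs:
        files = {}
        for name in names:
            try:
                files[name] = (client_dir / name).read_text()
            except OSError:
                # File missing or unreadable; skip it and keep going.
                continue
        result[client_dir.name] = files
    return result


if __name__ == "__main__":
    for client, files in dump_pending_requests().items():
        for name, text in files.items():
            print(f"== {client}/{name} ==")
            print(text if text.strip() else "(empty)")
```

An empty mdsc or osdc file suggests that the client has no requests outstanding to the MDS or the OSDs at that moment, which points the investigation elsewhere.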
12.5 Disconnecting and remounting the file system #
   Because CephFS has a consistent cache, if your network connection is
   disrupted for a long enough time the client will be forcibly disconnected
   from the system. At this point, the kernel client is in a bind: it cannot
   safely write back dirty data, and many applications do not handle IO errors
   correctly on close(). At the moment, the kernel client
   will remount the FS, but outstanding filesystem IO may or may not be
   satisfied. In these cases, you may need to reboot your client system.
  
   You can tell that you are in this situation if dmesg or kern.log
   reports something like:
  
Jul 20 08:14:38 teuthology kernel: [3677601.123718] ceph: mds0 closed our session
Jul 20 08:14:38 teuthology kernel: [3677601.128019] ceph: mds0 reconnect start
Jul 20 08:14:39 teuthology kernel: [3677602.093378] ceph: mds0 reconnect denied
Jul 20 08:14:39 teuthology kernel: [3677602.098525] ceph: dropping dirty+flushing Fw state for ffff8802dc150518 1099935956631
Jul 20 08:14:39 teuthology kernel: [3677602.107145] ceph: dropping dirty+flushing Fw state for ffff8801008e8518 1099935946707
Jul 20 08:14:39 teuthology kernel: [3677602.196747] libceph: mds0 172.21.5.114:6812 socket closed (con state OPEN)
Jul 20 08:14:40 teuthology kernel: [3677603.126214] libceph: mds0 172.21.5.114:6812 connection reset
Jul 20 08:14:40 teuthology kernel: [3677603.132176] libceph: reset on mds0
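A saved copy of the kernel log can be checked for this pattern automatically. The Python sketch below is only an illustration: the function name `summarize_eviction` is hypothetical, and the regular expressions are derived solely from the sample messages above, so other kernel versions may word them differently.

```python
import re

# Three lines copied from the sample log above, used as demonstration input.
SAMPLE_LOG = """\
Jul 20 08:14:38 teuthology kernel: [3677601.123718] ceph: mds0 closed our session
Jul 20 08:14:39 teuthology kernel: [3677602.093378] ceph: mds0 reconnect denied
Jul 20 08:14:39 teuthology kernel: [3677602.098525] ceph: dropping dirty+flushing Fw state for ffff8802dc150518 1099935956631
"""


def summarize_eviction(log_text):
    """Summarize signs that an MDS closed the session and refused the reconnect."""
    closed = re.findall(r"ceph: (mds\d+) closed our session", log_text)
    denied = re.findall(r"ceph: (mds\d+) reconnect denied", log_text)
    dropped = re.findall(r"ceph: dropping dirty\+flushing \w+ state for", log_text)
    return {
        "session_closed": sorted(set(closed)),
        "reconnect_denied": sorted(set(denied)),
        # Each message reports dirty state that was discarded, not written back.
        "dropped_dirty_messages": len(dropped),
    }


if __name__ == "__main__":
    print(summarize_eviction(SAMPLE_LOG))
```

A non-empty `reconnect_denied` list together with a non-zero count of dropped dirty state is the combination that matches the situation described in this section.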
This is an area of ongoing work to improve the behavior. Kernels will soon be reliably issuing error codes to in-progress IO, although your application(s) may not deal with them well. In the longer-term, we hope to allow reconnect and reclaim of data in cases where it will not violate POSIX semantics.
12.6 Mounting #
12.6.1 Mount I/O error #
    A mount 5 error (EIO, I/O error) typically occurs if an
    MDS server is laggy or has crashed. Ensure at least one MDS is up and
    running, and the cluster is active + healthy.
   
12.6.2 Mount out of memory error #
    A mount 12 error (ENOMEM, out of memory) with the message
    “cannot allocate memory” usually occurs if you have a
    version mismatch between the Ceph Client version and the Ceph Storage
    Cluster version. Check the versions using:
   
ceph -v
If the Ceph Client is behind the Ceph cluster, try to upgrade it:
sudo zypper up
sudo zypper in ceph-common
    You may need to uninstall, autoclean and autoremove
    ceph-common and then reinstall it so that you have the
    latest version.
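The numbers in the two mount errors above are standard errno values, so they can be translated into their symbolic names with the Python standard library. This is a small sketch; the helper name `describe_mount_error` is hypothetical, and the exact message text returned by `os.strerror` depends on the platform.

```python
import errno
import os


def describe_mount_error(code):
    """Return (symbolic name, platform message) for a numeric errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # 5 and 12 are the codes discussed in the two subsections above.
    for code in (5, 12):
        name, message = describe_mount_error(code)
        print(f"mount error {code} = {name}: {message}")
```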
   
12.7 Mounting CephFS using old kernel clients #
The kernel since SUSE Linux Enterprise Server 15 SP2 includes a CephFS client that is able to take full advantage of all the features available on an SES7 cluster. All relevant features and bug fixes are backported to this operating system.
However, it may be necessary to access CephFS from other systems that provide an older CephFS client, which may not support all the features required by a SUSE Enterprise Storage 7.1 cluster. When this happens, the kernel client will fail to mount the file system and will emit messages similar to the ones shown below:
[ 4187.023633] libceph: mon0 192.168.122.150:6789 feature set mismatch, my 107b84a842aca < server's 40107b84a842aca, missing 400000000000000
[ 4187.023838] libceph: mon0 192.168.122.150:6789 missing required protocol features
The message above means that the MON identified 0x400000000000000 as the missing feature in the client (the value 0x107b84a842aca represents all the features supported by the client, while 0x40107b84a842aca represents the minimum set of features required by the cluster). From the following table, which shows the complete list of feature bits, we can see that the missing feature bit 58 (2^58 = 0x400000000000000) is CRUSH_TUNABLES5, NEW_OSDOPREPLY_ENCODING, or FS_FILE_LAYOUT_V2 (all three features share the same feature bit).
| Feature | Bit | Value | 
|---|---|---|
| UID | 0 | 0x1 | 
| NOSRCADDR | 1 | 0x2 | 
| FLOCK | 3 | 0x8 | 
| SUBSCRIBE2 | 4 | 0x10 | 
| MONNAMES | 5 | 0x20 | 
| RECONNECT_SEQ | 6 | 0x40 | 
| DIRLAYOUTHASH | 7 | 0x80 | 
| OBJECTLOCATOR | 8 | 0x100 | 
| PGID64 | 9 | 0x200 | 
| INCSUBOSDMAP | 10 | 0x400 | 
| PGPOOL3 | 11 | 0x800 | 
| OSDREPLYMUX | 12 | 0x1000 | 
| OSDENC | 13 | 0x2000 | 
| SERVER_KRAKEN | 14 | 0x4000 | 
| MONENC | 15 | 0x8000 | 
| CRUSH_TUNABLES | 18 | 0x40000 | 
| SERVER_LUMINOUS | 21 | 0x200000 | 
| RESEND_ON_SPLIT | 21 | 0x200000 | 
| RADOS_BACKOFF | 21 | 0x200000 | 
| OSDMAP_PG_UPMAP | 21 | 0x200000 | 
| CRUSH_CHOOSE_ARGS | 21 | 0x200000 | 
| MSG_AUTH | 23 | 0x800000 | 
| CRUSH_TUNABLES2 | 25 | 0x2000000 | 
| CREATEPOOLID | 26 | 0x4000000 | 
| REPLY_CREATE_INODE | 27 | 0x8000000 | 
| SERVER_M | 28 | 0x10000000 | 
| MDSENC | 29 | 0x20000000 | 
| OSDHASHPSPOOL | 30 | 0x40000000 | 
| MON_SINGLE_PAXOS | 31 | 0x80000000 | 
| OSD_CACHEPOOL | 35 | 0x800000000 | 
| CRUSH_V2 | 36 | 0x1000000000 | 
| EXPORT_PEER | 37 | 0x2000000000 | 
| OSD_ERASURE_CODES | 38 | 0x4000000000 | 
| OSD_OSD_TMAP2OMAP | 38 | 0x4000000000 | 
| OSDMAP_ENC | 39 | 0x8000000000 | 
| MDS_INLINE_DATA | 40 | 0x10000000000 | 
| CRUSH_TUNABLES3 | 41 | 0x20000000000 | 
| OSD_PRIMARY_AFFINITY | 41 | 0x20000000000 | 
| MSGR_KEEPALIVE2 | 42 | 0x40000000000 | 
| OSD_POOLRESEND | 43 | 0x80000000000 | 
| ERASURE_CODE_PLUGINS_V2 | 44 | 0x100000000000 | 
| OSD_FADVISE_FLAGS | 46 | 0x400000000000 | 
| MDS_QUOTA | 47 | 0x800000000000 | 
| CRUSH_V4 | 48 | 0x1000000000000 | 
| MON_METADATA | 50 | 0x4000000000000 | 
| OSD_BITWISE_HOBJ_SORT | 51 | 0x8000000000000 | 
| OSD_PROXY_WRITE_FEATURES | 52 | 0x10000000000000 | 
| ERASURE_CODE_PLUGINS_V3 | 53 | 0x20000000000000 | 
| OSD_HITSET_GMT | 54 | 0x40000000000000 | 
| HAMMER_0_94_4 | 55 | 0x80000000000000 | 
| NEW_OSDOP_ENCODING | 56 | 0x100000000000000 | 
| MON_STATEFUL_SUB | 57 | 0x200000000000000 | 
| MON_ROUTE_OSDMAP | 57 | 0x200000000000000 | 
| OSDSUBOP_NO_SNAPCONTEXT | 57 | 0x200000000000000 | 
| SERVER_JEWEL | 57 | 0x200000000000000 | 
| CRUSH_TUNABLES5 | 58 | 0x400000000000000 | 
| NEW_OSDOPREPLY_ENCODING | 58 | 0x400000000000000 | 
| FS_FILE_LAYOUT_V2 | 58 | 0x400000000000000 | 
| FS_BTIME | 59 | 0x800000000000000 | 
| FS_CHANGE_ATTR | 59 | 0x800000000000000 | 
| MSG_ADDR2 | 59 | 0x800000000000000 | 
| OSD_RECOVERY_DELETES | 60 | 0x1000000000000000 | 
| CEPHX_V2 | 61 | 0x2000000000000000 | 
| RESERVED | 62 | 0x4000000000000000 | 
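The lookup described above can be sketched in Python: extract the hexadecimal masks from the "feature set mismatch" message, compute which bits the server requires but the client lacks, and map each bit to its names. The function names are hypothetical, and `FEATURE_NAMES` contains only a subset of the table, so extend it as needed.

```python
import re

# Subset of the feature table above, keyed by bit number.
FEATURE_NAMES = {
    18: ["CRUSH_TUNABLES"],
    25: ["CRUSH_TUNABLES2"],
    41: ["CRUSH_TUNABLES3", "OSD_PRIMARY_AFFINITY"],
    48: ["CRUSH_V4"],
    58: ["CRUSH_TUNABLES5", "NEW_OSDOPREPLY_ENCODING", "FS_FILE_LAYOUT_V2"],
    59: ["FS_BTIME", "FS_CHANGE_ATTR", "MSG_ADDR2"],
    60: ["OSD_RECOVERY_DELETES"],
    61: ["CEPHX_V2"],
}

MISMATCH_RE = re.compile(
    r"feature set mismatch, my ([0-9a-f]+) < server's ([0-9a-f]+), missing ([0-9a-f]+)"
)


def missing_feature_bits(log_line):
    """Return the bit numbers set in the server mask but not in the client mask."""
    match = MISMATCH_RE.search(log_line)
    if match is None:
        return []
    client, server, _logged_missing = (int(group, 16) for group in match.groups())
    # Recomputed from the two masks; it should agree with the logged "missing" value.
    missing = server & ~client
    return [bit for bit in range(64) if (missing >> bit) & 1]


def describe_missing(log_line):
    """Map each missing bit to the feature names known for it."""
    return {bit: FEATURE_NAMES.get(bit, ["(not in subset)"])
            for bit in missing_feature_bits(log_line)}


if __name__ == "__main__":
    line = ("libceph: mon0 192.168.122.150:6789 feature set mismatch, "
            "my 107b84a842aca < server's 40107b84a842aca, missing 400000000000000")
    print(describe_missing(line))
```

For the message shown earlier, the only missing bit is 58, which matches the manual lookup in the table.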
   A possible solution to allow an old kernel client to mount a recent CephFS
   is to modify the cluster CRUSH profile. CRUSH profiles define a set of CRUSH
   tunables that are named after the Ceph versions in which they were
   introduced. For example, the firefly tunables are first
   supported in the Firefly release (0.80), and older clients will not be able
   to access the cluster. Thus, to fix the problem shown above, the following
   command can be used:
  
cephuser@adm > ceph osd crush tunables hammer
This will adjust the CRUSH profile to the behavior it had for the Hammer (0.94) release. Note however that this is not the optimal behavior for the cluster. To change back to the optimal profile, run the following command:
cephuser@adm > ceph osd crush tunables optimal
The following table lists the available CRUSH profiles and which CRUSH tunables versions (the CRUSH_TUNABLE feature bits in the previous table) they correspond to. It also identifies the minimum kernel version required for each profile. Note however that operating system vendors may choose to backport features to their kernels, so these kernel versions are valid for mainline kernels only. The kernel client included since SUSE Linux Enterprise Server 15 SP2, for example, includes backports of features and bug fixes relevant for usage in SUSE Enterprise Storage 7.1 clusters.
| CRUSH Profile | Ceph Release | CRUSH Tunable | Minimum Kernel Version | 
|---|---|---|---|
| argonaut | 0.48 | CRUSH_TUNABLES | 3.6 | 
| bobtail | 0.56 | CRUSH_TUNABLES2 | 3.9 | 
| firefly | 0.80 | CRUSH_TUNABLES3 | 3.15 | 
| hammer | 0.94 | CRUSH_V4 | 4.1 | 
| jewel | 10.2.0 | CRUSH_TUNABLES5 | 4.5 |
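To pick a profile for a given client, the table can be encoded as data and searched for the newest entry whose minimum kernel version the client meets. The Python sketch below assumes a mainline kernel; as noted above, vendor kernels with backports may support newer profiles than their version number suggests. The function name `newest_supported_profile` is hypothetical.

```python
import re

# (profile, minimum mainline kernel version) pairs from the table above, oldest first.
CRUSH_PROFILES = [
    ("argonaut", (3, 6)),
    ("bobtail", (3, 9)),
    ("firefly", (3, 15)),
    ("hammer", (4, 1)),
    ("jewel", (4, 5)),
]


def newest_supported_profile(kernel_release):
    """Return the newest profile a mainline kernel supports, or None if none apply.

    kernel_release is a string such as the output of `uname -r`.
    """
    match = re.match(r"(\d+)\.(\d+)", kernel_release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {kernel_release!r}")
    version = (int(match.group(1)), int(match.group(2)))
    supported = [name for name, minimum in CRUSH_PROFILES if version >= minimum]
    return supported[-1] if supported else None


if __name__ == "__main__":
    for release in ("3.10.0", "4.4.0", "5.3.18-default"):
        print(release, "->", newest_supported_profile(release))
```

Versions are compared as (major, minor) tuples rather than as strings, so that 3.15 correctly sorts after 3.9.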