4 Determining Cluster State #
When you have a running cluster, you may use the ceph tool
to monitor it. Determining the cluster state typically involves checking the
status of Ceph OSDs, Ceph Monitors, placement groups and Metadata Servers.
Tip: Interactive Mode
To run the ceph tool in an interactive mode, type
ceph at the command line with no arguments. The
interactive mode is more convenient if you intend to enter several
ceph commands in a row. For example:
cephadm > ceph
ceph> health
ceph> status
ceph> quorum_status
ceph> mon_status
4.1 Checking a Cluster's Status #
To check a cluster's status, execute the following:
cephadm > ceph status
or
cephadm > ceph -s
In interactive mode, type status and press
Enter.
ceph> status
Ceph will print the cluster status. For example, a tiny Ceph cluster consisting of one monitor and two OSDs may print the following:
cluster b370a29d-9287-4ca3-ab57-3d824f65e339
health HEALTH_OK
monmap e1: 1 mons at {ceph1=10.0.0.8:6789/0}, election epoch 2, quorum 0 ceph1
osdmap e63: 2 osds: 2 up, 2 in
pgmap v41332: 952 pgs, 20 pools, 17130 MB data, 2199 objects
115 GB used, 167 GB / 297 GB avail
1 active+clean+scrubbing+deep
951 active+clean
4.2 Checking Cluster Health #
After you start your cluster and before you start reading and/or writing data, check your cluster's health:
cephadm > ceph health
HEALTH_WARN 10 pgs degraded; 100 pgs stuck unclean; 1 mons down, quorum 0,2 \
node-1,node-2,node-3
Tip
If you specified non-default locations for your configuration or keyring, you may specify their locations:
cephadm > ceph -c /path/to/conf -k /path/to/keyring health
The Ceph cluster returns one of the following health codes:
- OSD_DOWN
One or more OSDs are marked down. The OSD daemon may have been stopped, or peer OSDs may be unable to reach the OSD over the network. Common causes include a stopped or crashed daemon, a down host, or a network outage.
Verify that the host is healthy, the daemon is started, and the network is functioning. If the daemon has crashed, the daemon log file (
/var/log/ceph/ceph-osd.*) may contain debugging information.
- OSD_crush type_DOWN, for example OSD_HOST_DOWN
All the OSDs within a particular CRUSH subtree are marked down, for example all OSDs on a host.
- OSD_ORPHAN
An OSD is referenced in the CRUSH map hierarchy but does not exist. The OSD can be removed from the CRUSH hierarchy with:
cephadm > ceph osd crush rm osd.ID
- OSD_OUT_OF_ORDER_FULL
The usage thresholds for nearfull (defaults to 0.85), backfillfull (defaults to 0.90), full (defaults to 0.95), and/or failsafe_full are not ascending. In particular, we expect nearfull < backfillfull, backfillfull < full, and full < failsafe_full.
To read the current values, run:
cephadm > ceph health detail
HEALTH_ERR 1 full osd(s); 1 backfillfull osd(s); 1 nearfull osd(s)
osd.3 is full at 97%
osd.4 is backfill full at 91%
osd.2 is near full at 87%
The thresholds can be adjusted with the following commands:
cephadm > ceph osd set-backfillfull-ratio ratio
cephadm > ceph osd set-nearfull-ratio ratio
cephadm > ceph osd set-full-ratio ratio
- OSD_FULL
One or more OSDs has exceeded the full threshold and is preventing the cluster from servicing writes. Usage by pool can be checked with:
cephadm > ceph df
The currently defined full ratio can be seen with:
cephadm > ceph osd dump | grep full_ratio
A short-term workaround to restore write availability is to raise the full threshold by a small amount:
cephadm > ceph osd set-full-ratio ratio
Add new storage to the cluster by deploying more OSDs, or delete existing data in order to free up space.
- OSD_BACKFILLFULL
One or more OSDs has exceeded the backfillfull threshold, which prevents data from rebalancing to this device. This is an early warning that rebalancing may not be able to complete and that the cluster is approaching full. Usage by pool can be checked with:
cephadm > ceph df
- OSD_NEARFULL
One or more OSDs has exceeded the nearfull threshold. This is an early warning that the cluster is approaching full. Usage by pool can be checked with:
cephadm > ceph df
- OSDMAP_FLAGS
One or more cluster flags of interest has been set. With the exception of full, these flags can be set or cleared with:
cephadm > ceph osd set flag
cephadm > ceph osd unset flag
These flags include the following (a typical use of these commands is sketched in the example after this list of health codes):
- full
The cluster is flagged as full and cannot service writes.
- pauserd, pausewr
Paused reads or writes.
- noup
OSDs are not allowed to start.
- nodown
OSD failure reports are being ignored, such that the monitors will not mark OSDs down.
- noin
OSDs that were previously marked out will not be marked back in when they start.
- noout
Down OSDs will not automatically be marked out after the configured interval.
- nobackfill, norecover, norebalance
Recovery or data rebalancing is suspended.
- noscrub, nodeep-scrub
Scrubbing (see Section 7.5, “Scrubbing”) is disabled.
- notieragent
Cache tiering activity is suspended.
- OSD_FLAGS
One or more OSDs has a per-OSD flag of interest set. These flags include:
- noup
OSD is not allowed to start.
- nodown
Failure reports for this OSD will be ignored.
- noin
If this OSD was previously marked out automatically after a failure, it will not be marked in when it starts.
- noout
If this OSD is down, it will not be automatically marked out after the configured interval.
Per-OSD flags can be set and cleared with:
cephadm > ceph osd add-flag osd-ID
cephadm > ceph osd rm-flag osd-ID
- OLD_CRUSH_TUNABLES
The CRUSH Map is using very old settings and should be updated. The oldest tunables that can be used (that is the oldest client version that can connect to the cluster) without triggering this health warning is determined by the
mon_crush_min_required_version configuration option.
- OLD_CRUSH_STRAW_CALC_VERSION
The CRUSH Map is using an older, non-optimal method for calculating intermediate weight values for straw buckets. The CRUSH Map should be updated to use the newer method (
straw_calc_version=1).
- CACHE_POOL_NO_HIT_SET
One or more cache pools is not configured with a hit set to track usage, which prevents the tiering agent from identifying cold objects to flush and evict from the cache. Hit sets can be configured on the cache pool with:
cephadm > ceph osd pool set poolname hit_set_type type
cephadm > ceph osd pool set poolname hit_set_period period-in-seconds
cephadm > ceph osd pool set poolname hit_set_count number-of-hitsets
cephadm > ceph osd pool set poolname hit_set_fpp target-false-positive-rate
For more information on cache tiering, see Chapter 11, Cache Tiering.
- OSD_NO_SORTBITWISE
No pre-luminous v12 OSDs are running but the
sortbitwise flag has not been set. You need to set the sortbitwise flag before luminous v12 or newer OSDs can start:
cephadm > ceph osd set sortbitwise
- POOL_FULL
One or more pools has reached its quota and is no longer allowing writes. You can check pool quotas and usage with:
cephadm > ceph df detail
You can either raise the pool quota with
cephadm > ceph osd pool set-quota poolname max_objects num-objects
cephadm > ceph osd pool set-quota poolname max_bytes num-bytes
or delete some existing data to reduce usage.
- PG_AVAILABILITY
Data availability is reduced, meaning that the cluster is unable to service potential read or write requests for some data in the cluster. Specifically, one or more PGs is in a state that does not allow IO requests to be serviced. Problematic PG states include peering, stale, incomplete, and the lack of active (if those conditions do not clear quickly). Detailed information about which PGs are affected is available from:
cephadm > ceph health detail
In most cases the root cause is that one or more OSDs is currently down. The state of specific problematic PGs can be queried with:
cephadm > ceph tell pgid query
- PG_DEGRADED
Data redundancy is reduced for some data, meaning the cluster does not have the desired number of replicas for all data (for replicated pools) or erasure code fragments (for erasure coded pools). Specifically, one or more PGs have either the degraded or undersized flag set (there are not enough instances of that placement group in the cluster), or have not had the clean flag set for some time. Detailed information about which PGs are affected is available from:
cephadm > ceph health detail
In most cases the root cause is that one or more OSDs is currently down. The state of specific problematic PGs can be queried with:
cephadm > ceph tell pgid query
- PG_DEGRADED_FULL
Data redundancy may be reduced or at risk for some data because of a lack of free space in the cluster. Specifically, one or more PGs has the backfill_toofull or recovery_toofull flag set, meaning that the cluster is unable to migrate or recover data because one or more OSDs is above the backfillfull threshold.
- PG_DAMAGED
Data scrubbing (see Section 7.5, “Scrubbing”) has discovered some problems with data consistency in the cluster. Specifically, one or more PGs has the inconsistent or snaptrim_error flag set, indicating that an earlier scrub operation found a problem, or has the repair flag set, meaning a repair for such an inconsistency is currently in progress.
- OSD_SCRUB_ERRORS
Recent OSD scrubs have uncovered inconsistencies.
- CACHE_POOL_NEAR_FULL
A cache tier pool is nearly full. Full in this context is determined by the target_max_bytes and target_max_objects properties on the cache pool. When the pool reaches the target threshold, write requests to the pool may block while data is flushed and evicted from the cache, a state that normally leads to very high latencies and poor performance. The cache pool target size can be adjusted with:
cephadm > ceph osd pool set cache-pool-name target_max_bytes bytes
cephadm > ceph osd pool set cache-pool-name target_max_objects objects
Normal cache flush and evict activity may also be throttled because of reduced availability or performance of the base tier, or overall cluster load.
Find more information about cache tiering in Chapter 11, Cache Tiering.
- TOO_FEW_PGS
The number of PGs in use is below the configurable threshold of
mon_pg_warn_min_per_osd PGs per OSD. This can lead to suboptimal distribution and balance of data across the OSDs in the cluster, and reduce overall performance.
See Placement Groups for details on calculating an appropriate number of placement groups for your pool.
- TOO_MANY_PGS
The number of PGs in use is above the configurable threshold of
mon_pg_warn_max_per_osd PGs per OSD. This can lead to higher memory usage for OSD daemons, slower peering after cluster state changes (for example OSD restarts, additions, or removals), and higher load on the Ceph Managers and Ceph Monitors.
While the pg_num value for existing pools cannot be reduced, the pgp_num value can. This effectively collocates some PGs on the same sets of OSDs, mitigating some of the negative impacts described above. The pgp_num value can be adjusted with:
cephadm > ceph osd pool set pool pgp_num value
- SMALLER_PGP_NUM
One or more pools has a
pgp_num value less than pg_num. This is normally an indication that the PG count was increased without also increasing the placement behavior. This is normally resolved by setting pgp_num to match pg_num, triggering the data migration, with:
cephadm > ceph osd pool set pool pgp_num pg_num_value
- MANY_OBJECTS_PER_PG
One or more pools have an average number of objects per PG that is significantly higher than the overall cluster average. The specific threshold is controlled by the
mon_pg_warn_max_object_skew configuration value. This is usually an indication that the pool(s) containing most of the data in the cluster have too few PGs, and/or that other pools that do not contain as much data have too many PGs. The threshold can be raised to silence the health warning by adjusting the mon_pg_warn_max_object_skew configuration option on the monitors.
- POOL_APP_NOT_ENABLED
A pool exists that contains one or more objects but has not been tagged for use by a particular application. Resolve this warning by labeling the pool for use by an application. For example, if the pool is used by RBD:
cephadm > rbd pool init pool_name
If the pool is being used by a custom application 'foo', you can also label it using the low-level command:
cephadm > ceph osd pool application enable pool_name foo
- POOL_FULL
One or more pools has reached (or is very close to reaching) its quota. The threshold to trigger this error condition is controlled by the
mon_pool_quota_crit_threshold configuration option. Pool quotas can be adjusted up or down (or removed) with:
cephadm > ceph osd pool set-quota pool max_bytes bytes
cephadm > ceph osd pool set-quota pool max_objects objects
Setting the quota value to 0 will disable the quota.
- POOL_NEAR_FULL
One or more pools are approaching their quota. The threshold to trigger this warning condition is controlled by the
mon_pool_quota_warn_threshold configuration option. Pool quotas can be adjusted up or down (or removed) with:
cephadm > ceph osd pool set-quota pool max_bytes bytes
cephadm > ceph osd pool set-quota pool max_objects objects
Setting the quota value to 0 will disable the quota.
- OBJECT_MISPLACED
One or more objects in the cluster are not stored on the node where the cluster wants them. This is an indication that data migration caused by a recent cluster change has not yet completed. Misplaced data is not a dangerous condition in itself. Data consistency is never at risk, and old copies of objects are never removed until the desired number of new copies (in the desired locations) are present.
- OBJECT_UNFOUND
One or more objects in the cluster cannot be found. Specifically, the OSDs know that a new or updated copy of an object should exist, but a copy of that version of the object has not been found on OSDs that are currently online. Read or write requests to the 'unfound' objects will be blocked. Ideally, a down OSD can be brought back online that has the more recent copy of the unfound object. Candidate OSDs can be identified from the peering state for the PG(s) responsible for the unfound object:
cephadm > ceph tell pgid query
- REQUEST_SLOW
One or more OSD requests is taking a long time to process. This can be an indication of extreme load, a slow storage device, or a software bug. You can query the request queue on the OSD(s) in question with the following command executed from the OSD host:
cephadm > ceph daemon osd.id ops
You can see a summary of the slowest recent requests:
cephadm > ceph daemon osd.id dump_historic_ops
You can find the location of an OSD with:
cephadm > ceph osd find osd.id
- REQUEST_STUCK
One or more OSD requests have been blocked for an extended time, for example 4096 seconds. This is an indication that either the cluster has been unhealthy for an extended period of time (for example, not enough running OSDs or inactive PGs) or there is some internal problem with the OSD.
- PG_NOT_SCRUBBED
One or more PGs have not been scrubbed (see Section 7.5, “Scrubbing”) recently. PGs are normally scrubbed every
mon_scrub_interval seconds, and this warning triggers when mon_warn_not_scrubbed such intervals have elapsed without a scrub. PGs will not scrub if they are not flagged as clean, which may happen if they are misplaced or degraded (see PG_AVAILABILITY and PG_DEGRADED above). You can manually initiate a scrub of a clean PG with:
cephadm > ceph pg scrub pgid
- PG_NOT_DEEP_SCRUBBED
One or more PGs has not been deep scrubbed (see Section 7.5, “Scrubbing”) recently. PGs are normally deep scrubbed every
osd_deep_scrub_interval seconds, and this warning triggers when mon_warn_not_deep_scrubbed seconds have elapsed without a deep scrub. PGs will not (deep) scrub if they are not flagged as clean, which may happen if they are misplaced or degraded (see PG_AVAILABILITY and PG_DEGRADED above). You can manually initiate a deep scrub of a clean PG with:
cephadm > ceph pg deep-scrub pgid
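A typical use of the cluster flags described under OSDMAP_FLAGS above is to set noout before planned maintenance, so that OSDs that go down are not marked out and no rebalancing starts, and to clear the flag afterward. The following is a minimal sketch of that pattern; verify that the flag is listed in the cluster status while it is set:
cephadm > ceph osd set noout
cephadm > ceph -s
cephadm > ceph osd unset noout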
4.3 Watching a Cluster #
You can find the immediate state of the cluster using ceph
-s. For example, a small Ceph cluster consisting of three monitors
and four OSDs may print the following when a workload is running:
cephadm > ceph -s
cluster:
id: ea4cf6ce-80c6-3583-bb5e-95fa303c893f
health: HEALTH_WARN
too many PGs per OSD (408 > max 300)
services:
mon: 3 daemons, quorum ses5min1,ses5min3,ses5min2
mgr: ses5min1(active), standbys: ses5min3, ses5min2
mds: cephfs-1/1/1 up {0=ses5min3=up:active}
osd: 4 osds: 4 up, 4 in
rgw: 1 daemon active
data:
pools: 8 pools, 544 pgs
objects: 253 objects, 3821 bytes
usage: 6252 MB used, 13823 MB / 20075 MB avail
pgs:     544 active+clean
The output provides the following information:
Cluster ID
Cluster health status
The monitor map epoch and the status of the monitor quorum
The OSD map epoch and the status of OSDs
The status of Ceph Managers
The status of Object Gateways
The placement group map version
The number of placement groups and pools
The notional amount of data stored and the number of objects stored
The total amount of data stored
Tip: How Ceph Calculates Data Usage
The used value reflects the actual amount of raw storage
used. The xxx GB / xxx GB value shows the amount of storage
available (the smaller number) out of the overall storage capacity of the
cluster. The notional number reflects the size of the stored data before it
is replicated, cloned or snapshotted. Therefore, the amount of data actually
stored typically exceeds the notional amount stored, because Ceph creates
replicas of the data and may also use storage capacity for cloning and
snapshotting.
Other commands that display immediate status information are:
ceph pg stat
ceph osd pool stats
ceph df
ceph df detail
To get the information updated in real time, put any of these commands
(including ceph -s) as an argument of the
watch command:
root # watch -n 10 'ceph -s'
Press Ctrl-C when you are tired of watching.
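If you prefer to follow cluster events as they happen rather than poll the status, you can also run ceph -w, which prints the current status once and then streams new cluster log messages until you press Ctrl-C:
cephadm > ceph -w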
4.4 Checking a Cluster's Usage Stats #
To check a cluster’s data usage and distribution among pools, use the
ceph df command. To get more details, use ceph
df detail.
cephadm > ceph df
GLOBAL:
SIZE AVAIL RAW USED %RAW USED
65886G 45826G 7731M 16
POOLS:
NAME ID USED %USED MAX AVAIL OBJECTS
data 1 1726M 10 17676G 1629
rbd 4 5897M 27 22365G 3547
ecpool 6 69M 0.2 35352G 31
[...]
The GLOBAL section of the output provides an overview of
the amount of storage your cluster uses for your data.
SIZE: The overall storage capacity of the cluster.
AVAIL: The amount of free space available in the cluster.
RAW USED: The amount of raw storage used.
% RAW USED: The percentage of raw storage used. Use this number in conjunction with the full ratio and near full ratio to ensure that you are not reaching your cluster’s capacity. See Storage Capacity for additional details.
Note: Cluster Fill Level
When a raw storage fill level is getting close to 100%, you need to add new storage to the cluster. A higher usage may lead to single full OSDs and cluster health problems.
Use the command
ceph osd df tree to list the fill level of all OSDs.
The POOLS section of the output provides a list of pools
and the notional usage of each pool. The output from this section
does not reflect replicas, clones or snapshots. For
example, if you store an object with 1MB of data, the notional usage will be
1MB, but the actual usage may be 2MB or more depending on the number of
replicas, clones and snapshots.
NAME: The name of the pool.
ID: The pool ID.
USED: The notional amount of data stored in kilobytes, unless the number appends M for megabytes or G for gigabytes.
%USED: The notional percentage of storage used per pool.
MAX AVAIL: The maximum available space in the given pool.
OBJECTS: The notional number of objects stored per pool.
Note
The numbers in the POOLS section are notional. They do not include replicas, snapshots or clones. As a result, the sum of the USED and %USED amounts will not add up to the RAW USED and %RAW USED amounts in the GLOBAL section of the output.
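To see how the notional numbers relate to raw usage for a particular replicated pool, check the pool's replication factor (the pool name rbd is only an example):
cephadm > ceph osd pool get rbd size
If the reported size is 3, each megabyte of notional USED data occupies roughly three megabytes of raw storage, ignoring journaling and file system overhead.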
4.5 Checking OSD Status #
You can check OSDs to ensure that they are up and in by executing:
cephadm > ceph osd stat
or
cephadm > ceph osd dump
You can also view OSDs according to their position in the CRUSH map.
cephadm > ceph osd tree
Ceph will print a CRUSH tree with a host, its OSDs, whether they are up and their weight.
# id    weight  type name       up/down reweight
-1      3       pool default
-3      3       rack mainrack
-2      3       host osd-host
0       1       osd.0           up      1
1       1       osd.1           up      1
2       1       osd.2           up      1
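To see additional details about a single OSD, such as the host it runs on, you can query its metadata (the OSD ID 0 is only an example):
cephadm > ceph osd metadata 0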
4.6 Checking for Full OSDs #
Ceph prevents you from writing to a full OSD so that you do not lose data.
In an operational cluster, you should receive a warning when your cluster is
getting near its full ratio. The mon osd full ratio
defaults to 0.95, or 95% of capacity before it stops clients from writing
data. The mon osd nearfull ratio defaults to 0.85, or 85%
of capacity, when it generates a health warning.
Full OSD nodes will be reported by ceph health:
cephadm > ceph health
HEALTH_WARN 1 nearfull osds
osd.2 is near full at 85%
or
cephadm > ceph health
HEALTH_ERR 1 nearfull osds, 1 full osds
osd.2 is near full at 85%
osd.3 is full at 97%
The best way to deal with a full cluster is to add new OSD hosts/disks, allowing the cluster to redistribute data to the newly available storage.
Tip: Preventing Full OSDs
After an OSD uses 100% of its disk space and becomes full, it will normally crash quickly without warning. The following are a few tips to remember when administering OSD nodes.
Each OSD's disk space (usually mounted under
/var/lib/ceph/osd/osd-{1,2..}) needs to be placed on a dedicated underlying disk or partition.
Check the Ceph configuration files and make sure that Ceph does not store its log file to the disks/partitions dedicated for use by OSDs.
Make sure that no other process writes to the disks/partitions dedicated for use by OSDs.
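To spot OSDs that are approaching the nearfull or full ratios before they become a problem, list the per-OSD utilization together with the currently configured ratios. This is a suggested routine check rather than part of the health output above:
cephadm > ceph osd df
cephadm > ceph osd dump | grep full_ratio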
4.7 Checking Monitor Status #
After you start the cluster and before first reading and/or writing data, check the Ceph Monitors quorum status. When the cluster is already serving requests, check the Ceph Monitors status periodically to ensure that they are running.
To display the monitor map, execute the following:
cephadm > ceph mon stat
or
cephadm > ceph mon dump
To check the quorum status for the monitor cluster, execute the following:
cephadm > ceph quorum_status
Ceph will return the quorum status. For example, a Ceph cluster consisting of three monitors may return the following:
{ "election_epoch": 10,
"quorum": [
0,
1,
2],
"monmap": { "epoch": 1,
"fsid": "444b489c-4f16-4b75-83f0-cb8097468898",
"modified": "2011-12-12 13:28:27.505520",
"created": "2011-12-12 13:28:27.505520",
"mons": [
{ "rank": 0,
"name": "a",
"addr": "192.168.1.10:6789\/0"},
{ "rank": 1,
"name": "b",
"addr": "192.168.1.11:6789\/0"},
{ "rank": 2,
"name": "c",
"addr": "192.168.1.12:6789\/0"}
]
}
}
4.8 Checking Placement Group States #
Placement groups map objects to OSDs. When you monitor your placement
groups, you will want them to be active and
clean. For a detailed discussion, refer to
Monitoring
OSDs and Placement Groups.
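For a quick overview of placement group states and to drill down into groups that are not active+clean, you can use the following commands (the placement group ID 3.7 is only an example):
cephadm > ceph pg stat
cephadm > ceph pg dump_stuck inactive
cephadm > ceph pg 3.7 query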
4.9 Using the Admin Socket #
The Ceph admin socket allows you to query a daemon via a socket interface.
By default, Ceph sockets reside under /var/run/ceph.
To access a daemon via the admin socket, log in to the host running the
daemon and use the following command:
cephadm > ceph --admin-daemon /var/run/ceph/socket-name
To view the available admin socket commands, execute the following command:
cephadm > ceph --admin-daemon /var/run/ceph/socket-name help
The admin socket command enables you to show and set your configuration at runtime. Refer to Viewing a Configuration at Runtime for details.
Additionally, you can set configuration values at runtime directly via the
admin socket. It bypasses the monitors, unlike ceph tell
daemon-type.id
injectargs, which relies on the monitors but does not require you to log in
directly to the host in question.
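For example, to inspect and temporarily change the debug level of an OSD daemon through its admin socket (the socket name ceph-osd.0.asok assumes the default naming and OSD ID 0; the value 5/5 is only an example):
cephadm > ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config get debug_osd
cephadm > ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config set debug_osd 5/5
The same change can be made without logging in to the OSD host with ceph tell osd.0 injectargs '--debug-osd 5/5', which goes through the monitors instead.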