4 Determining Cluster State #
When you have a running cluster, you may use the ceph tool
to monitor it. Determining the cluster state typically involves checking the
status of Ceph OSDs, Ceph Monitors, placement groups and Metadata Servers.
Tip: Interactive Mode
To run the ceph tool in an interactive mode, type
ceph at the command line with no arguments. The
interactive mode is more convenient if you intend to enter several
ceph commands in a row. For example:
cephadm > ceph
ceph> health
ceph> status
ceph> quorum_status
ceph> mon_status
4.1 Checking a Cluster's Status #
To check a cluster's status, execute the following:
cephadm > ceph status
or
cephadm > ceph -s
In interactive mode, type status and press
Enter.
ceph> status
Ceph will print the cluster status. For example, a tiny Ceph cluster consisting of one monitor and two OSDs may print the following:
cluster b370a29d-9287-4ca3-ab57-3d824f65e339
health HEALTH_OK
monmap e1: 1 mons at {ceph1=10.0.0.8:6789/0}, election epoch 2, quorum 0 ceph1
osdmap e63: 2 osds: 2 up, 2 in
pgmap v41332: 952 pgs, 20 pools, 17130 MB data, 2199 objects
115 GB used, 167 GB / 297 GB avail
1 active+clean+scrubbing+deep
951 active+clean
4.2 Checking Cluster Health #
After you start your cluster and before you start reading and/or writing data, check your cluster's health:
cephadm > ceph health
HEALTH_WARN 10 pgs degraded; 100 pgs stuck unclean; 1 mons down, quorum 0,2 \
node-1,node-2,node-3
Tip
If you specified non-default locations for your configuration or keyring, you may specify their locations:
cephadm > ceph -c /path/to/conf -k /path/to/keyring health
The Ceph cluster returns one of the following health codes:
- OSD_DOWN
One or more OSDs are marked down. The OSD daemon may have been stopped, or peer OSDs may be unable to reach the OSD over the network. Common causes include a stopped or crashed daemon, a down host, or a network outage.
Verify that the host is healthy, the daemon is started, and the network is functioning. If the daemon has crashed, the daemon log file (
/var/log/ceph/ceph-osd.*) may contain debugging information.
- OSD_crush type_DOWN, for example OSD_HOST_DOWN
All the OSDs within a particular CRUSH subtree are marked down, for example all OSDs on a host.
- OSD_ORPHAN
An OSD is referenced in the CRUSH map hierarchy but does not exist. The OSD can be removed from the CRUSH hierarchy with:
cephadm > ceph osd crush rm osd.ID
- OSD_OUT_OF_ORDER_FULL
The usage thresholds for nearfull (defaults to 0.85), backfillfull (defaults to 0.90), full (defaults to 0.95), and/or failsafe_full are not ascending. In particular, we expect nearfull < backfillfull, backfillfull < full, and full < failsafe_full.
To read the current values, run:
cephadm > ceph health detail
HEALTH_ERR 1 full osd(s); 1 backfillfull osd(s); 1 nearfull osd(s)
osd.3 is full at 97%
osd.4 is backfill full at 91%
osd.2 is near full at 87%
The thresholds can be adjusted with the following commands:
cephadm > ceph osd set-backfillfull-ratio ratio
cephadm > ceph osd set-nearfull-ratio ratio
cephadm > ceph osd set-full-ratio ratio
- OSD_FULL
One or more OSDs has exceeded the full threshold and is preventing the cluster from servicing writes. Usage by pool can be checked with:
cephadm > ceph df
The currently defined full ratio can be seen with:
cephadm > ceph osd dump | grep full_ratio
A short-term workaround to restore write availability is to raise the full threshold by a small amount:
cephadm > ceph osd set-full-ratio ratio
Add new storage to the cluster by deploying more OSDs, or delete existing data in order to free up space.
- OSD_BACKFILLFULL
One or more OSDs has exceeded the backfillfull threshold, which prevents data from rebalancing to this device. This is an early warning that rebalancing may not be able to complete and that the cluster is approaching full. Usage by pool can be checked with:
cephadm > ceph df
- OSD_NEARFULL
One or more OSDs has exceeded the nearfull threshold. This is an early warning that the cluster is approaching full. Usage by pool can be checked with:
cephadm > ceph df
- OSDMAP_FLAGS
One or more cluster flags of interest has been set. With the exception of full, these flags can be set or cleared with:
cephadm > ceph osd set flag
cephadm > ceph osd unset flag
These flags include the following (a typical use of these commands is sketched in the example after this list of health codes):
- full
The cluster is flagged as full and cannot service writes.
- pauserd, pausewr
Paused reads or writes.
- noup
OSDs are not allowed to start.
- nodown
OSD failure reports are being ignored, such that the monitors will not mark OSDs down.
- noin
OSDs that were previously marked out will not be marked back in when they start.
- noout
Down OSDs will not automatically be marked out after the configured interval.
- nobackfill, norecover, norebalance
Recovery or data rebalancing is suspended.
- noscrub, nodeep-scrub
Scrubbing (see Section 7.5, “Scrubbing”) is disabled.
- notieragent
Cache tiering activity is suspended.
- OSD_FLAGS
One or more OSDs has a per-OSD flag of interest set. These flags include:
- noup
OSD is not allowed to start.
- nodown
Failure reports for this OSD will be ignored.
- noin
If this OSD was previously marked out automatically after a failure, it will not be marked in when it starts.
- noout
If this OSD is down, it will not be automatically marked out after the configured interval.
Per-OSD flags can be set and cleared with:
cephadm > ceph osd add-flag osd-ID
cephadm > ceph osd rm-flag osd-ID
- OLD_CRUSH_TUNABLES
The CRUSH Map is using very old settings and should be updated. The oldest tunables that can be used (that is the oldest client version that can connect to the cluster) without triggering this health warning is determined by the
mon_crush_min_required_version configuration option.
- OLD_CRUSH_STRAW_CALC_VERSION
The CRUSH Map is using an older, non-optimal method for calculating intermediate weight values for straw buckets. The CRUSH Map should be updated to use the newer method (
straw_calc_version=1).
- CACHE_POOL_NO_HIT_SET
One or more cache pools is not configured with a hit set to track usage, which prevents the tiering agent from identifying cold objects to flush and evict from the cache. Hit sets can be configured on the cache pool with:
cephadm > ceph osd pool set poolname hit_set_type type
cephadm > ceph osd pool set poolname hit_set_period period-in-seconds
cephadm > ceph osd pool set poolname hit_set_count number-of-hitsets
cephadm > ceph osd pool set poolname hit_set_fpp target-false-positive-rate
For more information on cache tiering, see Chapter 11, Cache Tiering.
- OSD_NO_SORTBITWISE
No pre-luminous v12 OSDs are running but the
sortbitwise flag has not been set. You need to set the sortbitwise flag before luminous v12 or newer OSDs can start:
cephadm > ceph osd set sortbitwise
- POOL_FULL
One or more pools has reached its quota and is no longer allowing writes. You can check pool quotas and usage with:
cephadm > ceph df detail
You can either raise the pool quota with
cephadm > ceph osd pool set-quota poolname max_objects num-objects
cephadm > ceph osd pool set-quota poolname max_bytes num-bytes
or delete some existing data to reduce usage.
- PG_AVAILABILITY
Data availability is reduced, meaning that the cluster is unable to service potential read or write requests for some data in the cluster. Specifically, one or more PGs is in a state that does not allow IO requests to be serviced. Problematic PG states include peering, stale, incomplete, and the lack of active (if those conditions do not clear quickly). Detailed information about which PGs are affected is available from:
cephadm > ceph health detail
In most cases the root cause is that one or more OSDs is currently down. The state of specific problematic PGs can be queried with:
cephadm > ceph tell pgid query
- PG_DEGRADED
Data redundancy is reduced for some data, meaning the cluster does not have the desired number of replicas for all data (for replicated pools) or erasure code fragments (for erasure coded pools). Specifically, one or more PGs have either the degraded or undersized flag set (there are not enough instances of that placement group in the cluster), or have not had the clean flag set for some time. Detailed information about which PGs are affected is available from:
cephadm > ceph health detail
In most cases the root cause is that one or more OSDs is currently down. The state of specific problematic PGs can be queried with:
cephadm > ceph tell pgid query
- PG_DEGRADED_FULL
Data redundancy may be reduced or at risk for some data because of a lack of free space in the cluster. Specifically, one or more PGs has the backfill_toofull or recovery_toofull flag set, meaning that the cluster is unable to migrate or recover data because one or more OSDs is above the backfillfull threshold.
- PG_DAMAGED
Data scrubbing (see Section 7.5, “Scrubbing”) has discovered some problems with data consistency in the cluster. Specifically, one or more PGs has the inconsistent or snaptrim_error flag set, indicating that an earlier scrub operation found a problem, or has the repair flag set, meaning a repair for such an inconsistency is currently in progress.
- OSD_SCRUB_ERRORS
Recent OSD scrubs have uncovered inconsistencies.
- CACHE_POOL_NEAR_FULL
A cache tier pool is nearly full. Full in this context is determined by the target_max_bytes and target_max_objects properties on the cache pool. When the pool reaches the target threshold, write requests to the pool may block while data is flushed and evicted from the cache, a state that normally leads to very high latencies and poor performance. The cache pool target size can be adjusted with:
cephadm > ceph osd pool set cache-pool-name target_max_bytes bytes
cephadm > ceph osd pool set cache-pool-name target_max_objects objects
Normal cache flush and evict activity may also be throttled because of reduced availability or performance of the base tier, or overall cluster load.
Find more information about cache tiering in Chapter 11, Cache Tiering.
- TOO_FEW_PGS
The number of PGs in use is below the configurable threshold of
mon_pg_warn_min_per_osd PGs per OSD. This can lead to suboptimal distribution and balance of data across the OSDs in the cluster, and reduce overall performance.
See Placement Groups for details on calculating an appropriate number of placement groups for your pool.
- TOO_MANY_PGS
The number of PGs in use is above the configurable threshold of
mon_pg_warn_max_per_osd PGs per OSD. This can lead to higher memory usage for OSD daemons, slower peering after cluster state changes (for example OSD restarts, additions, or removals), and higher load on the Ceph Managers and Ceph Monitors.
While the pg_num value for existing pools cannot be reduced, the pgp_num value can. This effectively collocates some PGs on the same sets of OSDs, mitigating some of the negative impacts described above. The pgp_num value can be adjusted with:
cephadm > ceph osd pool set pool pgp_num value
- SMALLER_PGP_NUM
One or more pools has a
pgp_num value less than pg_num. This is normally an indication that the PG count was increased without also increasing the placement behavior. This is normally resolved by setting pgp_num to match pg_num, triggering the data migration, with:
cephadm > ceph osd pool set pool pgp_num pg_num_value
- MANY_OBJECTS_PER_PG
One or more pools have an average number of objects per PG that is significantly higher than the overall cluster average. The specific threshold is controlled by the
mon_pg_warn_max_object_skew configuration value. This is usually an indication that the pool(s) containing most of the data in the cluster have too few PGs, and/or that other pools that do not contain as much data have too many PGs. The threshold can be raised to silence the health warning by adjusting the mon_pg_warn_max_object_skew configuration option on the monitors.
- POOL_APP_NOT_ENABLED
A pool exists that contains one or more objects but has not been tagged for use by a particular application. Resolve this warning by labeling the pool for use by an application. For example, if the pool is used by RBD:
cephadm > rbd pool init pool_name
If the pool is being used by a custom application 'foo', you can also label it using the low-level command:
cephadm > ceph osd pool application enable pool_name foo
- POOL_FULL
One or more pools has reached (or is very close to reaching) its quota. The threshold to trigger this error condition is controlled by the
mon_pool_quota_crit_threshold configuration option. Pool quotas can be adjusted up or down (or removed) with:
cephadm > ceph osd pool set-quota pool max_bytes bytes
cephadm > ceph osd pool set-quota pool max_objects objects
Setting the quota value to 0 will disable the quota.
- POOL_NEAR_FULL
One or more pools are approaching their quota. The threshold to trigger this warning condition is controlled by the
mon_pool_quota_warn_threshold configuration option. Pool quotas can be adjusted up or down (or removed) with:
cephadm > ceph osd pool set-quota pool max_bytes bytes
cephadm > ceph osd pool set-quota pool max_objects objects
Setting the quota value to 0 will disable the quota.
- OBJECT_MISPLACED
One or more objects in the cluster are not stored on the node where the cluster wants them. This is an indication that data migration caused by a recent cluster change has not yet completed. Misplaced data is not a dangerous condition in itself. Data consistency is never at risk, and old copies of objects are never removed until the desired number of new copies (in the desired locations) are present.
- OBJECT_UNFOUND
One or more objects in the cluster cannot be found. Specifically, the OSDs know that a new or updated copy of an object should exist, but a copy of that version of the object has not been found on OSDs that are currently online. Read or write requests to the 'unfound' objects will be blocked. Ideally, a down OSD can be brought back online that has the more recent copy of the unfound object. Candidate OSDs can be identified from the peering state for the PG(s) responsible for the unfound object:
cephadm > ceph tell pgid query
- REQUEST_SLOW
One or more OSD requests is taking a long time to process. This can be an indication of extreme load, a slow storage device, or a software bug. You can query the request queue on the OSD(s) in question with the following command executed from the OSD host:
cephadm > ceph daemon osd.id ops
You can see a summary of the slowest recent requests:
cephadm > ceph daemon osd.id dump_historic_ops
You can find the location of an OSD with:
cephadm > ceph osd find osd.id
- REQUEST_STUCK
One or more OSD requests have been blocked for an extended time, for example 4096 seconds. This is an indication that either the cluster has been unhealthy for an extended period of time (for example, not enough running OSDs or inactive PGs) or there is some internal problem with the OSD.
- PG_NOT_SCRUBBED
One or more PGs have not been scrubbed (see Section 7.5, “Scrubbing”) recently. PGs are normally scrubbed every
mon_scrub_interval seconds, and this warning triggers when mon_warn_not_scrubbed such intervals have elapsed without a scrub. PGs will not scrub if they are not flagged as clean, which may happen if they are misplaced or degraded (see PG_AVAILABILITY and PG_DEGRADED above). You can manually initiate a scrub of a clean PG with:
cephadm > ceph pg scrub pgid
- PG_NOT_DEEP_SCRUBBED
One or more PGs has not been deep scrubbed (see Section 7.5, “Scrubbing”) recently. PGs are normally deep scrubbed every
osd_deep_scrub_interval seconds, and this warning triggers when mon_warn_not_deep_scrubbed seconds have elapsed without a deep scrub. PGs will not (deep) scrub if they are not flagged as clean, which may happen if they are misplaced or degraded (see PG_AVAILABILITY and PG_DEGRADED above). You can manually initiate a deep scrub of a clean PG with:
cephadm > ceph pg deep-scrub pgid
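A typical use of the cluster flags described under OSDMAP_FLAGS above is to set noout before planned maintenance, so that OSDs that go down are not marked out and no rebalancing starts, and to clear the flag afterward. The following is a minimal sketch of that pattern; verify that the flag is listed in the cluster status while it is set:
cephadm > ceph osd set noout
cephadm > ceph -s
cephadm > ceph osd unset noout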
4.3 Watching a Cluster #
You can find the immediate state of the cluster using ceph
-s. For example, a small Ceph cluster consisting of three monitors
and four OSDs may print the following when a workload is running:
cephadm > ceph -s
cluster:
id: ea4cf6ce-80c6-3583-bb5e-95fa303c893f
health: HEALTH_WARN
too many PGs per OSD (408 > max 300)
services:
mon: 3 daemons, quorum ses5min1,ses5min3,ses5min2
mgr: ses5min1(active), standbys: ses5min3, ses5min2
mds: cephfs-1/1/1 up {0=ses5min3=up:active}
osd: 4 osds: 4 up, 4 in
rgw: 1 daemon active
data:
pools: 8 pools, 544 pgs
objects: 253 objects, 3821 bytes
usage: 6252 MB used, 13823 MB / 20075 MB avail
pgs:     544 active+clean
The output provides the following information:
Cluster ID
Cluster health status
The monitor map epoch and the status of the monitor quorum
The OSD map epoch and the status of OSDs
The status of Ceph Managers
The status of Object Gateways
The placement group map version
The number of placement groups and pools
The notional amount of data stored and the number of objects stored
The total amount of data stored
Tip: How Ceph Calculates Data Usage
The used value reflects the actual amount of raw storage
used. The xxx GB / xxx GB value shows the amount of storage
available (the smaller number) out of the overall storage capacity of the
cluster. The notional number reflects the size of the stored data before it
is replicated, cloned or snapshotted. Therefore, the amount of data actually
stored typically exceeds the notional amount stored, because Ceph creates
replicas of the data and may also use storage capacity for cloning and
snapshotting.
Other commands that display immediate status information are:
ceph pg stat
ceph osd pool stats
ceph df
ceph df detail
To get the information updated in real time, put any of these commands
(including ceph -s) as an argument of the
watch command:
root # watch -n 10 'ceph -s'
Press Ctrl-C when you are tired of watching.
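If you prefer to follow cluster events as they happen rather than poll the status, you can also run ceph -w, which prints the current status once and then streams new cluster log messages until you press Ctrl-C:
cephadm > ceph -w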
4.4 Checking a Cluster's Usage Stats #
To check a cluster’s data usage and distribution among pools, use the
ceph df command. To get more details, use ceph
df detail.
cephadm > ceph df
GLOBAL:
SIZE AVAIL RAW USED %RAW USED
65886G 45826G 7731M 16
POOLS:
NAME ID USED %USED MAX AVAIL OBJECTS
data 1 1726M 10 17676G 1629
rbd 4 5897M 27 22365G 3547
ecpool 6 69M 0.2 35352G 31
[...]
The GLOBAL section of the output provides an overview of
the amount of storage your cluster uses for your data.
SIZE: The overall storage capacity of the cluster.
AVAIL: The amount of free space available in the cluster.
RAW USED: The amount of raw storage used.
% RAW USED: The percentage of raw storage used. Use this number in conjunction with the full ratio and near full ratio to ensure that you are not reaching your cluster’s capacity. See Storage Capacity for additional details.
Note: Cluster Fill Level
When a raw storage fill level is getting close to 100%, you need to add new storage to the cluster. A higher usage may lead to single full OSDs and cluster health problems.
Use the command
ceph osd df tree to list the fill level of all OSDs.
The POOLS section of the output provides a list of pools
and the notional usage of each pool. The output from this section
does not reflect replicas, clones or snapshots. For
example, if you store an object with 1MB of data, the notional usage will be
1MB, but the actual usage may be 2MB or more depending on the number of
replicas, clones and snapshots.
NAME: The name of the pool.
ID: The pool ID.
USED: The notional amount of data stored in kilobytes, unless the number appends M for megabytes or G for gigabytes.
%USED: The notional percentage of storage used per pool.
MAX AVAIL: The maximum available space in the given pool.
OBJECTS: The notional number of objects stored per pool.
Note
The numbers in the POOLS section are notional. They do not include replicas, snapshots or clones. As a result, the sum of the USED and %USED amounts will not add up to the RAW USED and %RAW USED amounts in the GLOBAL section of the output.
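To see how the notional numbers relate to raw usage for a particular replicated pool, check the pool's replication factor (the pool name rbd is only an example):
cephadm > ceph osd pool get rbd size
If the reported size is 3, each megabyte of notional USED data occupies roughly three megabytes of raw storage, ignoring journaling and file system overhead.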
4.5 Checking OSD Status #
You can check OSDs to ensure that they are up and in by executing:
cephadm > ceph osd stat
or
cephadm > ceph osd dump
You can also view OSDs according to their position in the CRUSH map.
cephadm > ceph osd tree
Ceph will print a CRUSH tree with a host, its OSDs, whether they are up and their weight.
# id    weight  type name       up/down reweight
-1      3       pool default
-3      3       rack mainrack
-2      3       host osd-host
0       1       osd.0           up      1
1       1       osd.1           up      1
2       1       osd.2           up      1
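To see additional details about a single OSD, such as the host it runs on, you can query its metadata (the OSD ID 0 is only an example):
cephadm > ceph osd metadata 0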
4.6 Checking for Full OSDs #
Ceph prevents you from writing to a full OSD so that you do not lose data.
In an operational cluster, you should receive a warning when your cluster is
getting near its full ratio. The mon osd full ratio
defaults to 0.95, or 95% of capacity before it stops clients from writing
data. The mon osd nearfull ratio defaults to 0.85, or 85%
of capacity, when it generates a health warning.
Full OSD nodes will be reported by ceph health:
cephadm > ceph health
HEALTH_WARN 1 nearfull osds
osd.2 is near full at 85%
or
cephadm > ceph health
HEALTH_ERR 1 nearfull osds, 1 full osds
osd.2 is near full at 85%
osd.3 is full at 97%
The best way to deal with a full cluster is to add new OSD hosts/disks, allowing the cluster to redistribute data to the newly available storage.
Tip: Preventing Full OSDs
After an OSD uses 100% of its disk space and becomes full, it will normally crash quickly without warning. The following are a few tips to remember when administering OSD nodes.
Each OSD's disk space (usually mounted under
/var/lib/ceph/osd/osd-{1,2..}) needs to be placed on a dedicated underlying disk or partition.
Check the Ceph configuration files and make sure that Ceph does not store its log file to the disks/partitions dedicated for use by OSDs.
Make sure that no other process writes to the disks/partitions dedicated for use by OSDs.
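To spot OSDs that are approaching the nearfull or full ratios before they become a problem, list the per-OSD utilization together with the currently configured ratios. This is a suggested routine check rather than part of the health output above:
cephadm > ceph osd df
cephadm > ceph osd dump | grep full_ratio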
4.7 Checking Monitor Status #
After you start the cluster and before first reading and/or writing data, check the Ceph Monitors quorum status. When the cluster is already serving requests, check the Ceph Monitors status periodically to ensure that they are running.
To display the monitor map, execute the following:
cephadm > ceph mon stat
or
cephadm > ceph mon dump
To check the quorum status for the monitor cluster, execute the following:
cephadm > ceph quorum_status
Ceph will return the quorum status. For example, a Ceph cluster consisting of three monitors may return the following:
{ "election_epoch": 10,
"quorum": [
0,
1,
2],
"monmap": { "epoch": 1,
"fsid": "444b489c-4f16-4b75-83f0-cb8097468898",
"modified": "2011-12-12 13:28:27.505520",
"created": "2011-12-12 13:28:27.505520",
"mons": [
{ "rank": 0,
"name": "a",
"addr": "192.168.1.10:6789\/0"},
{ "rank": 1,
"name": "b",
"addr": "192.168.1.11:6789\/0"},
{ "rank": 2,
"name": "c",
"addr": "192.168.1.12:6789\/0"}
]
}
}
4.8 Checking Placement Group States #
Placement groups map objects to OSDs. When you monitor your placement
groups, you will want them to be active and
clean. For a detailed discussion, refer to
Monitoring
OSDs and Placement Groups.
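For a quick overview of placement group states and to drill down into groups that are not active+clean, you can use the following commands (the placement group ID 3.7 is only an example):
cephadm > ceph pg stat
cephadm > ceph pg dump_stuck inactive
cephadm > ceph pg 3.7 query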
4.9 Using the Admin Socket #
The Ceph admin socket allows you to query a daemon via a socket interface.
By default, Ceph sockets reside under /var/run/ceph.
To access a daemon via the admin socket, log in to the host running the
daemon and use the following command:
cephadm > ceph --admin-daemon /var/run/ceph/socket-name
To view the available admin socket commands, execute the following command:
cephadm > ceph --admin-daemon /var/run/ceph/socket-name help
The admin socket command enables you to show and set your configuration at runtime. Refer to Viewing a Configuration at Runtime for details.
Additionally, you can set configuration values at runtime directly via the
admin socket. It bypasses the monitors, unlike ceph tell
daemon-type.id
injectargs, which relies on the monitors but does not require you to log in
directly to the host in question.
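For example, to inspect and temporarily change the debug level of an OSD daemon through its admin socket (the socket name ceph-osd.0.asok assumes the default naming and OSD ID 0; the value 5/5 is only an example):
cephadm > ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config get debug_osd
cephadm > ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config set debug_osd 5/5
The same change can be made without logging in to the OSD host with ceph tell osd.0 injectargs '--debug-osd 5/5', which goes through the monitors instead.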