33 Hints and Tips #
This chapter provides information to help you enhance the performance of your Ceph cluster and offers tips on setting the cluster up.
33.1 Identifying Orphaned Partitions #
To identify possibly orphaned journal/WAL/DB devices, follow these steps:
Pick the device that may have orphaned partitions and save the list of its partitions to a file:
root@minion > ls /dev/sdd?* > /tmp/partitions

Run readlink against all block.wal, block.db, and journal devices, and compare the output to the previously saved list of partitions:

root@minion > readlink -f /var/lib/ceph/osd/ceph-*/{block.wal,block.db,journal} \
 | sort | comm -23 /tmp/partitions -

The output is the list of partitions that are not used by Ceph.
Remove the orphaned partitions that do not belong to Ceph with your preferred command (for example, fdisk, parted, or sgdisk).
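For example, assuming sgdisk is used and partition 2 on /dev/sdd was identified as orphaned (both the device and the partition number are illustrative), the partition could be removed as follows:

root@minion > sgdisk --delete=2 /dev/sdd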
33.2 Adjusting Scrubbing #
By default, Ceph performs light scrubbing daily (find more details in Section 20.6, “Scrubbing”) and deep scrubbing weekly. Light scrubbing checks object sizes and checksums to ensure that placement groups are storing the same object data. Deep scrubbing checks an object’s content with that of its replicas to ensure that the actual contents are the same. The price for checking data integrity is increased I/O load on the cluster during the scrubbing procedure.
The default settings allow Ceph OSDs to initiate scrubbing at inappropriate times, such as during periods of heavy loads. Customers may experience latency and poor performance when scrubbing operations conflict with their operations. Ceph provides several scrubbing settings that can limit scrubbing to periods with lower loads or during off-peak hours.
If the cluster experiences high loads during the day and low loads late at night, consider restricting scrubbing to night time hours, such as 11pm till 6am:
[osd]
osd_scrub_begin_hour = 23
osd_scrub_end_hour = 6
If time restriction is not an effective method of determining a scrubbing
schedule, consider using the osd_scrub_load_threshold
option. The default value is 0.5, but it could be modified for low load
conditions:
[osd]
osd_scrub_load_threshold = 0.25
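These options are read from ceph.conf when the OSDs start. As a sketch, the same values can usually also be injected into running OSDs without a restart, for example:

cephadm@adm > ceph tell osd.* injectargs '--osd_scrub_load_threshold 0.25'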
33.3 Stopping OSDs without Rebalancing #
You may need to stop OSDs for maintenance periodically. If you do not want
CRUSH to automatically rebalance the cluster in order to avoid huge data
transfers, set the cluster to noout first:
root@minion > ceph osd set noout
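To confirm that the flag is set before you proceed, check the cluster flags, for example:

cephadm@adm > ceph osd dump | grep flags

The output should contain noout.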
When the cluster is set to noout, you can begin stopping
the OSDs within the failure domain that requires maintenance work:
root@minion > systemctl stop ceph-osd@OSD_NUMBER.service

Find more information in Section 16.1.2, “Starting, Stopping, and Restarting Individual Services”.
After you complete the maintenance, start OSDs again:
root@minion > systemctl start ceph-osd@OSD_NUMBER.service
After OSD services are started, unset the cluster from
noout:
cephadm@adm > ceph osd unset noout

33.4 Time Synchronization of Nodes #
Ceph requires precise time synchronization between all nodes.
We recommend synchronizing all Ceph cluster nodes with at least three reliable time sources that are located on the internal network. The internal time sources can point to a public time server or have their own time source.
Important: Public Time Servers
Do not synchronize all Ceph cluster nodes directly with remote public time servers. With such a configuration, each node in the cluster has its own NTP daemon that communicates continually over the Internet with a set of three or four time servers that may provide slightly different times. This solution introduces a large degree of latency variability that makes it difficult or impossible to keep the clock drift under 0.05 seconds, which is what the Ceph monitors require.
For details on how to set up the NTP server, refer to the SUSE Linux Enterprise Server Administration Guide.
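As a minimal sketch of such a setup, assuming three internal time servers named ntp1/ntp2/ntp3.example.com (the host names are illustrative), each cluster node's /etc/chrony.conf could contain:

server ntp1.example.com iburst
server ntp2.example.com iburst
server ntp3.example.com iburst

After editing the file, restart the service with systemctl restart chronyd.service.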
To change the time on your cluster, do the following:
Important: Setting Time
You may face a situation when you need to set the time back, for example if the time changes from summer time to standard time. We do not recommend moving the time backward by a period longer than the cluster downtime. Moving the time forward does not cause any trouble.
Procedure 33.1: Time Synchronization on the Cluster #
Stop all clients accessing the Ceph cluster, especially those using iSCSI.
Shut down your Ceph cluster. On each node run:
root # systemctl stop ceph.target

Note
If you use Ceph and SUSE OpenStack Cloud, also stop the SUSE OpenStack Cloud.
Verify that your NTP server is set up correctly—all chronyd daemons get their time from a source or sources in the local network.

Set the correct time on your NTP server.
Verify that NTP is running and working properly. On all nodes, run:
root # systemctl status chronyd.service

Start all monitoring nodes and verify that there is no clock skew:
root # systemctl start ceph.target

Start all OSD nodes.
Start other Ceph services.
Start the SUSE OpenStack Cloud if you have it.
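Once the services are running again, you can confirm that the monitors report no clock skew, for example by checking the overall cluster health and the monitor time synchronization report:

cephadm@adm > ceph status
cephadm@adm > ceph time-sync-status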
33.5 Checking for Unbalanced Data Writing #
When data is written to OSDs evenly, the cluster is considered balanced. Each OSD within a cluster is assigned its weight. The weight is a relative number and tells Ceph how much of the data should be written to the related OSD. The higher the weight, the more data will be written. If an OSD has zero weight, no data will be written to it. If the weight of an OSD is relatively high compared to other OSDs, a large portion of the data will be written there, which makes the cluster unbalanced.
Unbalanced clusters have poor performance, and in the case that an OSD with a high weight suddenly crashes, a lot of data needs to be moved to other OSDs, which slows down the cluster as well.
To avoid this, you should regularly check OSDs for the amount of data writing. If the amount is between 30% and 50% of the capacity of a group of OSDs specified by a given ruleset, you need to reweight the OSDs. Check for individual disks and find out which of them fill up faster than the others (or are generally slower), and lower their weight. The same is valid for OSDs where not enough data is written—you can increase their weight to have Ceph write more data to them. In the following example, you will find out the weight of an OSD with ID 13, and reweight it from 3 to 3.05:
$ ceph osd tree | grep osd.13
 13  3     osd.13  up  1

$ ceph osd crush reweight osd.13 3.05
reweighted item id 13 name 'osd.13' to 3.05 in crush map

$ ceph osd tree | grep osd.13
 13  3.05  osd.13  up  1
Tip: OSD Reweight by Utilization
The ceph osd reweight-by-utilization
threshold command automates the process of
reducing the weight of OSDs which are heavily overused. By default it will
adjust the weights downward on OSDs which reached 120% of the average
usage, but if you include threshold it will use that percentage instead.
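Before reweighting, it helps to review how full each OSD currently is, for example:

cephadm@adm > ceph osd df tree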
33.6 Btrfs Subvolume for /var/lib/ceph on Ceph Monitor Nodes #
SUSE Linux Enterprise is by default installed on a Btrfs partition. Ceph Monitors store their state
and database in the /var/lib/ceph directory. To prevent
corruption of a Ceph Monitor after a system rollback to a previous snapshot, create
a Btrfs subvolume for /var/lib/ceph. A dedicated
subvolume excludes the monitor data from snapshots of the root subvolume.
Tip
Create the /var/lib/ceph subvolume before running
DeepSea stage 0 because stage 0 installs Ceph related packages and
creates the /var/lib/ceph directory.
DeepSea stage 3 then verifies whether @/var/lib/ceph
is a Btrfs subvolume and fails if it is a normal directory.
33.6.1 Requirements #
33.6.1.1 New Deployments #
Salt and DeepSea need to be properly installed and working.
33.6.1.2 Existing Deployments #
If your cluster is already installed, the following requirements must be met:
Nodes are upgraded to SUSE Enterprise Storage 6 and cluster is under DeepSea control.
Ceph cluster is up and healthy.
Upgrade process has synchronized Salt and DeepSea modules to all minion nodes.
33.6.2 Steps Required during a New Cluster Deployment #
33.6.2.1 Before Running DeepSea stage 0 #
Prior to running DeepSea stage 0, apply the following commands to each of the Salt minions that will become Ceph Monitors:
root@master # salt 'MONITOR_NODES' saltutil.sync_all
root@master # salt 'MONITOR_NODES' state.apply ceph.subvolume
The ceph.subvolume command does the following:
Creates /var/lib/ceph as a @/var/lib/ceph Btrfs subvolume.

Mounts the new subvolume and updates /etc/fstab appropriately.
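To verify the result on a monitor node, you can, for example, inspect the subvolume and the corresponding /etc/fstab entry:

root # btrfs subvolume show /var/lib/ceph
root # grep /var/lib/ceph /etc/fstab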
33.6.2.2 DeepSea stage 3 Validation Fails #
If you forgot to run the commands mentioned in
Section 33.6.2.1, “Before Running DeepSea stage 0” before running stage 0, the
/var/lib/ceph subdirectory already exists, causing
DeepSea stage 3 validation failure. To convert it into a subvolume, do
the following:
Change directory to /var/lib:

cephadm@mon > cd /var/lib

Back up the current content of the ceph subdirectory:

cephadm@mon > sudo mv ceph ceph-

Create the subvolume, mount it, and update /etc/fstab:

root@master # salt 'MONITOR_NODES' state.apply ceph.subvolume

Change to the backup subdirectory, synchronize its content with the new subvolume, then remove it:

cephadm@mon > cd /var/lib/ceph-
cephadm@mon > rsync -av . ../ceph
cephadm@mon > cd ..
cephadm@mon > rm -rf ./ceph-
33.6.3 Steps Required during Cluster Upgrade #
On SUSE Enterprise Storage 5.5, the /var
directory is not on a Btrfs subvolume, but its subfolders (such as
/var/log or /var/cache) are Btrfs
subvolumes under '@'. Creating @/var/lib/ceph
subvolumes requires mounting the '@' subvolume first (it is not mounted by
default) and creating the @/var/lib/ceph subvolume
under it.
Following are example commands that illustrate the process:
root # mkdir -p /mnt/btrfs
root # mount -o subvol=@ ROOT_DEVICE /mnt/btrfs
root # btrfs subvolume create /mnt/btrfs/var/lib/ceph
root # umount /mnt/btrfs
At this point the @/var/lib/ceph subvolume is created
and you can continue as described in
Section 33.6.2, “Steps Required during a New Cluster Deployment”.
33.6.4 Manual Setup #
Automatic setup of the @/var/lib/ceph Btrfs subvolume
on the Ceph Monitor nodes may not be suitable for all scenarios. You can migrate
your /var/lib/ceph directory to a
@/var/lib/ceph subvolume by following these steps:
Terminate running Ceph processes.
Unmount OSDs on the node.
Change to the backup subdirectory, synchronize its content with the new subvolume, then remove it:
cephadm@mon > cd /var/lib/ceph-
cephadm@mon > rsync -av . ../ceph
cephadm@mon > cd ..
cephadm@mon > rm -rf ./ceph-

Remount OSDs.
Restart Ceph daemons.
33.6.5 For More Information #
Find more detailed information about manual setup in the file
/srv/salt/ceph/subvolume/README.md on the Salt master
node.
33.7 Increasing File Descriptors #
For OSD daemons, the read/write operations are critical to keep the Ceph cluster balanced. They often need to have many files open for reading and writing at the same time. On the OS level, the maximum number of simultaneously open files is called 'maximum number of file descriptors'.
To prevent OSDs from running out of file descriptors, you can override the
OS default value and specify the number in
/etc/ceph/ceph.conf, for example:
max_open_files = 131072
After you change max_open_files, you need to restart the
OSD service on the relevant Ceph node.
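To see the limit a running OSD process is actually using, you can inspect its limits in /proc. This is a generic Linux check, not a Ceph command; pgrep -o simply picks one ceph-osd process:

root # grep 'Max open files' /proc/$(pgrep -o ceph-osd)/limits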
33.8 Integration with Virtualization Software #
33.8.1 Storing KVM Disks in Ceph Cluster #
You can create a disk image for a KVM-driven virtual machine, store it in a
Ceph pool, optionally convert the content of an existing image to it, and
then run the virtual machine with qemu-kvm making use of
the disk image stored in the cluster. For more detailed information, see
Chapter 32, Ceph as a Back-end for QEMU KVM Instance.
33.8.2 Storing libvirt Disks in Ceph Cluster #
Similar to KVM (see Section 33.8.1, “Storing KVM Disks in Ceph Cluster”), you
can use Ceph to store virtual machines driven by libvirt. The advantage
is that you can run any libvirt-supported virtualization solution, such
as KVM, Xen, or LXC. For more information, see
Chapter 31, Using libvirt with Ceph.
33.8.3 Storing Xen Disks in Ceph Cluster #
One way to use Ceph for storing Xen disks is to make use of libvirt
as described in Chapter 31, Using libvirt with Ceph.
Another option is to make Xen talk to the rbd
block device driver directly:
If you have no disk image prepared for Xen, create a new one:

cephadm@adm > rbd create myimage --size 8000 --pool mypool

List images in the pool mypool and check if your new image is there:

cephadm@adm > rbd list mypool

Create a new block device by mapping the myimage image to the rbd kernel module:

cephadm@adm > rbd map --pool mypool myimage

Tip: User Name and Authentication

To specify a user name, use --id user-name. Moreover, if you use cephx authentication, you must also specify a secret. It may come from a keyring or a file containing the secret:

cephadm@adm > rbd map --pool rbd myimage --id admin --keyring /path/to/keyring

or

cephadm@adm > rbd map --pool rbd myimage --id admin --keyfile /path/to/file

List all mapped devices:

rbd showmapped
 id pool   image   snap device
 0  mypool myimage -    /dev/rbd0

Now you can configure Xen to use this device as a disk for running a virtual machine. You can, for example, add the following line to the xl-style domain configuration file:

disk = [ '/dev/rbd0,,sda', '/dev/cdrom,,sdc,cdrom' ]
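When the virtual machine no longer uses the image, the block device can be unmapped again, for example:

cephadm@adm > rbd unmap /dev/rbd0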
33.9 Firewall Settings for Ceph #
Warning: DeepSea Stages Fail with Firewall
DeepSea deployment stages fail when the firewall is active (and even when it is only configured). To pass the stages correctly, you need to either turn the firewall off by running
root # systemctl stop SuSEfirewall2.service
or set the FAIL_ON_WARNING option to 'False' in
/srv/pillar/ceph/stack/global.yml:
FAIL_ON_WARNING: False
We recommend protecting the network cluster communication with SUSE Firewall. You can edit its configuration by selecting YaST / Security and Users / Firewall / Allowed Services.
The following is a list of Ceph-related services and the port numbers that they normally use:
- Ceph Monitor
Enable the service or port 6789 (TCP).
- Ceph OSD or Metadata Server
Enable the service or ports 6800-7300 (TCP).
- iSCSI Gateway
Open port 3260 (TCP).
- Object Gateway
Open the port where Object Gateway communication occurs. It is set in /etc/ceph.conf on the line starting with rgw frontends =. Default is 80 for HTTP and 443 for HTTPS (TCP).
- NFS Ganesha
By default, NFS Ganesha uses ports 2049 (NFS service, TCP) and 875 (rquota support, TCP). Refer to Section 30.2.1.4, “Changing Default NFS Ganesha Ports” for more information on changing the default NFS Ganesha ports.
- Apache based services, such as SMT, or SUSE Manager
Open ports 80 for HTTP and 443 for HTTPS (TCP).
- SSH
Open port 22 (TCP).
- NTP
Open port 123 (UDP).
- Salt
Open ports 4505 and 4506 (TCP).
- Grafana
Open port 3000 (TCP).
- Prometheus
Open port 9100 (TCP).
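As an illustration only, not a complete configuration, opening the Ceph Monitor and OSD ports on a node protected by SuSEfirewall2 could look like the following entry in /etc/sysconfig/SuSEfirewall2; adjust the port list to the roles running on the node:

FW_SERVICES_EXT_TCP="22 6789 6800:7300"

Afterward, restart the firewall with systemctl restart SuSEfirewall2.service.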
33.10 Testing Network Performance #
To test the network performance, DeepSea's net runner
provides the following commands:
A simple ping to all nodes:
root@master # salt-run net.ping
Succeeded: 9 addresses from 9 minions average rtt 1.35 ms

A jumbo ping to all nodes:

root@master # salt-run net.jumbo_ping
Succeeded: 9 addresses from 9 minions average rtt 2.13 ms

A bandwidth test:

root@master # salt-run net.iperf
Fastest 2 hosts:
 |_ - 192.168.58.106 - 2981 Mbits/sec
 |_ - 192.168.58.107 - 2967 Mbits/sec
Slowest 2 hosts:
 |_ - 192.168.58.102 - 2857 Mbits/sec
 |_ - 192.168.58.103 - 2842 Mbits/sec

Tip: Stop 'iperf3' Processes Manually

When running a test using the net.iperf runner, the 'iperf3' server processes that are started do not stop automatically when a test is completed. To stop the processes, use the following runner:

root@master # salt '*' multi.kill_iperf_cmd
33.11 How to Locate Physical Disks Using LED Lights #
This section describes using libstoragemgmt and/or
third party tools to adjust the LED lights on physical disks. This
capability may not be available for all hardware platforms.
Matching an OSD disk to a physical disk can be challenging, especially on
nodes with a high density of disks. Some hardware environments include LED
lights that can be adjusted via software to flash or illuminate a different
color for identification purposes. SUSE Enterprise Storage offers support for this
capability through Salt, libstoragemgmt, and
third party tools specific to the hardware in use. The configuration for
this capability is defined in the
/srv/pillar/ceph/disk_led.sls Salt pillar:
root # cat /srv/pillar/ceph/disk_led.sls
# This is the default configuration for the storage enclosure LED blinking.
# The placeholder {device_file} will be replaced with the device file of
# the disk when the command is executed.
#
# Have a look into the /srv/pillar/ceph/README file to find out how to
# customize this configuration per minion/host.
disk_led:
cmd:
ident:
'on': lsmcli local-disk-ident-led-on --path '{device_file}'
'off': lsmcli local-disk-ident-led-off --path '{device_file}'
fault:
'on': lsmcli local-disk-fault-led-on --path '{device_file}'
'off': lsmcli local-disk-fault-led-off --path '{device_file}'
The default configuration for disk_led.sls offers disk
LED support through the libstoragemgmt layer. The
libstoragemgmt layer provides this support through
a hardware-specific plug-in and third party tools. The default behavior can
be customized using DeepSea by adding the following configuration for the ledctl tool (part of the ledmon package) to /srv/pillar/ceph/stack/global.yml (or any other YAML file in /stack/ceph/):
disk_led:
cmd:
ident:
'on': ledctl locate='{device_file}'
'off': ledctl locate_off='{device_file}'
fault:
'on': ledctl locate='{device_file}'
'off': ledctl locate_off='{device_file}'
If the customization should only apply to a specific node (minion), then the file stack/ceph/minions/{{minion}}.yml needs to be used (see the example below).
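For example, a per-minion override that switches to ledctl might look like this; the minion name and the full path under /srv/pillar/ceph/stack/ceph/minions/ are illustrative:

root # cat /srv/pillar/ceph/stack/ceph/minions/osd1.example.com.yml
disk_led:
  cmd:
    ident:
      'on': ledctl locate='{device_file}'
      'off': ledctl locate_off='{device_file}'
    fault:
      'on': ledctl locate='{device_file}'
      'off': ledctl locate_off='{device_file}'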
With or without libstoragemgmt, third party tools
may be required to adjust LED lights. These third party tools are available
through various hardware vendors. Some of the common vendors and tools are:
Table 33.1: Third Party Storage Tools #
| Vendor/Disk Controller | Tool |
|---|---|
| HPE SmartArray | hpssacli |
| LSI MegaRAID | storcli |
SUSE Linux Enterprise Server also provides the ledmon package and
ledctl tool. This tool may also work for hardware
environments utilizing Intel storage enclosures. Proper syntax when using
this tool is as follows:
root # cat /srv/pillar/ceph/disk_led.sls
disk_led:
cmd:
ident:
'on': ledctl locate='{device_file}'
'off': ledctl locate_off='{device_file}'
fault:
'on': ledctl locate='{device_file}'
'off': ledctl locate_off='{device_file}'

If you are on supported hardware, with all required third party tools, LEDs can be enabled or disabled using the following command syntax from the Salt master node:
root # salt-run disk_led.device NODE DISK fault|ident on|off
For example, to enable or disable LED identification or fault lights on
/dev/sdd on OSD node srv16.ceph,
run the following:
root # salt-run disk_led.device srv16.ceph sdd ident on
root # salt-run disk_led.device srv16.ceph sdd ident off
root # salt-run disk_led.device srv16.ceph sdd fault on
root # salt-run disk_led.device srv16.ceph sdd fault off
Note: Device Naming
The device name used in the salt-run command needs to
match the name recognized by Salt. The following command can be used to
display these names:
root@master # salt 'minion_name' grains.get disks
In many environments, the /srv/pillar/ceph/disk_led.sls
configuration will require changes in order to adjust the LED lights for
specific hardware needs. Simple changes may be performed by replacing
lsmcli with another tool, or adjusting command line
parameters. Complex changes may be accomplished by calling an external
script in place of the lsmcli command. When making any
changes to /srv/pillar/ceph/disk_led.sls, follow these
steps:
Make required changes to /srv/pillar/ceph/disk_led.sls on the Salt master node.

Verify that the changes are reflected correctly in the pillar data:

root # salt 'SALT MASTER*' pillar.get disk_led

Refresh the pillar data on all nodes using:

root # salt '*' saltutil.pillar_refresh
It is possible to use an external script to directly use third-party tools
to adjust LED lights. The following examples show how to adjust
/srv/pillar/ceph/disk_led.sls to support an external
script, and two sample scripts for HP and LSI environments.
Modified /srv/pillar/ceph/disk_led.sls which calls an
external script:
root # cat /srv/pillar/ceph/disk_led.sls
disk_led:
cmd:
ident:
'on': /usr/local/bin/flash_led.sh '{device_file}' on
'off': /usr/local/bin/flash_led.sh '{device_file}' off
fault:
'on': /usr/local/bin/flash_led.sh '{device_file}' on
'off': /usr/local/bin/flash_led.sh '{device_file}' off
Sample script for flashing LED lights on HP hardware using the
hpssacli utilities:
root # cat /usr/local/bin/flash_led_hp.sh
#!/bin/bash
# params:
# $1 device (e.g. /dev/sda)
# $2 on|off
FOUND=0
MAX_CTRLS=10
MAX_DISKS=50
for i in $(seq 0 $MAX_CTRLS); do
# Search for valid controllers
if hpssacli ctrl slot=$i show summary >/dev/null; then
# Search all disks on the current controller
for j in $(seq 0 $MAX_DISKS); do
if hpssacli ctrl slot=$i ld $j show | grep -q $1; then
FOUND=1
echo "Found $1 on ctrl=$i, ld=$j. Turning LED $2."
hpssacli ctrl slot=$i ld $j modify led=$2
break;
fi
done
[[ "$FOUND" = "1" ]] && break
fi
done
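Assuming the script is saved as /usr/local/bin/flash_led_hp.sh and marked executable, it takes the device and the desired LED state as arguments, for example:

root # chmod +x /usr/local/bin/flash_led_hp.sh
root # /usr/local/bin/flash_led_hp.sh /dev/sda on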
Sample script for flashing LED lights on LSI hardware using the
storcli utilities:
root # cat /usr/local/bin/flash_led_lsi.sh
#!/bin/bash
# params:
# $1 device (e.g. /dev/sda)
# $2 on|off
[[ "$2" = "on" ]] && ACTION="start" || ACTION="stop"
# Determine serial number for the disk
SERIAL=$(lshw -class disk | grep -A2 $1 | grep serial | awk '{print $NF}')
if [ ! -z "$SERIAL" ]; then
# Search for disk serial number across all controllers and enclosures
DEVICE=$(/opt/MegaRAID/storcli/storcli64 /call/eall/sall show all | grep -B6 $SERIAL | grep Drive | awk '{print $2}')
if [ ! -z "$DEVICE" ]; then
echo "Found $1 on device $DEVICE. Turning LED $2."
/opt/MegaRAID/storcli/storcli64 $DEVICE $ACTION locate
else
echo "Device not found!"
exit -1
fi
else
echo "Disk serial number not found!"
exit -1
fi