20 Hints and Tips #
This chapter provides information to help you enhance the performance of your Ceph cluster and offers tips on setting the cluster up.
20.1 Identifying Orphaned Partitions #
To identify possibly orphaned journal/WAL/DB devices, follow these steps:
Pick the device that may have orphaned partitions and save the list of its partitions to a file:
root@minion > ls /dev/sdd?* > /tmp/partitions
Run readlink against all block.wal, block.db, and journal devices, and compare the output to the previously saved list of partitions:
root@minion > readlink -f /var/lib/ceph/osd/ceph-*/{block.wal,block.db,journal} \
 | sort | comm -23 /tmp/partitions -
The output is the list of partitions that are not used by Ceph.
Remove the orphaned partitions that do not belong to Ceph with your preferred command (for example fdisk, parted, or sgdisk).
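For example, assuming the comparison above showed that partition 3 on /dev/sdd is orphaned (both the device and the partition number are illustrative only), it could be deleted with sgdisk:
root@minion > sgdisk --delete=3 /dev/sdd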
20.2 Adjusting Scrubbing #
By default, Ceph performs light scrubbing (find more details in Section 7.5, “Scrubbing”) daily and deep scrubbing weekly. Light scrubbing checks object sizes and checksums to ensure that placement groups are storing the same object data. Deep scrubbing compares an object's content with that of its replicas to ensure that the actual contents are the same. The price for checking data integrity is increased I/O load on the cluster during the scrubbing procedure.
The default settings allow Ceph OSDs to initiate scrubbing at inappropriate times, such as during periods of heavy loads. Customers may experience latency and poor performance when scrubbing operations conflict with their operations. Ceph provides several scrubbing settings that can limit scrubbing to periods with lower loads or during off-peak hours.
If the cluster experiences high loads during the day and low loads late at night, consider restricting scrubbing to night time hours, such as 11pm till 6am:
[osd]
osd_scrub_begin_hour = 23
osd_scrub_end_hour = 6
If time restriction is not an effective method of determining a scrubbing
schedule, consider using the osd_scrub_load_threshold
option. The default value is 0.5, but it could be modified for low load
conditions:
[osd]
osd_scrub_load_threshold = 0.25
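If you want to apply the new values to running OSDs without waiting for a restart, you can, for example, inject them at runtime (the values match the examples above):
cephadm > ceph tell osd.* injectargs '--osd_scrub_begin_hour 23 --osd_scrub_end_hour 6'
cephadm > ceph tell osd.* injectargs '--osd_scrub_load_threshold 0.25'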
20.3 Stopping OSDs without Rebalancing #
You may need to stop OSDs for maintenance periodically. If you want to prevent CRUSH from automatically rebalancing the cluster and thus avoid large data transfers, set the cluster's noout flag first:
root@minion > ceph osd set noout
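You can verify that the flag is set, for example by listing the cluster flags; the output should include noout:
root@minion > ceph osd dump | grep flags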
When the cluster is set to noout, you can begin stopping
the OSDs within the failure domain that requires maintenance work:
root@minion > systemctl stop ceph-osd@OSD_NUMBER.service
Find more information in Section 3.1.2, “Starting, Stopping, and Restarting Individual Services”.
After you complete the maintenance, start OSDs again:
root@minion > systemctl start ceph-osd@OSD_NUMBER.service
After the OSD services are started, unset the cluster's noout flag:
cephadm > ceph osd unset noout
20.4 Time Synchronization of Nodes #
Ceph requires precise time synchronization between all nodes.
We recommend synchronizing all Ceph cluster nodes with at least three reliable time sources that are located on the internal network. The internal time sources can point to a public time server or have their own time source.
Important: Public Time Servers
Do not synchronize all Ceph cluster nodes directly with remote public time servers. With such a configuration, each node in the cluster has its own NTP daemon that communicates continually over the Internet with a set of three or four time servers that may provide slightly different time. This solution introduces a large degree of latency variability that makes it difficult or impossible to keep the clock drift under 0.05 seconds which is what the Ceph monitors require.
For details on how to set up the NTP server, refer to the SUSE Linux Enterprise Server Administration Guide.
Then, to change the time on your cluster, do the following:
Important: Setting Time
You may face a situation when you need to set the time back, for example if the time changes from summer to standard time. We do not recommend moving the time backward by a period longer than the cluster's downtime. Moving the time forward does not cause any trouble.
Procedure 20.1: Time Synchronization on the Cluster #
Stop all clients accessing the Ceph cluster, especially those using iSCSI.
Shut down your Ceph cluster. On each node run:
root # systemctl stop ceph.target
Note
If you use Ceph and SUSE OpenStack Cloud, also stop SUSE OpenStack Cloud.
Verify that your NTP server is set up correctly—all ntpd daemons get their time from a source or sources in the local network.
Set the correct time on your NTP server.
Verify that NTP is running and working properly. On all nodes run:
root # systemctl status ntpd.service
or
root # ntpq -p
Start all monitoring nodes and verify that there is no clock skew (see the example after this procedure):
root # systemctl start ceph-mon.target
Start all OSD nodes.
Start other Ceph services.
Start the SUSE OpenStack Cloud if you have it.
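To check for clock skew once the monitors are running, you can, for example, inspect the cluster health and the monitors' time synchronization status:
root # ceph -s
root # ceph time-sync-status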
20.5 Checking for Unbalanced Data Writing #
When data is written to OSDs evenly, the cluster is considered balanced. Each OSD within a cluster is assigned its weight. The weight is a relative number and tells Ceph how much of the data should be written to the related OSD. The higher the weight, the more data will be written. If an OSD has zero weight, no data will be written to it. If the weight of an OSD is relatively high compared to other OSDs, a large portion of the data will be written there, which makes the cluster unbalanced.
Unbalanced clusters have poor performance, and in the case that an OSD with a high weight suddenly crashes, a lot of data needs to be moved to other OSDs, which slows down the cluster as well.
To avoid this, you should regularly check OSDs for the amount of data writing. If the amount is between 30% and 50% of the capacity of a group of OSDs specified by a given rule set, you need to reweight the OSDs. Check for individual disks and find out which of them fill up faster than the others (or are generally slower), and lower their weight. The same is valid for OSDs where not enough data is written—you can increase their weight to have Ceph write more data to them. In the following example, you will find out the weight of an OSD with ID 13, and reweight it from 3 to 3.05:
$ ceph osd tree | grep osd.13
 13 3 osd.13 up 1
$ ceph osd crush reweight osd.13 3.05
reweighted item id 13 name 'osd.13' to 3.05 in crush map
$ ceph osd tree | grep osd.13
 13 3.05 osd.13 up 1
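To see how full individual OSDs are before deciding which ones to reweight, you can, for example, list the per-OSD utilization:
$ ceph osd df tree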
Tip: OSD Reweight by Utilization
The ceph osd reweight-by-utilization
threshold command automates the process of
reducing the weight of OSDs which are heavily overused. By default it will
adjust the weights downward on OSDs which reached 120% of the average
usage, but if you include threshold it will use that percentage instead.
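For example, to reduce the weight of OSDs that exceed 110% of the average utilization (the threshold of 110 is illustrative):
$ ceph osd reweight-by-utilization 110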
20.6 Btrfs Sub-volume for /var/lib/ceph #
SUSE Linux Enterprise by default is installed on a Btrfs partition. The directory
/var/lib/ceph should be excluded from Btrfs snapshots
and rollbacks, especially when a MON is running on the node. DeepSea
provides the fs runner that can set up a sub-volume for
this path.
20.6.1 Requirements for New Installation #
If you are setting up the cluster for the first time, the following requirements must be met before you can use the DeepSea runner:
Salt and DeepSea are properly installed and working according to this documentation.
salt-run state.orch ceph.stage.0 has been invoked to synchronize all the Salt and DeepSea modules to the minions.
Ceph is not yet installed, thus ceph.stage.3 has not yet been run and /var/lib/ceph does not yet exist.
20.6.2 Requirements for Existing Installation #
If your cluster is already installed, the following requirements must be met before you can use the DeepSea runner:
Nodes are upgraded to SUSE Enterprise Storage 5.5 and the cluster is under DeepSea control.
The Ceph cluster is up and healthy.
The upgrade process has synchronized the Salt and DeepSea modules to all minion nodes.
20.6.3 Automatic Setup #
On the Salt master run:
root@master # salt-run state.orch ceph.migrate.subvolume
On nodes without an existing /var/lib/ceph directory, this will, one node at a time:
create /var/lib/ceph as a @/var/lib/ceph Btrfs sub-volume.
mount the new sub-volume and update /etc/fstab appropriately.
disable copy-on-write for /var/lib/ceph.
On nodes with an existing Ceph installation, this will, one node at a time:
terminate running Ceph processes.
unmount OSDs on the node.
create @/var/lib/ceph Btrfs sub-volume and migrate existing /var/lib/ceph data.
mount the new sub-volume and update /etc/fstab appropriately.
disable copy-on-write for /var/lib/ceph/*, omitting /var/lib/ceph/osd/*.
re-mount OSDs.
re-start Ceph daemons.
20.6.4 Manual Setup #
This uses the new fs runner.
Inspect the state of /var/lib/ceph on all nodes and have the runner print suggestions about how to proceed:
root@master # salt-run fs.inspect_var
This will return one of the following commands:
salt-run fs.create_var
salt-run fs.migrate_var
salt-run fs.correct_var_attrs
Run the command that was returned in the previous step.
If an error occurs on one of the nodes, the execution stops for the other nodes as well and the runner tries to revert the changes it performed. Consult the log files on the affected minions to determine the cause. The runner can be re-run after the problem has been solved.
The command salt-run fs.help provides a list of all
runner and module commands for the fs module.
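To check the result on a node afterward, you can, for example, confirm that the sub-volume exists and, assuming copy-on-write was disabled via the file attribute (chattr +C), that the C attribute is set on the directory:
root # btrfs subvolume show /var/lib/ceph
root # lsattr -d /var/lib/ceph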
20.7 Increasing File Descriptors #
For OSD daemons, the read/write operations are critical to keep the Ceph cluster balanced. They often need to have many files open for reading and writing at the same time. On the OS level, the maximum number of simultaneously open files is called 'maximum number of file descriptors'.
To prevent OSDs from running out of file descriptors, you can override the
OS default value and specify the number in
/etc/ceph/ceph.conf, for example:
max_open_files = 131072
After you change max_open_files, you need to restart the
OSD service on the relevant Ceph node.
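To check the limit that is currently in effect for a running OSD daemon, you can, for example, inspect its limits in /proc (pidof -s returns a single PID; on nodes running several OSDs, pick the PID of the daemon you are interested in):
root # grep 'Max open files' /proc/$(pidof -s ceph-osd)/limits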
20.8 How to Use Existing Partitions for OSDs Including OSD Journals #
Important
This section describes an advanced topic that only storage experts and developers should examine. It is mostly needed when using non-standard OSD journal sizes. If the OSD partition's size is less than 10GB, its initial weight is rounded to 0, and because no data is then placed on it, you should increase its weight. We take no responsibility for overfilled journals.
If you need to use existing disk partitions as an OSD node, the OSD journal and data partitions need to be in a GPT partition table.
You need to set the correct partition types to the OSD partitions so that
udev recognizes them correctly and sets their
ownership to ceph:ceph.
For example, to set the partition type for the journal partition
/dev/vdb1 and data partition
/dev/vdb2, run the following:
root # sgdisk --typecode=1:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/vdb
root # sgdisk --typecode=2:4fbd7e29-9d25-41b8-afd0-062c0ceff05d /dev/vdb
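After changing the type codes, one possible way to check the result is to have the kernel re-read the partition table, re-trigger udev, and verify that the partitions are now owned by ceph:ceph:
root # partprobe /dev/vdb
root # udevadm trigger --subsystem-match=block --action=add
root # ls -l /dev/vdb1 /dev/vdb2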
Tip
The Ceph partition table types are listed in
/usr/lib/udev/rules.d/95-ceph-osd.rules:
cat /usr/lib/udev/rules.d/95-ceph-osd.rules
# OSD_UUID
ACTION=="add", SUBSYSTEM=="block", \
ENV{DEVTYPE}=="partition", \
ENV{ID_PART_ENTRY_TYPE}=="4fbd7e29-9d25-41b8-afd0-062c0ceff05d", \
OWNER:="ceph", GROUP:="ceph", MODE:="660", \
RUN+="/usr/sbin/ceph-disk --log-stdout -v trigger /dev/$name"
ACTION=="change", SUBSYSTEM=="block", \
ENV{ID_PART_ENTRY_TYPE}=="4fbd7e29-9d25-41b8-afd0-062c0ceff05d", \
OWNER="ceph", GROUP="ceph", MODE="660"
# JOURNAL_UUID
ACTION=="add", SUBSYSTEM=="block", \
ENV{DEVTYPE}=="partition", \
ENV{ID_PART_ENTRY_TYPE}=="45b0969e-9b03-4f30-b4c6-b4b80ceff106", \
OWNER:="ceph", GROUP:="ceph", MODE:="660", \
RUN+="/usr/sbin/ceph-disk --log-stdout -v trigger /dev/$name"
ACTION=="change", SUBSYSTEM=="block", \
ENV{ID_PART_ENTRY_TYPE}=="45b0969e-9b03-4f30-b4c6-b4b80ceff106", \
OWNER="ceph", GROUP="ceph", MODE="660"
[...]
20.9 Integration with Virtualization Software #
20.9.1 Storing KVM Disks in Ceph Cluster #
You can create a disk image for a KVM-driven virtual machine, store it in a
Ceph pool, optionally convert the content of an existing image to it, and
then run the virtual machine with qemu-kvm making use of
the disk image stored in the cluster. For more detailed information, see
Chapter 19, Ceph as a Back-end for QEMU KVM Instance.
20.9.2 Storing libvirt Disks in Ceph Cluster #
Similar to KVM (see Section 20.9.1, “Storing KVM Disks in Ceph Cluster”), you
can use Ceph to store virtual machines driven by libvirt. The advantage
is that you can run any libvirt-supported virtualization solution, such
as KVM, Xen, or LXC. For more information, see
Chapter 18, Using libvirt with Ceph.
20.9.3 Storing Xen Disks in Ceph Cluster #
One way to use Ceph for storing Xen disks is to make use of libvirt
as described in Chapter 18, Using libvirt with Ceph.
Another option is to make Xen talk to the rbd
block device driver directly:
If you have no disk image prepared for Xen, create a new one:
cephadm > rbd create myimage --size 8000 --pool mypool
List images in the pool mypool and check if your new image is there:
cephadm > rbd list mypool
Create a new block device by mapping the myimage image to the rbd kernel module:
cephadm > rbd map --pool mypool myimage
Tip: User Name and Authentication
To specify a user name, use --id user-name. Moreover, if you use cephx authentication, you must also specify a secret. It may come from a keyring or a file containing the secret:
cephadm > rbd map --pool rbd myimage --id admin --keyring /path/to/keyring
or
cephadm > rbd map --pool rbd myimage --id admin --keyfile /path/to/file
List all mapped devices:
cephadm > rbd showmapped
 id pool   image   snap device
 0  mypool myimage -    /dev/rbd0
Now you can configure Xen to use this device as a disk for running a virtual machine. For example, you can add the following line to the xl-style domain configuration file:
disk = [ '/dev/rbd0,,sda', '/dev/cdrom,,sdc,cdrom' ]
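If you later need to detach the image from the node, unmap the block device again, for example:
cephadm > rbd unmap /dev/rbd0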
20.10 Firewall Settings for Ceph #
Warning: DeepSea Stages Fail with Firewall
DeepSea deployment stages fail when the firewall is active, even if it is correctly configured. To pass the stages correctly, you need to either turn the firewall off by running
root # systemctl stop SuSEfirewall2.service
or set the FAIL_ON_WARNING option to 'False' in
/srv/pillar/ceph/stack/global.yml:
FAIL_ON_WARNING: False
We recommend protecting the network cluster communication with SUSE Firewall. You can edit its configuration by selecting YaST › Security and Users › Firewall › Allowed Services.
The following is a list of Ceph-related services and the port numbers they normally use. An example of opening several of these ports in the SuSEfirewall2 configuration follows the list.
- Ceph Monitor
Enable the service or port 6789 (TCP).
- Ceph OSD or Metadata Server
Enable the service, or ports 6800-7300 (TCP).
- iSCSI Gateway
Open port 3260 (TCP).
- Object Gateway
Open the port where Object Gateway communication occurs. It is set in /etc/ceph/ceph.conf on the line starting with rgw frontends =. Default is 80 for HTTP and 443 for HTTPS (TCP).
- NFS Ganesha
By default, NFS Ganesha uses ports 2049 (NFS service, TCP) and 875 (rquota support, TCP). Refer to Section 16.2.3, “Changing Default NFS Ganesha Ports” for more information on changing the default NFS Ganesha ports.
- Apache based services, such as openATTIC, SMT, or SUSE Manager
Open ports 80 for HTTP and 443 for HTTPS (TCP).
- SSH
Open port 22 (TCP).
- NTP
Open port 123 (UDP).
- Salt
Open ports 4505 and 4506 (TCP).
- Grafana
Open port 3000 (TCP).
- Prometheus
Open port 9100 (TCP).
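If you manage SuSEfirewall2 directly instead of through YaST, one possible way to open, for example, the Ceph Monitor and OSD/Metadata Server ports in the external zone is to add them to /etc/sysconfig/SuSEfirewall2 and restart the firewall (the zone and port selection are illustrative; adjust them to your setup):
FW_SERVICES_EXT_TCP="6789 6800:7300"
root # systemctl restart SuSEfirewall2.service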
20.11 Testing Network Performance #
To test network performance, the DeepSea net runner provides the following commands.
A simple ping to all nodes:
root@master # salt-run net.ping
Succeeded: 9 addresses from 9 minions average rtt 1.35 ms
A jumbo ping to all nodes:
root@master # salt-run net.jumbo_ping
Succeeded: 9 addresses from 9 minions average rtt 2.13 ms
A bandwidth test:
root@master # salt-run net.iperf
Fastest 2 hosts:
  |_ - 192.168.58.106 - 2981 Mbits/sec
  |_ - 192.168.58.107 - 2967 Mbits/sec
Slowest 2 hosts:
  |_ - 192.168.58.102 - 2857 Mbits/sec
  |_ - 192.168.58.103 - 2842 Mbits/sec
Tip: Stop 'iperf3' Processes Manually
When running a test using the net.iperf runner, the 'iperf3' server processes that are started do not stop automatically when a test is completed. To stop the processes, use the following command:
root@master # salt '*' multi.kill_iperf_cmd
20.12 Replacing Storage Disk #
If you need to replace a storage disk in a Ceph cluster, you can do so during the cluster's full operation. The replacement will cause temporary increase in data transfer.
If the disk fails entirely, Ceph needs to rewrite at least as much data as the capacity of the failed disk. If the disk is properly evacuated and then re-added to avoid loss of redundancy during the process, the amount of rewritten data will be twice as big. If the new disk has a different size than the replaced one, some additional data will be redistributed to even out the usage of all OSDs.
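As a rough illustration only (not the DeepSea-specific procedure; the OSD ID 5 is hypothetical), evacuating a failed OSD before physically replacing its disk could look like the following: mark the OSD out so that its data migrates to the remaining OSDs, wait for the cluster to rebalance, then stop the daemon and purge the OSD from the cluster before swapping the disk:
root@minion > ceph osd out 5
root@minion > systemctl stop ceph-osd@5.service
root@minion > ceph osd purge 5 --yes-i-really-mean-it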