20 Hints and Tips #
This chapter provides information to help you enhance the performance of your Ceph cluster and offers tips on setting the cluster up.
20.1 Identifying Orphaned Partitions #
To identify possibly orphaned journal/WAL/DB devices, follow these steps:
Pick the device that may have orphaned partitions and save the list of its partitions to a file:
root@minion > ls /dev/sdd?* > /tmp/partitions
Run readlink against all block.wal, block.db, and journal devices, and compare the output to the previously saved list of partitions:
root@minion > readlink -f /var/lib/ceph/osd/ceph-*/{block.wal,block.db,journal} \
 | sort | comm -23 /tmp/partitions -
The output is the list of partitions that are not used by Ceph.
Remove the orphaned partitions that do not belong to Ceph with your preferred command (for example fdisk, parted, or sgdisk).
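For example, assuming the comparison above showed that partition 3 on /dev/sdd is orphaned (both the device and the partition number are illustrative only), it could be deleted with sgdisk:
root@minion > sgdisk --delete=3 /dev/sdd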
20.2 Adjusting Scrubbing #
By default, Ceph performs light scrubbing (find more details in Section 7.5, “Scrubbing”) daily and deep scrubbing weekly. Light scrubbing checks object sizes and checksums to ensure that placement groups are storing the same object data. Deep scrubbing compares an object's content with that of its replicas to ensure that the actual contents are the same. The price for checking data integrity is increased I/O load on the cluster during the scrubbing procedure.
The default settings allow Ceph OSDs to initiate scrubbing at inappropriate times, such as during periods of heavy loads. Customers may experience latency and poor performance when scrubbing operations conflict with their operations. Ceph provides several scrubbing settings that can limit scrubbing to periods with lower loads or during off-peak hours.
If the cluster experiences high loads during the day and low loads late at night, consider restricting scrubbing to night time hours, such as 11pm till 6am:
[osd]
osd_scrub_begin_hour = 23
osd_scrub_end_hour = 6
If time restriction is not an effective method of determining a scrubbing
schedule, consider using the osd_scrub_load_threshold
option. The default value is 0.5, but it could be modified for low load
conditions:
[osd]
osd_scrub_load_threshold = 0.25
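If you want to apply the new values to running OSDs without waiting for a restart, you can, for example, inject them at runtime (the values match the examples above):
cephadm > ceph tell osd.* injectargs '--osd_scrub_begin_hour 23 --osd_scrub_end_hour 6'
cephadm > ceph tell osd.* injectargs '--osd_scrub_load_threshold 0.25'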
20.3 Stopping OSDs without Rebalancing #
You may need to stop OSDs for maintenance periodically. If you want to prevent CRUSH from automatically rebalancing the cluster and thus avoid large data transfers, set the cluster's noout flag first:
root@minion > ceph osd set noout
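You can verify that the flag is set, for example by listing the cluster flags; the output should include noout:
root@minion > ceph osd dump | grep flags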
When the cluster is set to noout, you can begin stopping
the OSDs within the failure domain that requires maintenance work:
root@minion > systemctl stop ceph-osd@OSD_NUMBER.service
Find more information in Section 3.1.2, “Starting, Stopping, and Restarting Individual Services”.
After you complete the maintenance, start OSDs again:
root@minion > systemctl start ceph-osd@OSD_NUMBER.service
After the OSD services are started, unset the cluster's noout flag:
cephadm > ceph osd unset noout
20.4 Time Synchronization of Nodes #
Ceph requires precise time synchronization between all nodes.
We recommend synchronizing all Ceph cluster nodes with at least three reliable time sources that are located on the internal network. The internal time sources can point to a public time server or have their own time source.
Important: Public Time Servers
Do not synchronize all Ceph cluster nodes directly with remote public time servers. With such a configuration, each node in the cluster has its own NTP daemon that communicates continually over the Internet with a set of three or four time servers that may provide slightly different time. This solution introduces a large degree of latency variability that makes it difficult or impossible to keep the clock drift under 0.05 seconds which is what the Ceph monitors require.
For details on how to set up the NTP server, refer to the SUSE Linux Enterprise Server Administration Guide.
Then, to change the time on your cluster, do the following:
Important: Setting Time
You may face a situation when you need to set the time back, for example if the time changes from summer to standard time. We do not recommend moving the time backward by a period longer than the cluster's downtime. Moving the time forward does not cause any trouble.
Procedure 20.1: Time Synchronization on the Cluster #
Stop all clients accessing the Ceph cluster, especially those using iSCSI.
Shut down your Ceph cluster. On each node run:
root # systemctl stop ceph.target
Note
If you use Ceph and SUSE OpenStack Cloud, also stop SUSE OpenStack Cloud.
Verify that your NTP server is set up correctly—all ntpd daemons get their time from a source or sources in the local network.
Set the correct time on your NTP server.
Verify that NTP is running and working properly. On all nodes run:
root # systemctl status ntpd.service
or
root # ntpq -p
Start all monitoring nodes and verify that there is no clock skew (see the example after this procedure):
root # systemctl start ceph-mon.target
Start all OSD nodes.
Start other Ceph services.
Start the SUSE OpenStack Cloud if you have it.
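To check for clock skew once the monitors are running, you can, for example, inspect the cluster health and the monitors' time synchronization status:
root # ceph -s
root # ceph time-sync-status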
20.5 Checking for Unbalanced Data Writing #
When data is written to OSDs evenly, the cluster is considered balanced. Each OSD within a cluster is assigned its weight. The weight is a relative number and tells Ceph how much of the data should be written to the related OSD. The higher the weight, the more data will be written. If an OSD has zero weight, no data will be written to it. If the weight of an OSD is relatively high compared to other OSDs, a large portion of the data will be written there, which makes the cluster unbalanced.
Unbalanced clusters have poor performance, and in the case that an OSD with a high weight suddenly crashes, a lot of data needs to be moved to other OSDs, which slows down the cluster as well.
To avoid this, you should regularly check OSDs for the amount of data writing. If the amount is between 30% and 50% of the capacity of a group of OSDs specified by a given rule set, you need to reweight the OSDs. Check for individual disks and find out which of them fill up faster than the others (or are generally slower), and lower their weight. The same is valid for OSDs where not enough data is written—you can increase their weight to have Ceph write more data to them. In the following example, you will find out the weight of an OSD with ID 13, and reweight it from 3 to 3.05:
$ ceph osd tree | grep osd.13
 13 3 osd.13 up 1
$ ceph osd crush reweight osd.13 3.05
reweighted item id 13 name 'osd.13' to 3.05 in crush map
$ ceph osd tree | grep osd.13
 13 3.05 osd.13 up 1
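To see how full individual OSDs are before deciding which ones to reweight, you can, for example, list the per-OSD utilization:
$ ceph osd df tree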
Tip: OSD Reweight by Utilization
The ceph osd reweight-by-utilization
threshold command automates the process of
reducing the weight of OSDs which are heavily overused. By default it will
adjust the weights downward on OSDs which reached 120% of the average
usage, but if you include threshold it will use that percentage instead.
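For example, to reduce the weight of OSDs that exceed 110% of the average utilization (the threshold of 110 is illustrative):
$ ceph osd reweight-by-utilization 110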
20.6 Btrfs Sub-volume for /var/lib/ceph #
SUSE Linux Enterprise by default is installed on a Btrfs partition. The directory
/var/lib/ceph should be excluded from Btrfs snapshots
and rollbacks, especially when a MON is running on the node. DeepSea
provides the fs runner that can set up a sub-volume for
this path.
20.6.1 Requirements for New Installation #
If you are setting up the cluster for the first time, the following requirements must be met before you can use the DeepSea runner:
Salt and DeepSea are properly installed and working according to this documentation.
salt-run state.orch ceph.stage.0 has been invoked to synchronize all the Salt and DeepSea modules to the minions.
Ceph is not yet installed, thus ceph.stage.3 has not yet been run and /var/lib/ceph does not yet exist.
20.6.2 Requirements for Existing Installation #
If your cluster is already installed, the following requirements must be met before you can use the DeepSea runner:
Nodes are upgraded to SUSE Enterprise Storage 5.5 and the cluster is under DeepSea control.
The Ceph cluster is up and healthy.
The upgrade process has synchronized the Salt and DeepSea modules to all minion nodes.
20.6.3 Automatic Setup #
On the Salt master run:
root@master # salt-run state.orch ceph.migrate.subvolume
On nodes without an existing /var/lib/ceph directory, this will, one node at a time:
create /var/lib/ceph as a @/var/lib/ceph Btrfs sub-volume.
mount the new sub-volume and update /etc/fstab appropriately.
disable copy-on-write for /var/lib/ceph.
On nodes with an existing Ceph installation, this will, one node at a time:
terminate running Ceph processes.
unmount OSDs on the node.
create @/var/lib/ceph Btrfs sub-volume and migrate existing /var/lib/ceph data.
mount the new sub-volume and update /etc/fstab appropriately.
disable copy-on-write for /var/lib/ceph/*, omitting /var/lib/ceph/osd/*.
re-mount OSDs.
re-start Ceph daemons.
20.6.4 Manual Setup #
This uses the new fs runner.
Inspect the state of /var/lib/ceph on all nodes and have the runner print suggestions about how to proceed:
root@master # salt-run fs.inspect_var
This will return one of the following commands:
salt-run fs.create_var
salt-run fs.migrate_var
salt-run fs.correct_var_attrs
Run the command that was returned in the previous step.
If an error occurs on one of the nodes, the execution stops for the other nodes as well and the runner tries to revert the changes it performed. Consult the log files on the affected minions to determine the cause. The runner can be re-run after the problem has been solved.
The command salt-run fs.help provides a list of all
runner and module commands for the fs module.
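To check the result on a node afterward, you can, for example, confirm that the sub-volume exists and, assuming copy-on-write was disabled via the file attribute (chattr +C), that the C attribute is set on the directory:
root # btrfs subvolume show /var/lib/ceph
root # lsattr -d /var/lib/ceph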
20.7 Increasing File Descriptors #
For OSD daemons, the read/write operations are critical to keep the Ceph cluster balanced. They often need to have many files open for reading and writing at the same time. On the OS level, the maximum number of simultaneously open files is called 'maximum number of file descriptors'.
To prevent OSDs from running out of file descriptors, you can override the
OS default value and specify the number in
/etc/ceph/ceph.conf, for example:
max_open_files = 131072
After you change max_open_files, you need to restart the
OSD service on the relevant Ceph node.
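To check the limit that is currently in effect for a running OSD daemon, you can, for example, inspect its limits in /proc (pidof -s returns a single PID; on nodes running several OSDs, pick the PID of the daemon you are interested in):
root # grep 'Max open files' /proc/$(pidof -s ceph-osd)/limits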
20.8 How to Use Existing Partitions for OSDs Including OSD Journals #
Important
This section describes an advanced topic that only storage experts and developers should examine. It is mostly needed when using non-standard OSD journal sizes. If the OSD partition's size is less than 10GB, its initial weight is rounded to 0, and because no data is then placed on it, you should increase its weight. We take no responsibility for overfilled journals.
If you need to use existing disk partitions as an OSD node, the OSD journal and data partitions need to be in a GPT partition table.
You need to set the correct partition types to the OSD partitions so that
udev recognizes them correctly and sets their
ownership to ceph:ceph.
For example, to set the partition type for the journal partition
/dev/vdb1 and data partition
/dev/vdb2, run the following:
root # sgdisk --typecode=1:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/vdb
root # sgdisk --typecode=2:4fbd7e29-9d25-41b8-afd0-062c0ceff05d /dev/vdb
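After changing the type codes, one possible way to check the result is to have the kernel re-read the partition table, re-trigger udev, and verify that the partitions are now owned by ceph:ceph:
root # partprobe /dev/vdb
root # udevadm trigger --subsystem-match=block --action=add
root # ls -l /dev/vdb1 /dev/vdb2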
Tip
The Ceph partition table types are listed in
/usr/lib/udev/rules.d/95-ceph-osd.rules:
cat /usr/lib/udev/rules.d/95-ceph-osd.rules
# OSD_UUID
ACTION=="add", SUBSYSTEM=="block", \
ENV{DEVTYPE}=="partition", \
ENV{ID_PART_ENTRY_TYPE}=="4fbd7e29-9d25-41b8-afd0-062c0ceff05d", \
OWNER:="ceph", GROUP:="ceph", MODE:="660", \
RUN+="/usr/sbin/ceph-disk --log-stdout -v trigger /dev/$name"
ACTION=="change", SUBSYSTEM=="block", \
ENV{ID_PART_ENTRY_TYPE}=="4fbd7e29-9d25-41b8-afd0-062c0ceff05d", \
OWNER="ceph", GROUP="ceph", MODE="660"
# JOURNAL_UUID
ACTION=="add", SUBSYSTEM=="block", \
ENV{DEVTYPE}=="partition", \
ENV{ID_PART_ENTRY_TYPE}=="45b0969e-9b03-4f30-b4c6-b4b80ceff106", \
OWNER:="ceph", GROUP:="ceph", MODE:="660", \
RUN+="/usr/sbin/ceph-disk --log-stdout -v trigger /dev/$name"
ACTION=="change", SUBSYSTEM=="block", \
ENV{ID_PART_ENTRY_TYPE}=="45b0969e-9b03-4f30-b4c6-b4b80ceff106", \
OWNER="ceph", GROUP="ceph", MODE="660"
[...]
20.9 Integration with Virtualization Software #
20.9.1 Storing KVM Disks in Ceph Cluster #
You can create a disk image for a KVM-driven virtual machine, store it in a
Ceph pool, optionally convert the content of an existing image to it, and
then run the virtual machine with qemu-kvm making use of
the disk image stored in the cluster. For more detailed information, see
Chapter 19, Ceph as a Back-end for QEMU KVM Instance.
20.9.2 Storing libvirt Disks in Ceph Cluster #
Similar to KVM (see Section 20.9.1, “Storing KVM Disks in Ceph Cluster”), you
can use Ceph to store virtual machines driven by libvirt. The advantage
is that you can run any libvirt-supported virtualization solution, such
as KVM, Xen, or LXC. For more information, see
Chapter 18, Using libvirt with Ceph.
20.9.3 Storing Xen Disks in Ceph Cluster #
One way to use Ceph for storing Xen disks is to make use of libvirt
as described in Chapter 18, Using libvirt with Ceph.
Another option is to make Xen talk to the rbd
block device driver directly:
If you have no disk image prepared for Xen, create a new one:
cephadm > rbd create myimage --size 8000 --pool mypool
List images in the pool mypool and check if your new image is there:
cephadm > rbd list mypool
Create a new block device by mapping the myimage image to the rbd kernel module:
cephadm > rbd map --pool mypool myimage
Tip: User Name and Authentication
To specify a user name, use --id user-name. Moreover, if you use cephx authentication, you must also specify a secret. It may come from a keyring or a file containing the secret:
cephadm > rbd map --pool rbd myimage --id admin --keyring /path/to/keyring
or
cephadm > rbd map --pool rbd myimage --id admin --keyfile /path/to/file
List all mapped devices:
cephadm > rbd showmapped
 id pool   image   snap device
 0  mypool myimage -    /dev/rbd0
Now you can configure Xen to use this device as a disk for running a virtual machine. For example, you can add the following line to the xl-style domain configuration file:
disk = [ '/dev/rbd0,,sda', '/dev/cdrom,,sdc,cdrom' ]
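If you later need to detach the image from the node, unmap the block device again, for example:
cephadm > rbd unmap /dev/rbd0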
20.10 Firewall Settings for Ceph #
Warning: DeepSea Stages Fail with Firewall
DeepSea deployment stages fail when the firewall is active, even if it is correctly configured. To pass the stages correctly, you need to either turn the firewall off by running
root # systemctl stop SuSEfirewall2.service
or set the FAIL_ON_WARNING option to 'False' in
/srv/pillar/ceph/stack/global.yml:
FAIL_ON_WARNING: False
We recommend protecting the network cluster communication with SUSE Firewall. You can edit its configuration by selecting YaST › Security and Users › Firewall › Allowed Services.
The following is a list of Ceph-related services and the port numbers they normally use. An example of opening several of these ports in the SuSEfirewall2 configuration follows the list.
- Ceph Monitor
Enable the service or port 6789 (TCP).
- Ceph OSD or Metadata Server
Enable the service, or ports 6800-7300 (TCP).
- iSCSI Gateway
Open port 3260 (TCP).
- Object Gateway
Open the port where Object Gateway communication occurs. It is set in /etc/ceph/ceph.conf on the line starting with rgw frontends =. Default is 80 for HTTP and 443 for HTTPS (TCP).
- NFS Ganesha
By default, NFS Ganesha uses ports 2049 (NFS service, TCP) and 875 (rquota support, TCP). Refer to Section 16.2.3, “Changing Default NFS Ganesha Ports” for more information on changing the default NFS Ganesha ports.
- Apache based services, such as openATTIC, SMT, or SUSE Manager
Open ports 80 for HTTP and 443 for HTTPS (TCP).
- SSH
Open port 22 (TCP).
- NTP
Open port 123 (UDP).
- Salt
Open ports 4505 and 4506 (TCP).
- Grafana
Open port 3000 (TCP).
- Prometheus
Open port 9100 (TCP).
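If you manage SuSEfirewall2 directly instead of through YaST, one possible way to open, for example, the Ceph Monitor and OSD/Metadata Server ports in the external zone is to add them to /etc/sysconfig/SuSEfirewall2 and restart the firewall (the zone and port selection are illustrative; adjust them to your setup):
FW_SERVICES_EXT_TCP="6789 6800:7300"
root # systemctl restart SuSEfirewall2.service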
20.11 Testing Network Performance #
To test network performance, the DeepSea net runner provides the following commands.
A simple ping to all nodes:
root@master # salt-run net.ping
Succeeded: 9 addresses from 9 minions average rtt 1.35 ms
A jumbo ping to all nodes:
root@master # salt-run net.jumbo_ping
Succeeded: 9 addresses from 9 minions average rtt 2.13 ms
A bandwidth test:
root@master # salt-run net.iperf
Fastest 2 hosts:
  |_ - 192.168.58.106 - 2981 Mbits/sec
  |_ - 192.168.58.107 - 2967 Mbits/sec
Slowest 2 hosts:
  |_ - 192.168.58.102 - 2857 Mbits/sec
  |_ - 192.168.58.103 - 2842 Mbits/sec
Tip: Stop 'iperf3' Processes Manually
When running a test using the net.iperf runner, the 'iperf3' server processes that are started do not stop automatically when a test is completed. To stop the processes, use the following command:
root@master # salt '*' multi.kill_iperf_cmd
20.12 Replacing Storage Disk #
If you need to replace a storage disk in a Ceph cluster, you can do so during the cluster's full operation. The replacement will cause temporary increase in data transfer.
If the disk fails entirely, Ceph needs to rewrite at least as much data as the capacity of the failed disk. If the disk is properly evacuated and then re-added to avoid loss of redundancy during the process, the amount of rewritten data will be twice as big. If the new disk has a different size than the replaced one, some additional data will be redistributed to even out the usage of all OSDs.
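As a rough illustration only (not the DeepSea-specific procedure; the OSD ID 5 is hypothetical), evacuating a failed OSD before physically replacing its disk could look like the following: mark the OSD out so that its data migrates to the remaining OSDs, wait for the cluster to rebalance, then stop the daemon and purge the OSD from the cluster before swapping the disk:
root@minion > ceph osd out 5
root@minion > systemctl stop ceph-osd@5.service
root@minion > ceph osd purge 5 --yes-i-really-mean-it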