33 Hints and Tips #
This chapter provides information to help you enhance the performance of your Ceph cluster and offers tips on setting the cluster up.
33.1 Identifying Orphaned Partitions #
To identify possibly orphaned journal/WAL/DB devices, follow these steps:
Pick the device that may have orphaned partitions and save the list of its partitions to a file:
root@minion > ls /dev/sdd?* > /tmp/partitions

Run readlink against all block.wal, block.db, and journal devices, and compare the output to the previously saved list of partitions:

root@minion > readlink -f /var/lib/ceph/osd/ceph-*/{block.wal,block.db,journal} \
 | sort | comm -23 /tmp/partitions -

The output is the list of partitions that are not used by Ceph.
Remove the orphaned partitions that do not belong to Ceph with your preferred command (for example, fdisk, parted, or sgdisk).
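For example, assuming sgdisk is used and partition 2 on /dev/sdd was identified as orphaned (both the device and the partition number are illustrative), the partition could be removed as follows:

root@minion > sgdisk --delete=2 /dev/sdd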
33.2 Adjusting Scrubbing #
By default, Ceph performs light scrubbing daily (find more details in Section 20.6, “Scrubbing”) and deep scrubbing weekly. Light scrubbing checks object sizes and checksums to ensure that placement groups are storing the same object data. Deep scrubbing checks an object’s content with that of its replicas to ensure that the actual contents are the same. The price for checking data integrity is increased I/O load on the cluster during the scrubbing procedure.
The default settings allow Ceph OSDs to initiate scrubbing at inappropriate times, such as during periods of heavy loads. Customers may experience latency and poor performance when scrubbing operations conflict with their operations. Ceph provides several scrubbing settings that can limit scrubbing to periods with lower loads or during off-peak hours.
If the cluster experiences high loads during the day and low loads late at night, consider restricting scrubbing to night time hours, such as 11pm till 6am:
[osd]
osd_scrub_begin_hour = 23
osd_scrub_end_hour = 6
If time restriction is not an effective method of determining a scrubbing
schedule, consider using the osd_scrub_load_threshold
option. The default value is 0.5, but it could be modified for low load
conditions:
[osd]
osd_scrub_load_threshold = 0.25
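These options are read from ceph.conf when the OSDs start. As a sketch, the same values can usually also be injected into running OSDs without a restart, for example:

cephadm@adm > ceph tell osd.* injectargs '--osd_scrub_load_threshold 0.25'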
33.3 Stopping OSDs without Rebalancing #
You may need to stop OSDs for maintenance periodically. If you do not want
CRUSH to automatically rebalance the cluster in order to avoid huge data
transfers, set the cluster to noout first:
root@minion > ceph osd set noout
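To confirm that the flag is set before you proceed, check the cluster flags, for example:

cephadm@adm > ceph osd dump | grep flags

The output should contain noout.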
When the cluster is set to noout, you can begin stopping
the OSDs within the failure domain that requires maintenance work:
root@minion > systemctl stop ceph-osd@OSD_NUMBER.service

Find more information in Section 16.1.2, “Starting, Stopping, and Restarting Individual Services”.
After you complete the maintenance, start OSDs again:
root@minion > systemctl start ceph-osd@OSD_NUMBER.service
After OSD services are started, unset the cluster from
noout:
cephadm@adm > ceph osd unset noout

33.4 Time Synchronization of Nodes #
Ceph requires precise time synchronization between all nodes.
We recommend synchronizing all Ceph cluster nodes with at least three reliable time sources that are located on the internal network. The internal time sources can point to a public time server or have their own time source.
Important: Public Time Servers
Do not synchronize all Ceph cluster nodes directly with remote public time servers. With such a configuration, each node in the cluster has its own NTP daemon that communicates continually over the Internet with a set of three or four time servers that may provide slightly different times. This solution introduces a large degree of latency variability that makes it difficult or impossible to keep the clock drift under 0.05 seconds, which is what the Ceph monitors require.
For details on how to set up the NTP server, refer to the SUSE Linux Enterprise Server Administration Guide.
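As a minimal sketch of such a setup, assuming three internal time servers named ntp1/ntp2/ntp3.example.com (the host names are illustrative), each cluster node's /etc/chrony.conf could contain:

server ntp1.example.com iburst
server ntp2.example.com iburst
server ntp3.example.com iburst

After editing the file, restart the service with systemctl restart chronyd.service.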
To change the time on your cluster, do the following:
Important: Setting Time
You may face a situation when you need to set the time back, for example if the time changes from summer time to standard time. We do not recommend moving the time backward by a period longer than the cluster downtime. Moving the time forward does not cause any trouble.
Procedure 33.1: Time Synchronization on the Cluster #
Stop all clients accessing the Ceph cluster, especially those using iSCSI.
Shut down your Ceph cluster. On each node run:
root # systemctl stop ceph.target

Note
If you use Ceph and SUSE OpenStack Cloud, also stop the SUSE OpenStack Cloud.
Verify that your NTP server is set up correctly—all chronyd daemons get their time from a source or sources in the local network.

Set the correct time on your NTP server.
Verify that NTP is running and working properly. On all nodes, run:
root # systemctl status chronyd.service

Start all monitoring nodes and verify that there is no clock skew:
root # systemctl start ceph.target

Start all OSD nodes.
Start other Ceph services.
Start the SUSE OpenStack Cloud if you have it.
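Once the services are running again, you can confirm that the monitors report no clock skew, for example by checking the overall cluster health and the monitor time synchronization report:

cephadm@adm > ceph status
cephadm@adm > ceph time-sync-status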
33.5 Checking for Unbalanced Data Writing #
When data is written to OSDs evenly, the cluster is considered balanced. Each OSD within a cluster is assigned its weight. The weight is a relative number and tells Ceph how much of the data should be written to the related OSD. The higher the weight, the more data will be written. If an OSD has zero weight, no data will be written to it. If the weight of an OSD is relatively high compared to other OSDs, a large portion of the data will be written there, which makes the cluster unbalanced.
Unbalanced clusters have poor performance, and in the case that an OSD with a high weight suddenly crashes, a lot of data needs to be moved to other OSDs, which slows down the cluster as well.
To avoid this, you should regularly check OSDs for the amount of data writing. If the amount is between 30% and 50% of the capacity of a group of OSDs specified by a given ruleset, you need to reweight the OSDs. Check for individual disks and find out which of them fill up faster than the others (or are generally slower), and lower their weight. The same is valid for OSDs where not enough data is written—you can increase their weight to have Ceph write more data to them. In the following example, you will find out the weight of an OSD with ID 13, and reweight it from 3 to 3.05:
$ ceph osd tree | grep osd.13
 13  3     osd.13  up  1

$ ceph osd crush reweight osd.13 3.05
reweighted item id 13 name 'osd.13' to 3.05 in crush map

$ ceph osd tree | grep osd.13
 13  3.05  osd.13  up  1
Tip: OSD Reweight by Utilization
The ceph osd reweight-by-utilization
threshold command automates the process of
reducing the weight of OSDs which are heavily overused. By default it will
adjust the weights downward on OSDs which reached 120% of the average
usage, but if you include threshold it will use that percentage instead.
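Before reweighting, it helps to review how full each OSD currently is, for example:

cephadm@adm > ceph osd df tree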
33.6 Btrfs Subvolume for /var/lib/ceph on Ceph Monitor Nodes #
SUSE Linux Enterprise is by default installed on a Btrfs partition. Ceph Monitors store their state
and database in the /var/lib/ceph directory. To prevent
corruption of a Ceph Monitor after a system rollback to a previous snapshot, create
a Btrfs subvolume for /var/lib/ceph. A dedicated
subvolume excludes the monitor data from snapshots of the root subvolume.
Tip
Create the /var/lib/ceph subvolume before running
DeepSea stage 0 because stage 0 installs Ceph related packages and
creates the /var/lib/ceph directory.
DeepSea stage 3 then verifies whether @/var/lib/ceph
is a Btrfs subvolume and fails if it is a normal directory.
33.6.1 Requirements #
33.6.1.1 New Deployments #
Salt and DeepSea need to be properly installed and working.
33.6.1.2 Existing Deployments #
If your cluster is already installed, the following requirements must be met:
Nodes are upgraded to SUSE Enterprise Storage 6 and cluster is under DeepSea control.
Ceph cluster is up and healthy.
Upgrade process has synchronized Salt and DeepSea modules to all minion nodes.
33.6.2 Steps Required during a New Cluster Deployment #
33.6.2.1 Before Running DeepSea stage 0 #
Prior to running DeepSea stage 0, apply the following commands to each of the Salt minions that will become Ceph Monitors:
root@master # salt 'MONITOR_NODES' saltutil.sync_all
root@master # salt 'MONITOR_NODES' state.apply ceph.subvolume
The ceph.subvolume command does the following:
Creates /var/lib/ceph as a @/var/lib/ceph Btrfs subvolume.

Mounts the new subvolume and updates /etc/fstab appropriately.
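To verify the result on a monitor node, you can, for example, inspect the subvolume and the corresponding /etc/fstab entry:

root # btrfs subvolume show /var/lib/ceph
root # grep /var/lib/ceph /etc/fstab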
33.6.2.2 DeepSea stage 3 Validation Fails #
If you forgot to run the commands mentioned in
Section 33.6.2.1, “Before Running DeepSea stage 0” before running stage 0, the
/var/lib/ceph subdirectory already exists, causing
DeepSea stage 3 validation failure. To convert it into a subvolume, do
the following:
Change directory to /var/lib:

cephadm@mon > cd /var/lib

Back up the current content of the ceph subdirectory:

cephadm@mon > sudo mv ceph ceph-

Create the subvolume, mount it, and update /etc/fstab:

root@master # salt 'MONITOR_NODES' state.apply ceph.subvolume

Change to the backup subdirectory, synchronize its content with the new subvolume, then remove it:

cephadm@mon > cd /var/lib/ceph-
cephadm@mon > rsync -av . ../ceph
cephadm@mon > cd ..
cephadm@mon > rm -rf ./ceph-
33.6.3 Steps Required during Cluster Upgrade #
On SUSE Enterprise Storage 5.5, the /var
directory is not on a Btrfs subvolume, but its subfolders (such as
/var/log or /var/cache) are Btrfs
subvolumes under '@'. Creating @/var/lib/ceph
subvolumes requires mounting the '@' subvolume first (it is not mounted by
default) and creating the @/var/lib/ceph subvolume
under it.
Following are example commands that illustrate the process:
root # mkdir -p /mnt/btrfs
root # mount -o subvol=@ ROOT_DEVICE /mnt/btrfs
root # btrfs subvolume create /mnt/btrfs/var/lib/ceph
root # umount /mnt/btrfs
At this point the @/var/lib/ceph subvolume is created
and you can continue as described in
Section 33.6.2, “Steps Required during a New Cluster Deployment”.
33.6.4 Manual Setup #
Automatic setup of the @/var/lib/ceph Btrfs subvolume
on the Ceph Monitor nodes may not be suitable for all scenarios. You can migrate
your /var/lib/ceph directory to a
@/var/lib/ceph subvolume by following these steps:
Terminate running Ceph processes.
Unmount OSDs on the node.
Change to the backup subdirectory, synchronize its content with the new subvolume, then remove it:
cephadm@mon > cd /var/lib/ceph-
cephadm@mon > rsync -av . ../ceph
cephadm@mon > cd ..
cephadm@mon > rm -rf ./ceph-

Remount OSDs.
Restart Ceph daemons.
33.6.5 For More Information #
Find more detailed information about manual setup in the file
/srv/salt/ceph/subvolume/README.md on the Salt master
node.
33.7 Increasing File Descriptors #
For OSD daemons, the read/write operations are critical to keep the Ceph cluster balanced. They often need to have many files open for reading and writing at the same time. On the OS level, the maximum number of simultaneously open files is called 'maximum number of file descriptors'.
To prevent OSDs from running out of file descriptors, you can override the
OS default value and specify the number in
/etc/ceph/ceph.conf, for example:
max_open_files = 131072
After you change max_open_files, you need to restart the
OSD service on the relevant Ceph node.
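To see the limit a running OSD process is actually using, you can inspect its limits in /proc. This is a generic Linux check, not a Ceph command; pgrep -o simply picks one ceph-osd process:

root # grep 'Max open files' /proc/$(pgrep -o ceph-osd)/limits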
33.8 Integration with Virtualization Software #
33.8.1 Storing KVM Disks in Ceph Cluster #
You can create a disk image for a KVM-driven virtual machine, store it in a
Ceph pool, optionally convert the content of an existing image to it, and
then run the virtual machine with qemu-kvm making use of
the disk image stored in the cluster. For more detailed information, see
Chapter 32, Ceph as a Back-end for QEMU KVM Instance.
33.8.2 Storing libvirt Disks in Ceph Cluster #
Similar to KVM (see Section 33.8.1, “Storing KVM Disks in Ceph Cluster”), you
can use Ceph to store virtual machines driven by libvirt. The advantage
is that you can run any libvirt-supported virtualization solution, such
as KVM, Xen, or LXC. For more information, see
Chapter 31, Using libvirt with Ceph.
33.8.3 Storing Xen Disks in Ceph Cluster #
One way to use Ceph for storing Xen disks is to make use of libvirt
as described in Chapter 31, Using libvirt with Ceph.
Another option is to make Xen talk to the rbd
block device driver directly:
If you have no disk image prepared for Xen, create a new one:

cephadm@adm > rbd create myimage --size 8000 --pool mypool

List images in the pool mypool and check if your new image is there:

cephadm@adm > rbd list mypool

Create a new block device by mapping the myimage image to the rbd kernel module:

cephadm@adm > rbd map --pool mypool myimage

Tip: User Name and Authentication

To specify a user name, use --id user-name. Moreover, if you use cephx authentication, you must also specify a secret. It may come from a keyring or a file containing the secret:

cephadm@adm > rbd map --pool rbd myimage --id admin --keyring /path/to/keyring

or

cephadm@adm > rbd map --pool rbd myimage --id admin --keyfile /path/to/file

List all mapped devices:

rbd showmapped
 id pool   image   snap device
 0  mypool myimage -    /dev/rbd0

Now you can configure Xen to use this device as a disk for running a virtual machine. You can, for example, add the following line to the xl-style domain configuration file:

disk = [ '/dev/rbd0,,sda', '/dev/cdrom,,sdc,cdrom' ]
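When the virtual machine no longer uses the image, the block device can be unmapped again, for example:

cephadm@adm > rbd unmap /dev/rbd0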
33.9 Firewall Settings for Ceph #
Warning: DeepSea Stages Fail with Firewall
DeepSea deployment stages fail when the firewall is active (and even when it is only configured). To pass the stages correctly, you need to either turn the firewall off by running
root # systemctl stop SuSEfirewall2.service
or set the FAIL_ON_WARNING option to 'False' in
/srv/pillar/ceph/stack/global.yml:
FAIL_ON_WARNING: False
We recommend protecting the network cluster communication with SUSE Firewall. You can edit its configuration by selecting YaST / Security and Users / Firewall / Allowed Services.
The following is a list of Ceph-related services and the port numbers that they normally use:
- Ceph Monitor
Enable the service or port 6789 (TCP).
- Ceph OSD or Metadata Server
Enable the service or ports 6800-7300 (TCP).
- iSCSI Gateway
Open port 3260 (TCP).
- Object Gateway
Open the port where Object Gateway communication occurs. It is set in /etc/ceph.conf on the line starting with rgw frontends =. Default is 80 for HTTP and 443 for HTTPS (TCP).
- NFS Ganesha
By default, NFS Ganesha uses ports 2049 (NFS service, TCP) and 875 (rquota support, TCP). Refer to Section 30.2.1.4, “Changing Default NFS Ganesha Ports” for more information on changing the default NFS Ganesha ports.
- Apache based services, such as SMT, or SUSE Manager
Open ports 80 for HTTP and 443 for HTTPS (TCP).
- SSH
Open port 22 (TCP).
- NTP
Open port 123 (UDP).
- Salt
Open ports 4505 and 4506 (TCP).
- Grafana
Open port 3000 (TCP).
- Prometheus
Open port 9100 (TCP).
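As an illustration only, not a complete configuration, opening the Ceph Monitor and OSD ports on a node protected by SuSEfirewall2 could look like the following entry in /etc/sysconfig/SuSEfirewall2; adjust the port list to the roles running on the node:

FW_SERVICES_EXT_TCP="22 6789 6800:7300"

Afterward, restart the firewall with systemctl restart SuSEfirewall2.service.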
33.10 Testing Network Performance #
To test the network performance, DeepSea's net runner
provides the following commands:
A simple ping to all nodes:
root@master # salt-run net.ping
Succeeded: 9 addresses from 9 minions average rtt 1.35 ms

A jumbo ping to all nodes:

root@master # salt-run net.jumbo_ping
Succeeded: 9 addresses from 9 minions average rtt 2.13 ms

A bandwidth test:

root@master # salt-run net.iperf
Fastest 2 hosts:
 |_ - 192.168.58.106 - 2981 Mbits/sec
 |_ - 192.168.58.107 - 2967 Mbits/sec
Slowest 2 hosts:
 |_ - 192.168.58.102 - 2857 Mbits/sec
 |_ - 192.168.58.103 - 2842 Mbits/sec

Tip: Stop 'iperf3' Processes Manually

When running a test using the net.iperf runner, the 'iperf3' server processes that are started do not stop automatically when a test is completed. To stop the processes, use the following runner:

root@master # salt '*' multi.kill_iperf_cmd
33.11 How to Locate Physical Disks Using LED Lights #
This section describes using libstoragemgmt and/or
third party tools to adjust the LED lights on physical disks. This
capability may not be available for all hardware platforms.
Matching an OSD disk to a physical disk can be challenging, especially on
nodes with a high density of disks. Some hardware environments include LED
lights that can be adjusted via software to flash or illuminate a different
color for identification purposes. SUSE Enterprise Storage offers support for this
capability through Salt, libstoragemgmt, and
third party tools specific to the hardware in use. The configuration for
this capability is defined in the
/srv/pillar/ceph/disk_led.sls Salt pillar:
root # cat /srv/pillar/ceph/disk_led.sls
# This is the default configuration for the storage enclosure LED blinking.
# The placeholder {device_file} will be replaced with the device file of
# the disk when the command is executed.
#
# Have a look into the /srv/pillar/ceph/README file to find out how to
# customize this configuration per minion/host.
disk_led:
cmd:
ident:
'on': lsmcli local-disk-ident-led-on --path '{device_file}'
'off': lsmcli local-disk-ident-led-off --path '{device_file}'
fault:
'on': lsmcli local-disk-fault-led-on --path '{device_file}'
'off': lsmcli local-disk-fault-led-off --path '{device_file}'
The default configuration for disk_led.sls offers disk
LED support through the libstoragemgmt layer. The
libstoragemgmt layer provides this support through
a hardware-specific plug-in and third party tools. The default behavior can
be customized using DeepSea by adding the following configuration for the ledctl tool (part of the ledmon package) to /srv/pillar/ceph/stack/global.yml (or any other YAML file in /stack/ceph/):
disk_led:
cmd:
ident:
'on': ledctl locate='{device_file}'
'off': ledctl locate_off='{device_file}'
fault:
'on': ledctl locate='{device_file}'
'off': ledctl locate_off='{device_file}'
If the customization should only apply to a specific node (minion), then the file stack/ceph/minions/{{minion}}.yml needs to be used (see the example below).
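For example, a per-minion override that switches to ledctl might look like this; the minion name and the full path under /srv/pillar/ceph/stack/ceph/minions/ are illustrative:

root # cat /srv/pillar/ceph/stack/ceph/minions/osd1.example.com.yml
disk_led:
  cmd:
    ident:
      'on': ledctl locate='{device_file}'
      'off': ledctl locate_off='{device_file}'
    fault:
      'on': ledctl locate='{device_file}'
      'off': ledctl locate_off='{device_file}'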
With or without libstoragemgmt, third party tools
may be required to adjust LED lights. These third party tools are available
through various hardware vendors. Some of the common vendors and tools are:
Table 33.1: Third Party Storage Tools #
| Vendor/Disk Controller | Tool |
|---|---|
| HPE SmartArray | hpssacli |
| LSI MegaRAID | storcli |
SUSE Linux Enterprise Server also provides the ledmon package and
ledctl tool. This tool may also work for hardware
environments utilizing Intel storage enclosures. Proper syntax when using
this tool is as follows:
root # cat /srv/pillar/ceph/disk_led.sls
disk_led:
cmd:
ident:
'on': ledctl locate='{device_file}'
'off': ledctl locate_off='{device_file}'
fault:
'on': ledctl locate='{device_file}'
'off': ledctl locate_off='{device_file}'

If you are on supported hardware, with all required third party tools, LEDs can be enabled or disabled using the following command syntax from the Salt master node:
root # salt-run disk_led.device NODE DISK fault|ident on|off
For example, to enable or disable LED identification or fault lights on
/dev/sdd on OSD node srv16.ceph,
run the following:
root # salt-run disk_led.device srv16.ceph sdd ident on
root # salt-run disk_led.device srv16.ceph sdd ident off
root # salt-run disk_led.device srv16.ceph sdd fault on
root # salt-run disk_led.device srv16.ceph sdd fault off
Note: Device Naming
The device name used in the salt-run command needs to
match the name recognized by Salt. The following command can be used to
display these names:
root@master # salt 'minion_name' grains.get disks
In many environments, the /srv/pillar/ceph/disk_led.sls
configuration will require changes in order to adjust the LED lights for
specific hardware needs. Simple changes may be performed by replacing
lsmcli with another tool, or adjusting command line
parameters. Complex changes may be accomplished by calling an external
script in place of the lsmcli command. When making any
changes to /srv/pillar/ceph/disk_led.sls, follow these
steps:
Make required changes to /srv/pillar/ceph/disk_led.sls on the Salt master node.

Verify that the changes are reflected correctly in the pillar data:

root # salt 'SALT MASTER*' pillar.get disk_led

Refresh the pillar data on all nodes using:

root # salt '*' saltutil.pillar_refresh
It is possible to use an external script to directly use third-party tools
to adjust LED lights. The following examples show how to adjust
/srv/pillar/ceph/disk_led.sls to support an external
script, and two sample scripts for HP and LSI environments.
Modified /srv/pillar/ceph/disk_led.sls which calls an
external script:
root # cat /srv/pillar/ceph/disk_led.sls
disk_led:
cmd:
ident:
'on': /usr/local/bin/flash_led.sh '{device_file}' on
'off': /usr/local/bin/flash_led.sh '{device_file}' off
fault:
'on': /usr/local/bin/flash_led.sh '{device_file}' on
'off': /usr/local/bin/flash_led.sh '{device_file}' off
Sample script for flashing LED lights on HP hardware using the
hpssacli utilities:
root # cat /usr/local/bin/flash_led_hp.sh
#!/bin/bash
# params:
# $1 device (e.g. /dev/sda)
# $2 on|off
FOUND=0
MAX_CTRLS=10
MAX_DISKS=50
for i in $(seq 0 $MAX_CTRLS); do
# Search for valid controllers
if hpssacli ctrl slot=$i show summary >/dev/null; then
# Search all disks on the current controller
for j in $(seq 0 $MAX_DISKS); do
if hpssacli ctrl slot=$i ld $j show | grep -q $1; then
FOUND=1
echo "Found $1 on ctrl=$i, ld=$j. Turning LED $2."
hpssacli ctrl slot=$i ld $j modify led=$2
break;
fi
done
[[ "$FOUND" = "1" ]] && break
fi
done
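Assuming the script is saved as /usr/local/bin/flash_led_hp.sh and marked executable, it takes the device and the desired LED state as arguments, for example:

root # chmod +x /usr/local/bin/flash_led_hp.sh
root # /usr/local/bin/flash_led_hp.sh /dev/sda on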
Sample script for flashing LED lights on LSI hardware using the
storcli utilities:
root # cat /usr/local/bin/flash_led_lsi.sh
#!/bin/bash
# params:
# $1 device (e.g. /dev/sda)
# $2 on|off
[[ "$2" = "on" ]] && ACTION="start" || ACTION="stop"
# Determine serial number for the disk
SERIAL=$(lshw -class disk | grep -A2 $1 | grep serial | awk '{print $NF}')
if [ ! -z "$SERIAL" ]; then
# Search for disk serial number across all controllers and enclosures
DEVICE=$(/opt/MegaRAID/storcli/storcli64 /call/eall/sall show all | grep -B6 $SERIAL | grep Drive | awk '{print $2}')
if [ ! -z "$DEVICE" ]; then
echo "Found $1 on device $DEVICE. Turning LED $2."
/opt/MegaRAID/storcli/storcli64 $DEVICE $ACTION locate
else
echo "Device not found!"
exit -1
fi
else
echo "Disk serial number not found!"
exit -1
fi