5 Operating System Level Tuning #
A significant amount of the performance tuning for Ceph clusters can be done at the operating system (OS) layer. This tuning ranges from ensuring that unnecessary services are not running down to ensuring that buffers are not being overrun and that interrupts are spread properly across cores. Many additional OS tuning options are not included here, either because they produced statistically insignificant performance changes or because they were not deemed candidates for significant performance improvement at the time of this work.
5.1 SUSE Linux Enterprise Install and Validation of Base Performance #
During the OS installation, do not select an install
pattern that includes an X Server. Doing so consumes RAM and CPU resources
that are better left to the storage-related daemons. We
recommend the minimal pattern with the addition of
the YaST management pattern.
After the OS is installed, evaluate the individual components that are critical to the overall performance of the storage cluster.
Note
Check the performance of individual components before SUSE Enterprise Storage is set up, to ensure they are performing as desired.
5.1.1 Network Performance #
To perform iperf3 tests for network performance,
consider increasing the window size (with the -w option)
and running multiple streams to fully test the bandwidth capability. If
using the standard MTU, your NICs should be capable of running at
approximately 70-80% of the advertised bandwidth. If you move up to jumbo
frames, the NIC should be able to saturate the link.
This is not necessarily true for faster topologies such as 100 Gb. In those topologies, saturating the NIC can require substantial OS and driver tuning, in combination with ensuring the hardware has the appropriate CPU clock speeds and settings.
This is a sample of the iperf3 commands used on a
100 Gb network. In the command line, the -N option disables
Nagle's algorithm and the -l option sets the buffer
length higher than the default of 128 kB, resulting in slightly higher
throughput.
server# iperf3 -s
client# iperf3 -c server -N -l 256k
Connecting to host sr650-1, port 5201
[  4] local 172.16.227.22 port 36628 connected to 172.16.227.21 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec  4.76 GBytes  40.9 Gbits/sec    0   1.48 MBytes
[  4]   1.00-2.00   sec  4.79 GBytes  41.1 Gbits/sec    0   2.52 MBytes
[  4]   2.00-3.00   sec  4.73 GBytes  40.6 Gbits/sec    0   2.52 MBytes
[  4]   3.00-4.00   sec  4.73 GBytes  40.6 Gbits/sec    0   2.52 MBytes
[  4]   4.00-5.00   sec  4.74 GBytes  40.7 Gbits/sec    0   2.52 MBytes
[  4]   5.00-6.00   sec  4.73 GBytes  40.6 Gbits/sec    0   2.52 MBytes
[  4]   6.00-7.00   sec  4.72 GBytes  40.6 Gbits/sec    0   2.52 MBytes
[  4]   7.00-8.00   sec  4.72 GBytes  40.6 Gbits/sec    0   2.52 MBytes
[  4]   8.00-9.00   sec  4.73 GBytes  40.7 Gbits/sec    0   2.52 MBytes
[  4]   9.00-10.00  sec  4.73 GBytes  40.6 Gbits/sec    0   2.52 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  47.4 GBytes  40.7 Gbits/sec    0   sender
[  4]   0.00-10.00  sec  47.4 GBytes  40.7 Gbits/sec         receiver
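If a single stream does not reach the expected bandwidth, a multi-stream run with a larger window can also be tried. The sketch below is illustrative only; the -P stream count and -w window size are assumptions to be adjusted for the NIC and topology under test, not recommendations.
server# iperf3 -s
# Four parallel streams (-P) with an enlarged window (-w); tune both values
client# iperf3 -c server -P 4 -w 512k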
5.1.2 Storage Performance #
Use fio to test individual storage devices to understand
the per-device performance maxima. Do this for all devices to ensure that
none are out of specification, or are connected to expanders which lower
bandwidth. Doing an exhaustive study of different I/O sizes and patterns
would provide the most information about performance expectations, but is
beyond the scope of this document.
We recommend testing at least random 4 kB, random 64 kB, and
sequential 64 kB and 1 MB buffers. This should give a reasonable
overall view of the device’s performance characteristics. When testing, it
is important to use the raw device
(/dev/sdX) and to use the
direct=1 option with multiple jobs to maximize device
performance under stress.
Make sure that the test size (dataset) is large enough to overrun any
caches that may apply. We recommend using the ramp_time
parameter to allow enough time for the cache overrun to occur before
performance measurements begin. This helps ensure that performance
numbers are not tainted by cache-only performance.
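As an illustration of these points, a single-device random 4 kB read test might look like the sketch below. The device name, job count, queue depth, and runtimes are placeholders and must be adjusted to the hardware under test.
# Random 4 kB reads against the raw device, bypassing the page cache (direct=1);
# ramp_time discards the warm-up period so cache effects do not skew the results
fio --name=randread-4k --filename=/dev/sdX --rw=randread --bs=4k \
    --direct=1 --ioengine=libaio --numjobs=4 --iodepth=32 \
    --ramp_time=30 --runtime=120 --time_based --group_reporting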
Run fio against all devices in an OSD node
simultaneously to identify bottlenecks. Performance should scale nearly linearly;
if it does not, check controller firmware and slot placement, and if necessary, split
devices across multiple controllers. This simulates the node under heavy
load.
We recommend using the same I/O patterns and block sizes that the individual devices were tested with. The job count should be a multiple of the total number of devices in the system, to allow for even distribution across all devices and buses.
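One way to drive all devices at once is a fio job file with one section per raw device. The sketch below assumes four data devices and a 64 kB random read pattern; extend the device list to match the node and repeat for the other I/O patterns tested earlier.
# all-devices.fio - run with: fio all-devices.fio
[global]
rw=randread
bs=64k
direct=1
ioengine=libaio
iodepth=32
numjobs=2
ramp_time=30
runtime=120
time_based
group_reporting

[sdb]
filename=/dev/sdb
[sdc]
filename=/dev/sdc
[sdd]
filename=/dev/sdd
[sde]
filename=/dev/sde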
5.1.2.1 Latency Bound and Maximum #
There is value in performing both latency-bound and worst-case-scenario
tests. A particular change may not improve latency, but may instead
enable the cluster to handle more total load even as latencies
continue to rise. The inverse may also be true, where a change measurably
affects the latency of operations. To identify both possibilities,
we recommend running tests that represent both scenarios.
Latency-bounded tests in fio use the following settings:
latency_target=10ms
latency_window=5s
latency_percentile=99
The settings above cause fio to increase the I/O queue
depth until more than 1% of I/Os fail to meet the 10 ms latency target
within a sliding five-second window. It then backs the queue depth down
until the latency target is again maintained.
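Combined with a normal job definition, a latency-bounded test might be expressed as in the following sketch; apart from the three latency_* options quoted above, all values (device, block size, queue depth ceiling) are illustrative.
# latency-bound.fio - fio probes queue depths up to iodepth while holding the target
[latency-randread-4k]
filename=/dev/sdX
rw=randread
bs=4k
direct=1
ioengine=libaio
iodepth=128
latency_target=10ms
latency_window=5s
latency_percentile=99
runtime=120
time_based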
5.2 Kernel Tuning #
There are several aspects of the kernel that can be tuned on both the cluster nodes and some of the clients. It is important to understand that most tuning is about gaining very small incremental improvements, which in aggregate represent a measurable (and hopefully meaningful) improvement in performance. For this document, information on tuning comes from a variety of sources. The primary source is https://documentation.suse.com/sles/15-SP1/html/SLES-all/book-sle-tuning.html. We also used numerous other references, including documentation from hardware vendors.
5.2.1 CPU Mitigations #
One key area of kernel performance tuning is disabling the kernel's side-channel attack mitigations. The bulk of the benefit occurs with smaller I/Os, ranging in size from 4 kB to 64 kB. In particular, 64 kB random read and sequential write performance doubled in a limited test environment using only two client nodes.
Changing these options requires a clear understanding of the security
implications, as they involve disabling mitigations for side-channel
attacks on some CPUs. The need to disable these mitigations may be
minimized with newer processors. You must carefully evaluate whether this
is something needed for the particular hardware being utilized. In the test
configuration, a Salt state was utilized to apply these changes. The
Salt state should be in a subdirectory of /srv/salt
on the Salt master and is applied by using a salt
state.apply command similar to below:
salt '*' state.apply my_kerntune
The Salt state and steps used in this testing can be found in
Appendix A, Salt State for Kernel Tuning. This needs to be adjusted to work in
each customer environment. An example of adjusting the
grub2 configuration file can also be found in the
appendix.
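For reference, and assuming the kernel supports the mitigations= command-line parameter, a grub2-based change could look like the following minimal sketch; the authoritative Salt state and grub2 example are in Appendix A, and the security trade-off must be evaluated first.
# /etc/default/grub (excerpt, sketch only)
GRUB_CMDLINE_LINUX_DEFAULT="... mitigations=off"

# Regenerate the grub2 configuration so the parameter takes effect at next boot
grub2-mkconfig -o /boot/grub2/grub.cfg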
5.2.2 I/O tuning - Multiqueue Block I/O #
The first aspect to tune is to ensure that I/O is flowing in the most optimal pattern. For the cluster used in this test, that means enabling multi-queue block I/O. This is done by adding a boot-time kernel parameter, as described in Section 12.4 of the SLES tuning guide: https://documentation.suse.com/sles/15-SP1/html/SLES-all/cha-tuning-io.html#cha-tuning-io-barrier. This is not an action that should be taken unilaterally on clusters that contain spinning media, due to potential performance degradation for those devices. The general result is that multiple I/O queues are assigned to each device, allowing more requests to be handled simultaneously by devices that can service large numbers of requests, such as SSD and NVMe drives.
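As a sketch, and assuming the scsi_mod.use_blk_mq parameter described in the referenced SLES tuning chapter, the change and a post-reboot check might look like this (the device name is a placeholder):
# /etc/default/grub (excerpt, sketch only)
GRUB_CMDLINE_LINUX_DEFAULT="... scsi_mod.use_blk_mq=1"
grub2-mkconfig -o /boot/grub2/grub.cfg

# After rebooting, a multi-queue scheduler (for example none or mq-deadline) should be listed
cat /sys/block/sdX/queue/scheduler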
5.2.3 SSD Tuning #
To get the best read performance, it may be necessary to adjust the
read_ahead and write cache settings for the SSD devices.
In our particular testing environment, disabling write cache and forcing
read_ahead to 2 MB resulted in the best overall
performance.
Before tuning, it is important to check and record the default values, and measure any differences in performance compared to that baseline.
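For example, the current values can be captured for each device before any changes are made (the device name below is a placeholder):
# Current read-ahead value in kB
cat /sys/block/sdX/queue/read_ahead_kb
# Current write-cache setting
hdparm -W /dev/sdX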
By placing the following file in /etc/udev/rules.d,
devices will be detected according to the model name shown in
/sys/block/{devname}/device/model, and instructed to
disable write caching and set the read_ahead_kb option to
2 MB.
/etc/udev/rules.d/99-ssd.rules
# Setting specific kernel parameters for a subset of block devices (Intel SSDs)
SUBSYSTEM=="block", ATTRS{model}=="INTEL SSDS*", ACTION=="add|change", ATTR{queue/read_ahead_kb}="2048"
SUBSYSTEM=="block", ATTRS{model}=="INTEL SSDS*", ACTION=="add|change", RUN+="/sbin/hdparm -W 0 /dev/%k"5.2.4 Network Stack and Device Tuning #
Proper tuning of the network stack can substantially assist improving the latency and throughput of the cluster. A full script for the testing we performed can be found in Appendix C, Network Tuning.
The first change, and one with the highest impact, is to utilize jumbo frame packets. For this to be done, all interfaces utilizing the cluster must be set to use an MTU of 9000 bytes. Network switches are often set to use 9100 or higher. This is suitable, as they are only passing packets, not creating them.
5.2.4.1 Network Device Tuning #
5.2.4.1.1 Jumbo Frames #
The following Salt command ensures that the bonded interfaces on all nodes under control (including the test load generation nodes) use an MTU of 9000:
salt '*' cmd.run 'ip link set bond0 mtu 9000'
To set this persistently, utilize YaST to set the MTU for the bonded interface.
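Alternatively, assuming the standard SLES interface configuration layout, the MTU can be set directly in the ifcfg file for the bond:
# /etc/sysconfig/network/ifcfg-bond0 (excerpt)
MTU='9000'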
5.2.4.1.2 PCIe Bus Adjustment #
Adjusting the PCIe maximum read request size can provide a slight boost to performance. Be aware that this tuning is card- and slot-specific and must only be done in conjunction with the conditions and instructions supplied by the manufacturer. The maximum PCIe read request size was set with the following Salt commands:
Warning
This should only be done with guidance from the NIC manufacturer and is specific to bus location, driver version and hardware.
salt '*' cmd.run 'setpci -s 5b:00.0 68.w=5936'
salt '*' cmd.run 'setpci -s 5b:00.1 68.w=5936'
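The current contents of the register can also be read back before and after the change for verification; the bus addresses shown match the test hardware and will differ on other systems:
salt '*' cmd.run 'setpci -s 5b:00.0 68.w'
salt '*' cmd.run 'setpci -s 5b:00.1 68.w'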
5.2.4.1.3 TCP RSS #
The next item on the tuning list is helpful in ensuring that a single CPU core is not responsible for all packet processing. A small script is used to spread the I/O across multiple local (from a NUMA perspective) cores.
Note
This is not necessary if the number of queues returned by ls
/sys/class/net/{ifname}/queues/rx-*|wc -l is equal to the
number of physical cores in a single CPU socket.
salt '*' cmd.run 'for j in `cat /sys/class/net/bond0/bonding/slaves`;do \
  LOCAL_CPUS=`cat /sys/class/net/$j/device/local_cpus`;echo $LOCAL_CPUS > \
  /sys/class/net/$j/queues/rx-0/rps_cpus;done'
5.2.4.1.4 Ring Buffers #
Many NIC drivers start with a default value for the receive (RX) and transmit (TX) buffers that is not optimal for high-throughput scenarios, and does not allow enough time for the kernel to drain the buffer before it fills up.
The current and maximum settings can be revealed by issuing the following command to the proper NICs:
ethtool -g eth4
The output from this command should look similar to this:
Ring parameters for eth4:
Pre-set maximums:
RX:             8192
RX Mini:        0
RX Jumbo:       0
TX:             8192
Current hardware settings:
RX:             1024
RX Mini:        0
RX Jumbo:       0
TX:             1024
Here we can see that the NIC supports ring buffers of up to 8192 entries, but is currently using only 1024. To adjust this for the cluster, issue the following commands:
salt '*' cmd.run 'ethtool -G eth4 rx 8192 tx 8192'
salt '*' cmd.run 'ethtool -G eth5 rx 8192 tx 8192'
Setting this value persistently can be achieved via the YaST configuration module:
Figure 5.1: YaST Configuration Module #
Additionally, the settings can be made persistent by editing the
configuration files for the physical interfaces found in
/etc/sysconfig/network. A script can be found in
Appendix B that will change all interfaces to the maximum ring buffer
value.
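As a sketch of such an edit, assuming the standard SLES ifcfg format, the ETHTOOL_OPTIONS variable in the interface configuration file can carry the ring buffer settings:
# /etc/sysconfig/network/ifcfg-eth4 (excerpt)
ETHTOOL_OPTIONS='-G eth4 rx 8192 tx 8192'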
5.2.4.2 Network Stack #
The following settings can all be made persistent by modifying
/etc/sysctl.conf. They are represented as arguments
in a Salt command to allow testing and validation in your environment
before making them permanent.
5.2.4.2.1 Lower TCP Latency #
Setting the TCP low latency option disables IPv4 TCP
pre-queue processing and improves latency. We recommend experimenting
with setting this to both 0 and 1.
In laboratory testing, setting the value to 1 provided
slightly better performance:
salt '*' cmd.run 'sysctl -w net.ipv4.tcp_low_latency=1'
The TCP fastopen option allows data to be sent in
the initial SYN packet, resulting in a slight
improvement in latency:
salt '*' cmd.run 'sysctl -w net.ipv4.tcp_fastopen=1'
5.2.4.2.2 TCP Stack Buffers #
Ensure that the TCP stack has sufficient buffer space to queue both inbound and outbound traffic:
salt '*' cmd.run 'sysctl -w net.ipv4.tcp_rmem="10240 87380 2147483647"'
salt '*' cmd.run 'sysctl -w net.ipv4.tcp_wmem="10240 87380 2147483647"'
5.2.4.2.3 TCP Sequence and Timestamps #
In fast networks, TCP sequence numbers can be re-used in a very short timeframe. The result is that the system thinks a packet has been received out of order, resulting in a drop. TCP timestamps were added to help ensure that packet sequence could be tracked better:
salt '*' cmd.run 'sysctl -w net.ipv4.tcp_timestamps=1'
TCP Selective Acknowledgement is a feature that is
primarily useful for WAN or lower-speed networks. However, disabling it
may have negative effects in other ways:
salt '*' cmd.run 'sysctl -w net.ipv4.tcp_sack=1'
5.2.4.2.4 Kernel Network Buffers and Connections #
Providing plenty of buffer space is a recurring theme in tuning networks
for high performance. The netdev_max_backlog is where
traffic is queued after it has been received by the NIC, but before it is
processed by the protocol stack (IP, TCP, etc):
salt '*' cmd.run 'sysctl -w net.core.netdev_max_backlog=250000'
Other preventative measures for the system and gateway nodes include
ensuring that the maximum connection count is high enough to prevent the
generation of SYN cookies. This is useful to set on
all nodes involved:
salt '*' cmd.run 'sysctl -w net.core.somaxconn=2048'
Increasing the network stack buffers is useful to ensure that sufficient buffers exist for all transactions:
salt '*' cmd.run 'sysctl -w net.core.rmem_max=2147483647'
salt '*' cmd.run 'sysctl -w net.core.wmem_max=2147483647'
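Once the values above have been validated in your environment, a consolidated /etc/sysctl.conf fragment covering the settings discussed in this section might look like the following sketch:
# /etc/sysctl.conf - network tuning values from this section
net.ipv4.tcp_low_latency = 1
net.ipv4.tcp_fastopen = 1
net.ipv4.tcp_rmem = 10240 87380 2147483647
net.ipv4.tcp_wmem = 10240 87380 2147483647
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_sack = 1
net.core.netdev_max_backlog = 250000
net.core.somaxconn = 2048
net.core.rmem_max = 2147483647
net.core.wmem_max = 2147483647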
