
Applies to SUSE Linux Enterprise Server 15 SP6

4 Multi-tier caching for block device operations

A multi-tier cache is a replicated/distributed cache that consists of at least two tiers: one is represented by slower but cheaper rotational block devices (hard disks), while the other is more expensive but performs faster data operations (for example SSD flash disks).

SUSE Linux Enterprise Server implements two different solutions for caching between flash and rotational devices: bcache and lvmcache.

4.1 General terminology

This section explains several terms often used when describing cache-related features:

Migration

Movement of the primary copy of a logical block from one device to the other.

Promotion

Migration from the slow device to the fast device.

Demotion

Migration from the fast device to the slow device.

Origin device

The big and slower block device. It always contains a copy of the logical block, which may be out of date or kept in synchronization with the copy on the cache device (depending on policy).

Cache device

The small and faster block device.

Metadata device

A small device that records which blocks are in the cache, which are dirty, and extra hints for use by the policy object. This information could be put on the cache device as well, but having it separate allows the volume manager to configure it differently, for example as a mirror for extra robustness. The metadata device may only be used by a single cache device.

Dirty block

If a process writes to a block that is held in the cache, the cached block is marked as dirty because it was overwritten in the cache and still needs to be written back to the origin device.

Cache miss

An I/O request is directed to the cache of the cached device first. If the requested data is not found there, it is read from the origin device, which is slow. This is called a cache miss.

Cache hit

When a requested value is found in the cached device's cache, it is served fast. This is called a cache hit.

Cold cache

Cache that holds no values (is empty) and causes cache misses. As the cached block device operations progress, it gets filled with data and becomes warm.

Warm cache

Cache that already holds some values and is likely to result in cache hits.

4.2 Caching modes

Following are the basic caching modes that multi-tier caches use: write-back, write-through, write-around and pass-through.

write-back

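Data written to a cached block go to the cache only, and the block is marked dirty. This is the default caching mode of the dm-cache kernel target; lvmcache, which is built on dm-cache, defaults to write-through (see Section 4.4, “lvmcache”).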

write-through

Writing to a cached block will not complete until it has hit both the origin and cache devices. Clean blocks remain clean with write-through cache.

write-around

A technique similar to write-through, except that write I/O goes directly to permanent storage, bypassing the cache. This prevents the cache from being flooded with write I/O that will not subsequently be re-read. The disadvantage is that a read request for recently written data causes a cache miss: the data must be read from the slower bulk storage, with higher latency.

pass-through

To enable the pass-through mode, the cache needs to be clean. Reading is served from the origin device bypassing the cache. Writing is forwarded to the origin device and 'invalidates' the cache block. Pass-through allows a cache device activation without having to care about data coherency, which is maintained. The cache will gradually become cold as writing takes place. If you can verify the coherency of the cache later, or establish it by using the invalidate_cblocks message, you can switch the cache device to write-through or write-back mode while it is still warm. Otherwise, you can discard the cache contents before switching to the desired caching mode.
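
The four modes differ mainly in which device receives a write. The following sketch restates the descriptions above as a lookup table. The function name write_targets is hypothetical and is not part of bcache or LVM; it is only meant as a memory aid.

```shell
# Illustrative lookup only: which device(s) receive a write in each
# caching mode, as described in this section.
# write_targets is a hypothetical helper, not a bcache or LVM command.
write_targets() {
    case "$1" in
        write-back)    echo "cache" ;;         # block is marked dirty
        write-through) echo "cache origin" ;;  # completes after both
        write-around)  echo "origin" ;;        # cache is bypassed
        pass-through)  echo "origin" ;;        # cached copy is invalidated
        *) echo "unknown mode: $1" >&2; return 1 ;;
    esac
}
```

For example, write_targets write-through prints both devices, because the write completes only after it has reached the cache and the origin device.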

4.3 bcache

bcache is a Linux kernel block layer cache. It allows one or more fast disk drives (such as SSDs) to act as a cache for one or more slower hard disks. bcache supports write-through and write-back, and is independent of the file system used. By default it caches random reads and writes only, which SSDs excel at. It is suitable for desktops, servers, and high end storage arrays as well.

4.3.1 Main features

  • A single cache device can be used to cache an arbitrary number of backing devices. Backing devices can be attached and detached at runtime, while mounted and in use.

  • Recovers from unclean shutdowns—writes are not completed until the cache is consistent with regard to the backing device.

  • Throttles traffic to the SSD if it becomes congested.

  • Highly efficient write-back implementation. Dirty data is always written out in sorted order.

  • Stable and reliable—in production use.

4.3.2 Setting up a bcache device

This section describes steps to set up and manage a bcache device.

  1. Install the bcache-tools package:

    > sudo zypper in bcache-tools
  2. Create a backing device (typically a mechanical drive). The backing device can be a whole device, a partition, or any other standard block device.

    > sudo make-bcache -B /dev/sdb
  3. Create a cache device (typically an SSD disk).

    > sudo make-bcache -C /dev/sdc

    In this example, the default block and bucket sizes of 512 B and 128 KB are used. The block size should match the backing device's sector size, which is usually either 512 B or 4 KB. The bucket size should match the erase block size of the caching device to reduce write amplification. For example, with a hard disk that has 4 KB sectors and an SSD with an erase block size of 2 MB, the command looks as follows:

    > sudo make-bcache --block 4k --bucket 2M -C /dev/sdc
    Tip: Multi-device support

    make-bcache can prepare and register multiple backing devices and a cache device at the same time. In this case you do not need to manually attach the cache device to the backing device afterward:

    > sudo make-bcache -B /dev/sda /dev/sdb -C /dev/sdc
  4. bcache devices show up as

    /dev/bcacheN

    and as

    /dev/bcache/by-uuid/UUID
    /dev/bcache/by-label/LABEL

    You can format and mount bcache devices as usual:

    > sudo mkfs.ext4 /dev/bcache0
    > sudo mount /dev/bcache0 /mnt

    You can control bcache devices through sysfs at /sys/block/bcacheN/bcache.

  5. After both the cache and backing devices are registered, you need to attach the backing device to the related cache set to enable caching:

    > echo CACHE_SET_UUID | sudo tee /sys/block/bcache0/bcache/attach

    where CACHE_SET_UUID is found in /sys/fs/bcache.

  6. By default, bcache uses the write-through caching mode. To change it, for example to write-back, run:

    > echo writeback | sudo tee /sys/block/bcache0/bcache/cache_mode
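
The two sysfs lookups in the steps above can be wrapped in small helpers. This is a minimal sketch, not part of bcache-tools: the function names list_cache_sets and active_cache_mode are hypothetical. It assumes that each registered cache set appears as a directory named after its UUID below /sys/fs/bcache, and that the cache_mode file lists all modes with the active one in square brackets.

```shell
# Hypothetical helpers. The sysfs paths are parameters so that the
# functions can also be tried against a scratch directory.

# Print the UUIDs of all registered cache sets. Cache sets are
# directories below /sys/fs/bcache; plain files there are skipped.
list_cache_sets() {
    root="${1:-/sys/fs/bcache}"
    for entry in "$root"/*; do
        [ -d "$entry" ] && basename "$entry"
    done
    return 0
}

# Print the active caching mode, that is, the entry enclosed in
# square brackets in the cache_mode file.
active_cache_mode() {
    file="${1:-/sys/block/bcache0/bcache/cache_mode}"
    sed -n 's/.*\[\([a-z]*\)\].*/\1/p' "$file"
}
```

Called without arguments, list_cache_sets prints the values that can be used as CACHE_SET_UUID, and active_cache_mode shows whether a change to cache_mode has taken effect on bcache0.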

4.3.3 bcache configuration using sysfs

bcache devices use the sysfs interface to store their runtime configuration values. This way you can change bcache backing and cache disks' behavior or see their usage statistics.

For the complete list of bcache sysfs parameters, see the contents of the /usr/src/linux/Documentation/bcache.txt file, mainly the SYSFS - BACKING DEVICE, SYSFS - BACKING DEVICE STATS, and SYSFS - CACHE DEVICE sections.
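
As an example of reading usage statistics, the following sketch computes a hit percentage from the cache_hits and cache_misses counters of a backing device. The function name bcache_hit_percent is hypothetical, and the default path assumes a device named bcache0 with a stats_total directory.

```shell
# Hypothetical helper: print the cache hit ratio in percent (rounded
# down), computed from the cache_hits and cache_misses counters in a
# bcache statistics directory.
bcache_hit_percent() {
    dir="${1:-/sys/block/bcache0/bcache/stats_total}"
    hits=$(cat "$dir/cache_hits") || return 1
    misses=$(cat "$dir/cache_misses") || return 1
    total=$((hits + misses))
    if [ "$total" -eq 0 ]; then
        echo 0    # no requests counted yet, for example a cold cache
    else
        echo $((100 * hits / total))
    fi
}
```

A low value shortly after setup usually only means that the cache is still cold; the value becomes meaningful once the cache has warmed up.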

4.4 lvmcache

lvmcache is a caching mechanism consisting of logical volumes (LVs). It uses the dm-cache kernel driver and supports write-through (default) and write-back caching modes. lvmcache improves performance of a large and slow LV by dynamically migrating some of its data to a faster and smaller LV. For more information on LVM, see Part II, “Logical volumes (LVM)”.

LVM refers to the small, fast LV as a cache pool LV. The large, slow LV is called the origin LV. Because of requirements from dm-cache, LVM further splits the cache pool LV into two devices: the cache data LV and cache metadata LV. The cache data LV is where copies of data blocks are kept from the origin LV to increase speed. The cache metadata LV holds the accounting information that specifies where data blocks are stored.

4.4.1 Configuring lvmcache

This section describes steps to create and configure LVM-based caching.

  1. Create the origin LV. Create a new LV or use an existing LV to become the origin LV:

    > sudo lvcreate -n ORIGIN_LV -L 100G vg /dev/SLOW_DEV
  2. Create the cache data LV. This LV will hold data blocks from the origin LV. The size of this LV is the size of the cache and will be reported as the size of the cache pool LV.

    > sudo lvcreate -n CACHE_DATA_LV -L 10G vg /dev/FAST
  3. Create the cache metadata LV. This LV will hold cache pool metadata. The size of this LV should be approximately 1000 times smaller than the cache data LV, with a minimum size of 8 MB.

    > sudo lvcreate -n CACHE_METADATA_LV -L 12M vg /dev/FAST

    List the volumes you have created so far:

    > sudo lvs -a vg
    LV                VG   Attr        LSize   Pool Origin
    cache_data_lv     vg   -wi-a-----  10.00g
    cache_metadata_lv vg   -wi-a-----  12.00m
    origin_lv         vg   -wi-a----- 100.00g
  4. Create a cache pool LV. Combine the data and metadata LVs into a cache pool LV. You can set the cache pool LV's behavior at the same time.

    CACHE_POOL_LV takes the name of CACHE_DATA_LV.

    CACHE_DATA_LV is renamed to CACHE_DATA_LV_cdata and becomes hidden.

    CACHE_METADATA_LV is renamed to CACHE_DATA_LV_cmeta and becomes hidden.

    > sudo lvconvert --type cache-pool \
     --poolmetadata vg/cache_metadata_lv vg/cache_data_lv
    > sudo lvs -a vg
    LV                     VG   Attr       LSize   Pool Origin
    cache_data_lv          vg   Cwi---C---  10.00g
    [cache_data_lv_cdata]  vg   Cwi-------  10.00g
    [cache_data_lv_cmeta]  vg   ewi-------  12.00m
    origin_lv              vg   -wi-a----- 100.00g
  5. Create a cache LV. Create a cache LV by linking the cache pool LV to the origin LV.

    The user-accessible cache LV takes the name of ORIGIN_LV, while ORIGIN_LV itself is renamed to ORIGIN_LV_corig and becomes hidden.

    > sudo lvconvert --type cache --cachepool vg/cache_data_lv vg/origin_lv
    > sudo lvs -a vg
    LV                     VG   Attr       LSize   Pool          Origin
    cache_data_lv          vg   Cwi---C---  10.00g
    [cache_data_lv_cdata]  vg   Cwi-ao----  10.00g
    [cache_data_lv_cmeta]  vg   ewi-ao----  12.00m
    origin_lv              vg   Cwi-a-C--- 100.00g cache_data_lv [origin_lv_corig]
    [origin_lv_corig]      vg   -wi-ao---- 100.00g
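
The sizing rule for the cache metadata LV from step 3 can be sketched as a small calculation. The function name cache_meta_size_mb is hypothetical. Rounding up to whole extents of 4 MB is an assumption made for this sketch; pass a different extent size as the second argument if your volume group differs.

```shell
# Hypothetical helper: suggest a cache metadata LV size in MB for a
# cache data LV of the given size in MB.
# Assumption: the result is rounded up to whole extents (4 MB unless
# a second argument is given).
cache_meta_size_mb() {
    cache_mb="$1"
    extent_mb="${2:-4}"
    size=$(( (cache_mb + 999) / 1000 ))   # about 1/1000, rounded up
    if [ "$size" -lt 8 ]; then
        size=8                            # enforce the 8 MB minimum
    fi
    echo $(( (size + extent_mb - 1) / extent_mb * extent_mb ))
}
```

For the 10 GB cache data LV used in this procedure, cache_meta_size_mb 10240 prints 12, which matches the 12M passed to lvcreate in step 3.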

4.4.2 Removing a cache pool

There are several ways to turn off the LV cache.

4.4.2.1 Detaching a cache pool LV from a cache LV

You can disconnect a cache pool LV from a cache LV, leaving an unused cache pool LV and an uncached origin LV. Data are written back from the cache pool to the origin LV when necessary.

> sudo lvconvert --splitcache vg/origin_lv

4.4.2.2 Removing a cache pool LV without removing its origin LV

This writes back data from the cache pool to the origin LV when necessary, then removes the cache pool LV, leaving the uncached origin LV.

> sudo lvremove vg/cache_data_lv

An alternative command that also disconnects the cache pool from the cache LV, and deletes the cache pool:

> sudo lvconvert --uncache vg/origin_lv

4.4.2.3 Removing both the origin LV and the cache pool LV

Removing a cache LV removes both the origin LV and the linked cache pool LV.

> sudo lvremove vg/origin_lv
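
Before removing anything, it can help to check which LVs currently have a cache pool attached. The following sketch filters lvs output that was requested with the lv_name and pool_lv fields and a comma as separator. The function name cached_lvs is hypothetical, and the sketch assumes that the volume group contains no thin pools, which also populate the Pool column.

```shell
# Hypothetical helper: read "LV,POOL" lines on standard input and
# print the names of the LVs that have a pool attached.
# Assumption: the volume group holds no thin volumes, which would
# also report a pool.
cached_lvs() {
    while IFS=, read -r lv pool; do
        lv=$(echo "$lv" | tr -d ' ')      # lvs indents its output
        pool=$(echo "$pool" | tr -d ' ')
        [ -n "$lv" ] && [ -n "$pool" ] && echo "$lv"
    done
    return 0
}
```

A possible invocation is sudo lvs --noheadings --separator , -o lv_name,pool_lv vg | cached_lvs, where vg is the volume group used in the examples above.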

4.4.2.4 More information

You can find more lvmcache-related topics, such as supported cache modes, redundant sub-logical volumes, cache policy, or converting existing LVs to cache types, in the lvmcache manual page (man 7 lvmcache).