16 Tuning the memory management subsystem #
To understand and tune the memory management behavior of the kernel, it is important to first have an overview of how it works and cooperates with other subsystems.
The memory management subsystem, also called the virtual memory manager, will subsequently be called “VM”. The role of the VM is to manage the allocation of physical memory (RAM) for the entire kernel and user programs. It is also responsible for providing a virtual memory environment for user processes (managed via POSIX APIs with Linux extensions). Finally, the VM frees up RAM when there is a shortage, either by trimming caches or swapping out “anonymous” memory.
The most important thing to understand when examining and tuning VM is how its caches are managed. The basic goal of the VM's caches is to minimize the cost of I/O as generated by swapping and file system operations (including network file systems). This is achieved by avoiding I/O or by submitting I/O in better patterns.
Free memory is used and filled up by these caches as required. The more memory is available for caches and anonymous memory, the more effectively the caches and swapping operate. However, if a memory shortage is encountered, the caches are trimmed or the memory is swapped out.
For a particular workload, the first thing that can be done to improve performance is to increase memory and reduce the frequency with which memory must be trimmed or swapped. The second thing is to change the way caches are managed by changing kernel parameters.
Finally, the workload itself should be examined and tuned as well. If an application is allowed to run more processes or threads, the effectiveness of VM caches can be reduced if each process is operating in its own area of the file system. Memory overheads are also increased. If applications allocate their own buffers or caches, larger caches mean that less memory is available for VM caches. However, more processes and threads can mean more opportunity to overlap and pipeline I/O, and may take better advantage of multiple cores. Experimentation is required for the best results.
16.1 Memory usage #
Memory allocations can be characterized as “pinned” (also known as “unreclaimable”), “reclaimable” or “swappable”.
16.1.1 Anonymous memory #
    Anonymous memory tends to be program heap and stack memory (for example,
    malloc()). It is reclaimable, except in special
    cases such as mlock or if there is no available swap
    space. Anonymous memory must be written to swap before it can be
    reclaimed. Swap I/O (both swapping in and swapping out pages) tends to
    be less efficient than pagecache I/O, because of allocation and access
    patterns.
   
16.1.2 Pagecache #
The pagecache is a cache of file data. When a file is read from disk or network, the contents are stored in pagecache. No disk or network access is required if the contents are up-to-date in pagecache. tmpfs and shared memory segments count toward pagecache.
When a file is written to, the new data is stored in pagecache before being written back to a disk or the network (making it a write-back cache). When a page has new data not written back yet, it is called “dirty”. Pages not classified as dirty are “clean”. Clean pagecache pages can be reclaimed if there is a memory shortage by simply freeing them. Dirty pages must first be made clean before being reclaimed.
16.1.3 Buffercache #
This is a type of pagecache for block devices (for example, /dev/sda). A file system typically uses the buffercache when accessing its on-disk metadata structures such as inode tables, allocation bitmaps, and so forth. Buffercache can be reclaimed similarly to pagecache.
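The memory types described so far can be observed in /proc/meminfo. The following is a minimal sketch, assuming the usual field names AnonPages, Cached, Buffers and Dirty. The function name meminfo_summary is hypothetical and only used for illustration.

```shell
# Print the /proc/meminfo fields that roughly correspond to anonymous memory,
# pagecache, buffercache and dirty pagecache (values are in kB).
# Usage: meminfo_summary [FILE]   (FILE defaults to /proc/meminfo)
meminfo_summary() {
    awk '
        $1 == "AnonPages:" { print "anonymous_kB=" $2 }
        $1 == "Cached:"    { print "pagecache_kB=" $2 }
        $1 == "Buffers:"   { print "buffercache_kB=" $2 }
        $1 == "Dirty:"     { print "dirty_kB=" $2 }
    ' "${1:-/proc/meminfo}"
}

# Only run against the live system if the file is available.
if [ -r /proc/meminfo ]; then
    meminfo_summary
fi
```

The lines are printed in the order in which the fields appear in the file.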
16.1.4 Buffer heads #
Buffer heads are small auxiliary structures that tend to be allocated upon pagecache access. They can generally be reclaimed easily when the pagecache or buffercache pages are clean.
16.1.5 Writeback #
As applications write to files, the pagecache becomes dirty and the buffercache may become dirty. When the amount of dirty memory reaches a specified size in bytes (vm.dirty_background_bytes), or when the amount of dirty memory reaches a specific ratio to total memory (vm.dirty_background_ratio), or when the pages have been dirty for longer than a specified amount of time (vm.dirty_expire_centisecs), the kernel begins writeback of pages, starting with the files whose pages were dirtied first. The background bytes and ratio settings are mutually exclusive, and setting one overwrites the other. Flusher threads perform writeback in the background and allow applications to continue running. If the I/O cannot keep up with applications dirtying pagecache, and dirty data reaches a critical setting (vm.dirty_bytes or vm.dirty_ratio), then applications begin to be throttled to prevent dirty data exceeding this threshold.
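To see which thresholds are in effect, the tunables named above can be read from /proc/sys/vm. The following is a minimal sketch. The function name show_writeback_tunables is hypothetical, and the optional directory argument exists only so that the function can be pointed at a copy of the files.

```shell
# Print the writeback tunables as name=value lines.
# Usage: show_writeback_tunables [DIR]   (DIR defaults to /proc/sys/vm)
show_writeback_tunables() {
    dir="${1:-/proc/sys/vm}"
    for name in dirty_background_bytes dirty_background_ratio \
                dirty_expire_centisecs dirty_bytes dirty_ratio; do
        # Skip tunables that are not present or not readable.
        if [ -r "$dir/$name" ]; then
            echo "vm.$name=$(cat "$dir/$name")"
        fi
    done
}

show_writeback_tunables
```

A value of 0 for one member of a bytes/ratio pair indicates that the other member is the one in use.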
16.1.6 Readahead #
The VM monitors file access patterns and may attempt to perform readahead. Readahead reads pages from the file system into the pagecache before they have been requested. This allows fewer, larger I/O requests to be submitted, which is more efficient. It also allows I/O to be pipelined, that is, performed at the same time as the application is running.
16.1.7 VFS caches #
16.1.7.1 Inode cache #
This is an in-memory cache of the inode structures for each file system. These contain attributes such as the file size, permissions and ownership, and pointers to the file data.
16.1.7.2 Directory entry cache #
This is an in-memory cache of the directory entries in the system. These contain a name (the name of a file), the inode which it refers to, and children entries. This cache is used when traversing the directory structure and accessing a file by name.
16.2 Reducing memory usage #
16.2.1 Reducing malloc (anonymous) usage #
    Applications running on SUSE Linux Enterprise Server 15 SP6 can allocate
    more memory compared to older releases. This is because of
    glibc changing its default
    behavior while allocating user space memory. See
    https://www.gnu.org/s/libc/manual/html_node/Malloc-Tunable-Parameters.html
    for explanation of these parameters.
   
    To restore behavior similar to older releases, M_MMAP_THRESHOLD should
    be set to 128*1024. This can be done with a mallopt() call from the
    application, or by setting the MALLOC_MMAP_THRESHOLD_
    environment variable before running the application.
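
The environment variable approach can be sketched as a small wrapper. This is an illustration only: the function name with_mmap_threshold is hypothetical, and the command passed to it stands in for the real application.

```shell
# Run a command with the glibc mmap threshold set to 128*1024 bytes.
# Usage: with_mmap_threshold COMMAND [ARGS...]
with_mmap_threshold() {
    env MALLOC_MMAP_THRESHOLD_=$((128 * 1024)) "$@"
}

# Show the value that the child process sees in its environment.
with_mmap_threshold sh -c 'echo "MALLOC_MMAP_THRESHOLD_=$MALLOC_MMAP_THRESHOLD_"'
```

The variable only affects processes started through the wrapper, so the setting can be tried on one application without changing the rest of the system.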
   
16.2.2 Reducing kernel memory overheads #
Kernel memory that is reclaimable (caches, described above) is trimmed automatically during memory shortages. Most other kernel memory cannot be easily reduced but is a property of the workload given to the kernel.
Reducing the requirements of the user space workload reduces the kernel memory usage (fewer processes, fewer open files and sockets, etc.).
16.2.3 Memory controller (memory cgroups) #
If the memory cgroups feature is not needed, it can be switched off by passing cgroup_disable=memory on the kernel command line, reducing memory consumption of the kernel a bit. There is also a slight performance benefit as there is a small amount of accounting overhead when memory cgroups are available even if none are configured.
16.3 Virtual memory manager (VM) tunable parameters #
When tuning the VM, it should be understood that certain changes take time to affect the workload and take full effect. If the workload changes throughout the day, it may behave differently at different times. A change that increases throughput under certain conditions may decrease it under other conditions.
16.3.1 Reclaim ratios #
- /proc/sys/vm/swappiness
- This control is used to define how aggressively the kernel swaps out anonymous memory relative to pagecache and other caches. Increasing the value increases the amount of swapping. The default value is 60. Swap I/O tends to be much less efficient than other I/O. However, certain pagecache pages are accessed much more frequently than less used anonymous memory. The right balance should be found here. If swap activity is observed during slowdowns, it may be worth reducing this parameter. If there is a lot of I/O activity and the amount of pagecache in the system is rather small, or if there are large dormant applications running, increasing this value can improve performance. The more data is swapped out, the longer the system takes to swap data back in when it is needed.
- /proc/sys/vm/vfs_cache_pressure
- This variable controls the tendency of the kernel to reclaim the memory which is used for caching of VFS caches, versus pagecache and swap. Increasing this value increases the rate at which VFS caches are reclaimed. It is difficult to know when this should be changed, other than by experimentation. The slabtop command (part of the package procps) shows top memory objects used by the kernel. The VFS caches are the "dentry" and the "*_inode_cache" objects. If these are consuming a large amount of memory in relation to pagecache, it may be worth trying to increase pressure. Doing so can also help to reduce swapping. The default value is 100.
- /proc/sys/vm/min_free_kbytes
- This controls the amount of memory that is kept free for use by special reserves including “atomic” allocations (those which cannot wait for reclaim). This should not normally be lowered unless the system is being carefully tuned for memory usage (normally useful for embedded rather than server applications). If “page allocation failure” messages and stack traces are frequently seen in logs, min_free_kbytes could be increased until the errors disappear. There is no need for concern if these messages are infrequent. The default value depends on the amount of RAM. 
- /proc/sys/vm/watermark_scale_factor
- Broadly speaking, free memory has high, low and min watermarks. When the low watermark is reached, kswapd wakes to reclaim memory in the background. It stays awake until free memory reaches the high watermark. Applications will stall and reclaim memory when the min watermark is reached. The watermark_scale_factor defines the amount of memory left in a node/system before kswapd is woken up and how much memory needs to be free before kswapd goes back to sleep. The unit is in fractions of 10,000. The default value of 10 means the distances between watermarks are 0.1% of the available memory in the node/system. The maximum value is 1000, or 10% of memory. Workloads that frequently stall in direct reclaim, accounted by allocstall in /proc/vmstat, may benefit from altering this parameter. Similarly, if kswapd is sleeping prematurely, as accounted for by kswapd_low_wmark_hit_quickly, then it may indicate that the number of pages kept free to avoid stalls is too low.
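The arithmetic behind watermark_scale_factor can be sketched as follows. This is a simplification that assumes one figure for the available memory, whereas the kernel calculates watermarks per zone and applies further limits, so the result is only an estimate. The function name watermark_distance_kb is hypothetical.

```shell
# Estimate the distance between watermarks in kB.
# The scale factor is expressed in fractions of 10,000.
# Usage: watermark_distance_kb MEM_KB SCALE_FACTOR
watermark_distance_kb() {
    echo $(( $1 * $2 / 10000 ))
}

# A node with 16 GiB (16777216 kB) and the default factor of 10 (0.1%).
watermark_distance_kb 16777216 10
# The same node with the maximum factor of 1000 (10%).
watermark_distance_kb 16777216 1000
```

Comparing the two results gives a feel for how much additional memory kswapd tries to keep free when the factor is raised.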
16.3.2 Writeback parameters #
One important change in writeback behavior since SUSE Linux Enterprise Server 10 is that modification to file-backed mmap() memory is accounted immediately as dirty memory (and subject to writeback). Previously, it would only be subject to writeback after it was unmapped, upon an msync() system call, or under heavy memory pressure.
Some applications do not expect mmap modifications to be subject to such writeback behavior, and performance can be reduced. Increasing writeback ratios and times can improve this type of slowdown.
- /proc/sys/vm/dirty_background_ratio
- This is the percentage of the total amount of free and reclaimable memory. When the amount of dirty pagecache exceeds this percentage, writeback threads start writing back dirty memory. The default value is 10 (%).
- /proc/sys/vm/dirty_background_bytes
- This contains the amount of dirty memory at which the background kernel flusher threads start writeback. dirty_background_bytes is the counterpart of dirty_background_ratio. If one of them is set, the other one will automatically be read as 0.
- /proc/sys/vm/dirty_ratio
- Similar percentage value as for dirty_background_ratio. When this is exceeded, applications that want to write to the pagecache are blocked and wait for kernel background flusher threads to reduce the amount of dirty memory. The default value is 20 (%).
- /proc/sys/vm/dirty_bytes
- This file controls the same tunable as dirty_ratio, however the amount of dirty memory is in bytes as opposed to a percentage of reclaimable memory. Since both dirty_ratio and dirty_bytes control the same tunable, if one of them is set, the other one is automatically read as 0. The minimum value allowed for dirty_bytes is two pages (in bytes); any value lower than this limit is ignored and the old configuration will be retained.
- /proc/sys/vm/dirty_expire_centisecs
- The data which has been dirty in-memory for longer than this interval is written out next time a flusher thread wakes up. Expiration is measured based on the modification time of a file's inode. Therefore, multiple dirtied pages from the same file are all written when the interval is exceeded. 
    dirty_background_ratio and
    dirty_ratio together determine the pagecache
    writeback behavior. If these values are increased, more dirty memory is
    kept in the system for a longer time. With more dirty memory allowed in
    the system, the chance to improve throughput by avoiding writeback I/O
    and by submitting more optimal I/O patterns increases. However, more
    dirty memory can harm latency, either when memory needs to be reclaimed
    or at points of data integrity (“synchronization points”) when it
    needs to be written back to disk.
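
To get a feel for what the two percentages mean in absolute terms, they can be applied to an estimate of free and reclaimable memory. The following sketch assumes that this amount can be approximated as MemFree + Active(file) + Inactive(file) from /proc/meminfo. That is an approximation which ignores kernel reserves, so the results are indicative only. The function name dirty_limits_kb is hypothetical.

```shell
# Estimate the background and throttling thresholds in kB.
# Usage: dirty_limits_kb BACKGROUND_RATIO RATIO [MEMINFO_FILE]
dirty_limits_kb() {
    awk -v bg="$1" -v fg="$2" '
        # Assumption: free memory plus file-backed LRU pages approximates
        # the "free and reclaimable" memory the ratios are applied to.
        $1 == "MemFree:" || $1 == "Active(file):" || $1 == "Inactive(file):" {
            dirtyable += $2
        }
        END {
            printf "dirtyable_kB=%d\n", dirtyable
            printf "background_limit_kB=%d\n", dirtyable * bg / 100
            printf "throttle_limit_kB=%d\n", dirtyable * fg / 100
        }
    ' "${3:-/proc/meminfo}"
}

# Apply the default ratios of 10 and 20 to the live system, if available.
if [ -r /proc/meminfo ]; then
    dirty_limits_kb 10 20
fi
```

Because free and reclaimable memory changes constantly, the absolute thresholds move with the workload even when the ratios stay the same.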
   
16.3.3 Timing differences of I/O writes between SUSE Linux Enterprise 12 and SUSE Linux Enterprise 11 #
    The system is required to limit what percentage of the system's memory
    contains file-backed data that needs writing to disk. This guarantees
    that the system can always allocate the necessary data structures to
    complete I/O. The maximum amount of memory that can be dirty and
    requires writing at any time is controlled by
    vm.dirty_ratio
    (/proc/sys/vm/dirty_ratio). The defaults are:
   
SLE-11-SP3:     vm.dirty_ratio = 40
SLE-12:         vm.dirty_ratio = 20
    The primary advantage of using the lower ratio in SUSE Linux Enterprise 12 is that
    page reclamation and allocation in low memory situations completes
    faster as there is a higher probability that old clean pages are
    quickly found and discarded. The secondary advantage is that if all
    data on the system must be synchronized, then the time to complete the
    operation on SUSE Linux Enterprise 12 is lower than SUSE Linux Enterprise 11 SP3 by default.
    Most workloads will not notice this change as data is synchronized with
    fsync() by the application or data is not dirtied
    quickly enough to hit the limits.
   
    There are exceptions, and if your application is affected by this, it
    can manifest as an unexpected stall during writes. To confirm that it is
    affected by dirty data rate limiting, monitor
    /proc/PID_OF_APPLICATION/stack.
    An affected application spends significant time in
    balance_dirty_pages_ratelimited. If this is observed
    and it is a problem, then increase the value of
    vm.dirty_ratio to 40 to restore the SUSE Linux Enterprise 11 SP3
    behavior.
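
Monitoring the stack file can be sketched as a simple sampling loop. The function name count_throttled_samples and the PID 1234 in the usage comment are placeholders, and reading /proc/PID/stack normally requires root privileges.

```shell
# Count how many of N samples of a stack file show dirty-page throttling.
# Usage: count_throttled_samples STACK_FILE [SAMPLES] [INTERVAL_SECONDS]
count_throttled_samples() {
    file="$1"
    samples="${2:-10}"
    interval="${3:-1}"
    hits=0
    i=0
    while [ "$i" -lt "$samples" ]; do
        # A sample counts as a hit if the throttling function is on the stack.
        if grep -q balance_dirty_pages_ratelimited "$file" 2>/dev/null; then
            hits=$((hits + 1))
        fi
        i=$((i + 1))
        sleep "$interval"
    done
    echo "$hits/$samples"
}

# Example (as root; 1234 is a placeholder PID):
# count_throttled_samples /proc/1234/stack 30 1
```

A high proportion of hits while the application is writing suggests that it is being throttled. An occasional hit is expected and not a concern on its own.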
   
The overall I/O throughput is the same regardless of the setting. The only difference is the timing of when the I/O is queued.
    This is an example of using dd to asynchronously
    write 30% of memory to disk which would happen to be affected by the
    change in vm.dirty_ratio:
   
# MEMTOTAL_MBYTES=`free -m | grep Mem: | awk '{print $2}'`
# sysctl vm.dirty_ratio=40
# dd if=/dev/zero of=zerofile ibs=1048576 count=$((MEMTOTAL_MBYTES*30/100))
2507145216 bytes (2.5 GB) copied, 8.00153 s, 313 MB/s
# sysctl vm.dirty_ratio=20
# dd if=/dev/zero of=zerofile ibs=1048576 count=$((MEMTOTAL_MBYTES*30/100))
2507145216 bytes (2.5 GB) copied, 10.1593 s, 247 MB/s
    The parameter affects the time it takes for the command to
    complete and the apparent write speed of the device. With
    dirty_ratio=40, more of the data is cached and
    written to disk in the background by the kernel. The speed of I/O is identical in both cases. To
    demonstrate, this is the result when dd synchronizes
    the data before exiting:
   
# sysctl vm.dirty_ratio=40
# dd if=/dev/zero of=zerofile ibs=1048576 count=$((MEMTOTAL_MBYTES*30/100)) conv=fdatasync
2507145216 bytes (2.5 GB) copied, 21.0663 s, 119 MB/s
# sysctl vm.dirty_ratio=20
# dd if=/dev/zero of=zerofile ibs=1048576 count=$((MEMTOTAL_MBYTES*30/100)) conv=fdatasync
2507145216 bytes (2.5 GB) copied, 21.7286 s, 115 MB/s
    As observed, dirty_ratio had almost no impact here, and
    the difference is within the natural variability of a command. Hence,
    dirty_ratio does not directly impact I/O performance
    but it may affect the apparent performance of a workload that writes
    data asynchronously without synchronizing.
   
16.3.4 Readahead parameters #
- /sys/block/<bdev>/queue/read_ahead_kb
- If one or more processes are sequentially reading a file, the kernel reads certain data in advance (ahead) to reduce the amount of time that processes need to wait for data to be available. The actual amount of data being read in advance is computed dynamically, based on the extent of sequentiality of the I/O. This parameter sets the maximum amount of data that the kernel reads ahead for a single file. If you observe that large sequential reads from a file are not fast enough, you can try increasing this value. Increasing it too far may result in readahead thrashing where pagecache used for readahead is reclaimed before it can be used, or slowdowns because of a large amount of useless I/O. The default value is 512 (KB).
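Before changing the readahead limit, it helps to see the current value for each block device. The following is a minimal sketch. The function name show_readahead is hypothetical, and the device name sda and the value 1024 in the final comment are placeholders.

```shell
# Print the device name and read_ahead_kb for every block device found.
# Usage: show_readahead [SYSFS_BLOCK_DIR]   (defaults to /sys/block)
show_readahead() {
    for f in "${1:-/sys/block}"/*/queue/read_ahead_kb; do
        # Skip the unexpanded pattern if no device matches.
        [ -r "$f" ] || continue
        dev=${f%/queue/read_ahead_kb}
        echo "${dev##*/} $(cat "$f")"
    done
}

show_readahead

# To change the limit for one device (as root; placeholder values):
# echo 1024 > /sys/block/sda/queue/read_ahead_kb
```

A value written this way does not persist across reboots, which makes it convenient for experiments.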
16.3.5 Transparent HugePage parameters #
     Transparent HugePages (THP) provide a way to dynamically allocate huge
     pages either on‑demand by the process or deferring the allocation
     until later via the khugepaged kernel thread. This
     method is distinct from the use of hugetlbfs to
     manually manage their allocation and use. Workloads with contiguous memory
     access patterns can benefit greatly from THP. A 1000-fold decrease in page
     faults can be observed when running synthetic workloads with contiguous
     memory access patterns.
   
There are cases when THP may be undesirable. Workloads with sparse memory access patterns can perform poorly with THP due to excessive memory usage. For example, 2 MB of memory may be used at fault time instead of 4 KB for each fault and ultimately lead to premature page reclaim. On releases older than SUSE Linux Enterprise 12 SP2, it was possible for an application to stall for long periods of time trying to allocate a THP which frequently led to a recommendation of disabling THP. Such recommendations should be re-evaluated for SUSE Linux Enterprise 12 SP3 and later releases.
     The behavior of THP may be configured via the
     transparent_hugepage= kernel parameter or via
     sysfs. For example, it may be disabled by adding the kernel parameter
     transparent_hugepage=never, rebuilding your grub2
     configuration, and rebooting. Verify if THP is disabled with:
    
# cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]
      If disabled, the value never is shown
      in square brackets like in the example above. A value of
      always means that the kernel always tries to use THP at fault
      time but defers to khugepaged if the allocation
      fails. A value of madvise will only allocate THP
      for address spaces explicitly specified by an application.
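
In scripts, the active setting can be extracted by picking out the value in square brackets. The following is a minimal sketch; the function name thp_active_value is hypothetical.

```shell
# Print the value shown in square brackets in a THP sysfs file.
# Usage: thp_active_value [FILE]
thp_active_value() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' \
        "${1:-/sys/kernel/mm/transparent_hugepage/enabled}"
}

# Only query the live system if the file is available.
if [ -r /sys/kernel/mm/transparent_hugepage/enabled ]; then
    thp_active_value
fi
```

The same function can be pointed at /sys/kernel/mm/transparent_hugepage/defrag, which uses the same bracket notation.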
     
- /sys/kernel/mm/transparent_hugepage/defrag
- This parameter controls how much effort an application commits when allocating a THP. A value of always is the default for SUSE Linux Enterprise 12 SP1 and earlier releases that supported THP. If a THP is not available, the application tries to defragment memory. It potentially incurs large stalls in an application if the memory is fragmented and a THP is not available. A value of madvise means that THP allocation requests will only defragment if the application explicitly requests it. This is the default for SUSE Linux Enterprise 12 SP2 and later releases. The value defer is only available on SUSE Linux Enterprise 12 SP2 and later releases. If a THP is not available, the application falls back to using small pages. It wakes the kswapd and kcompactd kernel threads to defragment memory in the background and a THP will be allocated later by khugepaged. The final option never uses small pages if a THP is unavailable but no other action will take place.
16.3.6 khugepaged parameters #
    khugepaged is automatically started when
    transparent_hugepage is set to
    always or madvise, and it will be
    automatically shut down if it is set to never. Normally
    this runs at low frequency but the behavior can be tuned.
   
- /sys/kernel/mm/transparent_hugepage/khugepaged/defrag
- A value of 0 will disable khugepaged even though THP may still be used at fault time. This may be important for latency-sensitive applications that benefit from THP but cannot tolerate a stall if khugepaged tries to update an application's memory usage.
- /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
- This parameter controls how many pages are scanned by khugepaged in a single pass. A scan identifies small pages that can be reallocated as THP. Increasing this value will allocate THP in the background faster at the cost of CPU usage.
- /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
- khugepaged sleeps for a short interval specified by this parameter after each pass to limit its CPU usage. Reducing this value allocates THP in the background faster at the cost of CPU usage. A value of 0 will force continual scanning.
- /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
- This parameter controls how long khugepaged will sleep if it fails to allocate a THP in the background, waiting for kswapd and kcompactd to take action.
     The remaining parameters for khugepaged are rarely
     useful for performance tuning but are fully documented in
     /usr/src/linux/Documentation/vm/transhuge.txt
   
16.3.7 Further VM parameters #
    For the complete list of the VM tunable parameters, see
    /usr/src/linux/Documentation/sysctl/vm.txt
    (available after having installed the
    kernel-source package).
   
16.4 Monitoring VM behavior #
Some simple tools that can help monitor VM behavior:
- vmstat: This tool gives a good overview of what the VM is doing. See Section 2.1.1, “vmstat” for details.
- /proc/meminfo: This file gives a detailed breakdown of where memory is being used. See Section 2.4.2, “Detailed memory usage: /proc/meminfo” for details.
- slabtop: This tool provides detailed information about kernel slab memory usage. buffer_head, dentry, inode_cache, ext3_inode_cache, etc. are the major caches. This command is available with the package procps.
- /proc/vmstat: This file gives a detailed breakdown of internal VM behavior. The information contained within is implementation specific and may not always be available. Some information is duplicated in /proc/meminfo and other information can be presented in a friendly fashion by utilities. For maximum utility, this file needs to be monitored over time to observe rates of change. The most important pieces of information that are hard to derive from other sources are as follows:
- pgscan_kswapd_*, pgsteal_kswapd_*
- These report respectively the number of pages scanned and reclaimed by kswapd since the system started. The ratio between these values can be interpreted as the reclaim efficiency with a low efficiency implying that the system is struggling to reclaim memory and may be thrashing. Light activity here is generally not something to be concerned with.
- pgscan_direct_*, pgsteal_direct_*
- These report respectively the number of pages scanned and reclaimed by an application directly. This is correlated with increases in the allocstall counter. This is more serious than kswapd activity as these events indicate that processes are stalling. Heavy activity here combined with kswapd activity and high rates of pgpgin, pgpgout and/or high rates of pswpin or pswpout are signs that a system is thrashing heavily. More detailed information can be obtained using tracepoints.
- thp_fault_alloc, thp_fault_fallback
- These counters correspond to how many THPs were allocated directly by an application and how many times a THP was not available and small pages were used. Generally a high fallback rate is harmless unless the application is sensitive to TLB pressure. 
- thp_collapse_alloc, thp_collapse_alloc_failed
- These counters correspond to how many THPs were allocated by khugepaged and how many times a THP was not available and small pages were used. A high fallback rate implies that the system is fragmented and THPs are not being used even when the memory usage by applications would allow them. It is only a problem for applications that are sensitive to TLB pressure.
- compact_*_scanned, compact_stall, compact_fail, compact_success
- These counters may increase when THP is enabled and the system is fragmented. compact_stall is incremented when an application stalls allocating THP. The remaining counters account for pages scanned and the number of defragmentation events that succeeded or failed.
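The reclaim efficiency mentioned for the pgscan and pgsteal counters can be computed from /proc/vmstat as the percentage of scanned pages that were reclaimed. The following is a minimal sketch. The function name reclaim_efficiency is hypothetical, and the exact counter names vary between kernel versions, so the counters are matched by prefix and summed.

```shell
# Print reclaim efficiency (pages reclaimed per 100 pages scanned).
# Usage: reclaim_efficiency [VMSTAT_FILE]   (defaults to /proc/vmstat)
reclaim_efficiency() {
    awk '
        $1 ~ /^pgscan_kswapd/  { ks_scan  += $2 }
        $1 ~ /^pgsteal_kswapd/ { ks_steal += $2 }
        # pgscan_direct_throttle counts events, not pages, so it is excluded.
        $1 ~ /^pgscan_direct/ && $1 != "pgscan_direct_throttle" { d_scan += $2 }
        $1 ~ /^pgsteal_direct/ { d_steal  += $2 }
        END {
            if (ks_scan > 0)
                printf "kswapd_efficiency_pct=%d\n", 100 * ks_steal / ks_scan
            if (d_scan > 0)
                printf "direct_efficiency_pct=%d\n", 100 * d_steal / d_scan
        }
    ' "${1:-/proc/vmstat}"
}

# Only query the live system if the file is available.
if [ -r /proc/vmstat ]; then
    reclaim_efficiency
fi
```

The counters are cumulative since boot, so the result is a long-term average. To observe current behavior, take two snapshots of the file some time apart and compare the differences instead.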