23 Clustered file system #
This chapter describes administration tasks that are normally performed after the cluster is set up and CephFS exported. If you need more information on setting up CephFS, refer to Book “Deployment Guide”, Chapter 8 “Deploying the remaining core services using cephadm”, Section 8.3.3 “Deploying Metadata Servers”.
23.1 Mounting CephFS #
When the file system is created and the MDS is active, you are ready to mount the file system from a client host.
23.1.1 Preparing the client #
If the client host is running SUSE Linux Enterprise 12 SP2 or later, the system is ready to mount CephFS 'out of the box'.
If the client host is running SUSE Linux Enterprise 12 SP1, you need to apply all the latest patches before mounting CephFS.
In any case, everything needed to mount CephFS is included in SUSE Linux Enterprise. The SUSE Enterprise Storage 7.1 product is not needed.
    To support the full mount syntax, the
    ceph-common package (which is shipped with SUSE Linux Enterprise) should
    be installed before trying to mount CephFS.
   
     Without the ceph-common package (and thus without the
     mount.ceph helper), the monitors' IPs will need to be
     used instead of their names. This is because the kernel client will be
     unable to perform name resolution.
    
The basic mount syntax is:
# mount -t ceph MON1_IP[:PORT],MON2_IP[:PORT],...:CEPHFS_MOUNT_TARGET \
MOUNT_POINT -o name=CEPHX_USER_NAME,secret=SECRET_STRING
23.1.2 Creating a secret file #
The Ceph cluster runs with authentication turned on by default. You should create a file that stores your secret key (not the keyring itself). To obtain the secret key for a particular user and then create the file, do the following:
- View the key for the particular user in a keyring file:
  cephuser@adm > cat /etc/ceph/ceph.client.admin.keyring
- Copy the key of the user who will be using the mounted CephFS file system. Usually, the key looks similar to the following:
  AQCj2YpRiAe6CxAA7/ETt7Hcl9IyxyYciVs47w==
- Create a file with the user name as part of the file name, for example /etc/ceph/admin.secret for the user admin.
- Paste the key value into the file created in the previous step.
- Set proper access rights on the file. The user should be the only one who can read the file; no one else should have any access rights.
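The steps above can also be scripted. The following is a minimal sketch, not an official tool: the function names extract_key and write_secret_file are hypothetical, and it assumes the keyring stores the key on a line of the form "key = VALUE".

```shell
#!/bin/sh
# Hypothetical helpers, assuming the keyring has a line "key = VALUE".

# Print the value of the first "key = ..." line found in a keyring file.
extract_key() {
    awk '$1 == "key" && $2 == "=" { print $3; exit }' "$1"
}

# Write the key to a secret file that only its owner can read or write.
# Usage: write_secret_file KEYRING SECRET_FILE
write_secret_file() {
    keyring=$1
    secret_file=$2
    (
        # Restrict permissions before the file is created.
        umask 077
        extract_key "$keyring" > "$secret_file"
    )
    # Also tighten permissions if the file already existed.
    chmod 600 "$secret_file"
}
```

For example, write_secret_file /etc/ceph/ceph.client.admin.keyring /etc/ceph/admin.secret would produce the secret file used in the examples of this chapter.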
23.1.3 Mounting CephFS #
    You can mount CephFS with the mount command. You need
    to specify the monitor host name or IP address. Because the
    cephx authentication is enabled by default in
    SUSE Enterprise Storage, you need to specify a user name and their related secret as
    well:
   
# mount -t ceph ceph_mon1:6789:/ /mnt/cephfs \
 -o name=admin,secret=AQATSKdNGBnwLhAAnNDKnH65FmVKpXZJVasUeQ==
Because the previous command remains in the shell history, a more secure approach is to read the secret from a file:
# mount -t ceph ceph_mon1:6789:/ /mnt/cephfs \
 -o name=admin,secretfile=/etc/ceph/admin.secret
Note that the secret file should contain only the actual keyring secret. In our example, the file will then contain only the following line:
AQATSKdNGBnwLhAAnNDKnH65FmVKpXZJVasUeQ==
     It is a good idea to specify multiple monitors separated by commas on the
     mount command line in case one monitor happens to be
     down at the time of mount. Each monitor address takes the form
     host[:port]. If the port is not specified, it defaults
     to 6789.
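When the monitor list is kept in a script, the comma-separated source string can be assembled from individual addresses. The sketch below is illustrative only: cephfs_mount_cmd is a hypothetical helper that prints the mount command rather than running it, so the result can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) a CephFS mount command.
# Usage: cephfs_mount_cmd PATH MOUNT_POINT USER SECRET_FILE MONITOR...
cephfs_mount_cmd() {
    path=$1
    mount_point=$2
    user=$3
    secret_file=$4
    shift 4
    # Join the remaining arguments (monitor addresses) with commas.
    monitors=$(IFS=,; printf '%s' "$*")
    printf 'mount -t ceph %s:%s %s -o name=%s,secretfile=%s\n' \
        "$monitors" "$path" "$mount_point" "$user" "$secret_file"
}

# Example:
#   cephfs_mount_cmd / /mnt/cephfs admin /etc/ceph/admin.secret ceph_mon1 ceph_mon2:6789
# prints:
#   mount -t ceph ceph_mon1,ceph_mon2:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret
```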
    
Create the mount point on the local host:
# mkdir /mnt/cephfs
Mount the CephFS:
# mount -t ceph ceph_mon1:6789:/ /mnt/cephfs \
 -o name=admin,secretfile=/etc/ceph/admin.secret
    A subdirectory subdir may be specified if a subset of
    the file system is to be mounted:
   
# mount -t ceph ceph_mon1:6789:/subdir /mnt/cephfs \
 -o name=admin,secretfile=/etc/ceph/admin.secret
    You can specify more than one monitor host in the mount
    command:
   
# mount -t ceph ceph_mon1,ceph_mon2,ceph_mon3:6789:/ /mnt/cephfs \
 -o name=admin,secretfile=/etc/ceph/admin.secret
If clients with path restriction are used, the MDS capabilities need to include read access to the root directory. For example, a keyring may look as follows:
client.bar
  key: supersecretkey
  caps: [mds] allow rw path=/barjail, allow r path=/
  caps: [mon] allow r
  caps: [osd] allow rwx
     The allow r path=/ part means that path-restricted
     clients are able to see the root volume, but cannot write to it. This may
     be an issue for use cases where complete isolation is a requirement.
    
23.2 Unmounting CephFS #
   To unmount the CephFS, use the umount command:
  
# umount /mnt/cephfs
23.3 Mounting CephFS in /etc/fstab #
   To mount CephFS automatically upon client start-up, insert the
   corresponding line in its file systems table
   /etc/fstab:
  
mon1:6790,mon2:/subdir /mnt/cephfs ceph name=admin,secretfile=/etc/ceph/secret.key,noatime,_netdev 0 2
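If client setup is automated, the entry should be added only once, even when the script runs repeatedly. The following sketch shows one possible approach. The function fstab_add_once is a hypothetical helper, not part of any Ceph or SUSE package.

```shell
#!/bin/sh
# Hypothetical helper: append a line to an fstab-style file unless an
# identical line is already present.
# Usage: fstab_add_once FILE ENTRY
fstab_add_once() {
    file=$1
    entry=$2
    # -x matches whole lines only, -F treats the entry as a fixed string.
    if ! grep -qxF -e "$entry" "$file" 2>/dev/null; then
        printf '%s\n' "$entry" >> "$file"
    fi
}
```

It is safer to try the function on a copy of /etc/fstab first and to verify the result with mount -a before rebooting the client.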
23.4 Multiple active MDS daemons (active-active MDS) #
CephFS is configured for a single active MDS daemon by default. To scale metadata performance for large-scale systems, you can enable multiple active MDS daemons, which will share the metadata workload with one another.
23.4.1 Using active-active MDS #
Consider using multiple active MDS daemons when your metadata performance is bottlenecked on the default single MDS.
Adding more daemons does not increase performance on all workload types. For example, a single application running on a single client will not benefit from an increased number of MDS daemons unless the application is doing a lot of metadata operations in parallel.
Workloads that typically benefit from a larger number of active MDS daemons are those with many clients, perhaps working on many separate directories.
23.4.2 Increasing the MDS active cluster size #
    Each CephFS file system has a max_mds setting, which
    controls how many ranks will be created. The actual number of ranks in the
    file system will only be increased if a spare daemon is available to take
    on the new rank. For example, if there is only one MDS daemon running and
    max_mds is set to two, no second rank will be created.
   
    In the following example, we set the max_mds option to 2
    to create a new rank apart from the default one. To see the changes, run
    ceph status before and after you set
    max_mds, and watch the line containing
    fsmap:
   
cephuser@adm > ceph status
  [...]
  services:
    [...]
    mds: cephfs-1/1/1 up {0=node2=up:active}, 1 up:standby
    [...]
cephuser@adm > ceph fs set cephfs max_mds 2
cephuser@adm > ceph status
  [...]
  services:
    [...]
    mds: cephfs-2/2/2 up {0=node2=up:active,1=node1=up:active}
    [...]
The newly created rank (1) passes through the 'creating' state and then enters its 'active' state.
Even with multiple active MDS daemons, a highly available system still requires standby daemons to take over if any of the servers running an active daemon fail.
     Consequently, the practical maximum of max_mds for highly
     available systems is one less than the total number of MDS servers in your
     system. To remain available in the event of multiple server failures,
     increase the number of standby daemons in the system to match the number
     of server failures you need to survive.
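The rule above is simple arithmetic: the highest sensible max_mds is the total number of MDS daemons minus the number of server failures to be survived. A minimal sketch follows. The helper name ha_max_mds is hypothetical, and it assumes one MDS daemon per server.

```shell
#!/bin/sh
# Hypothetical helper: largest max_mds that still leaves one standby
# daemon for each server failure to be survived.
# Usage: ha_max_mds TOTAL_MDS_DAEMONS FAILURES_TO_SURVIVE
ha_max_mds() {
    total=$1
    failures=$2
    result=$((total - failures))
    if [ "$result" -lt 1 ]; then
        echo "not enough MDS daemons for the requested redundancy" >&2
        return 1
    fi
    echo "$result"
}

# Example: with 3 MDS daemons and 1 tolerated failure,
# "ha_max_mds 3 1" prints 2.
```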
    
23.4.3 Decreasing the number of ranks #
    All ranks—including the ranks to be removed—must first be
    active. This means that you need to have at least max_mds
    MDS daemons available.
   
    First, set max_mds to a lower number. For example, go back
    to having a single active MDS:
   
cephuser@adm > ceph status
  [...]
  services:
    [...]
    mds: cephfs-2/2/2 up {0=node2=up:active,1=node1=up:active}
    [...]
cephuser@adm > ceph fs set cephfs max_mds 1
cephuser@adm > ceph status
  [...]
  services:
    [...]
    mds: cephfs-1/1/1 up {0=node2=up:active}, 1 up:standby
    [...]
23.4.4 Manually pinning directory trees to a rank #
In multiple active metadata server configurations, a balancer runs, which works to spread metadata load evenly across the cluster. This usually works well enough for most users, but sometimes it is desirable to override the dynamic balancer with explicit mappings of metadata to particular ranks. This can allow the administrator or users to evenly spread application load or limit impact of users' metadata requests on the entire cluster.
    The mechanism provided for this purpose is called an 'export pin'. It is an
    extended attribute of directories. The name of this extended attribute is
    ceph.dir.pin. Users can set this attribute using
    standard commands:
   
# setfattr -n ceph.dir.pin -v 2 /path/to/dir
    The value (-v) of the extended attribute is the rank to
    assign the directory sub-tree to. A default value of -1 indicates that the
    directory is not pinned.
   
A directory export pin is inherited from its closest parent with a set export pin. Therefore, setting the export pin on a directory affects all of its children. However, the parent's pin can be overridden by setting the child directory export pin. For example:
# mkdir -p a/b                      # "a" and "a/b" start with no export pin set.
# setfattr -n ceph.dir.pin -v 1 a/  # "a" and "b" are now pinned to rank 1.
# setfattr -n ceph.dir.pin -v 0 a/b # "a/b" is now pinned to rank 0
                                    # and "a/" and the rest of its children
                                    # are still pinned to rank 1.
23.5 Managing failover #
   If an MDS daemon stops communicating with the monitor, the monitor will wait
   mds_beacon_grace seconds (default 15 seconds) before
   marking the daemon as laggy. You can configure one or
   more 'standby' daemons that will take over during the MDS daemon failover.
  
23.5.1 Configuring standby replay #
Each CephFS file system may be configured to add standby-replay daemons. These standby daemons follow the active MDS's metadata journal to reduce failover time in the event that the active MDS becomes unavailable. Each active MDS may have only one standby-replay daemon following it.
Configure standby-replay on a file system with the following command:
cephuser@adm > ceph fs set FS-NAME allow_standby_replay BOOL
When set, the monitors will assign available standby daemons to follow the active MDSs in that file system.
When an MDS has entered the standby-replay state, it will only be used as a standby for the rank that it is following. If another rank fails, this standby-replay daemon will not be used as a replacement, even if no other standbys are available. For this reason, it is advised that if standby-replay is used then every active MDS should have a standby-replay daemon.
23.6 Setting CephFS quotas #
You can set quotas on any subdirectory of the Ceph file system. The quota restricts either the number of bytes or files stored beneath the specified point in the directory hierarchy.
23.6.1 CephFS quota limitations #
Using quotas with CephFS has the following limitations:
- Quotas are cooperative and non-competing.
  Ceph quotas rely on the client that is mounting the file system to stop writing to it when a limit is reached. The server part cannot prevent a malicious client from writing as much data as it needs. Do not use quotas to prevent filling the file system in environments where the clients are fully untrusted.
- Quotas are imprecise.
  Processes that are writing to the file system will be stopped shortly after the quota limit is reached. They will inevitably be allowed to write some amount of data over the configured limit. Client writers will be stopped within tens of seconds after crossing the configured limit.
- Quotas are implemented in the kernel client from version 4.17.
  Quotas are supported by the user space client (libcephfs, ceph-fuse). Linux kernel clients 4.17 and higher support CephFS quotas on SUSE Enterprise Storage 7.1 clusters. Kernel clients (even recent versions) will fail to handle quotas on older clusters, even if they are able to set the quotas extended attributes. SLE12-SP3 (and later) kernels already include the required backports to handle quotas.
- Configure quotas carefully when used with path-based mount restrictions.
  The client needs to have access to the directory inode on which quotas are configured in order to enforce them. If the client has restricted access to a specific path (for example, /home/user) based on the MDS capability, and a quota is configured on an ancestor directory they do not have access to (/home), the client will not enforce it. When using path-based access restrictions, be sure to configure the quota on the directory that the client can access (for example, /home/user or /home/user/quota_dir).
23.6.2 Configuring CephFS quotas #
You can configure CephFS quotas by using virtual extended attributes:
- ceph.quota.max_files
- Configures a file limit. 
- ceph.quota.max_bytes
- Configures a byte limit. 
If the attributes appear on a directory inode, a quota is configured there. If they are not present then no quota is set on that directory (although one may still be configured on a parent directory).
To set a 100 MB quota, run:
cephuser@mds > setfattr -n ceph.quota.max_bytes -v 100000000 /SOME/DIRECTORY
To set a quota of 10,000 files, run:
cephuser@mds > setfattr -n ceph.quota.max_files -v 10000 /SOME/DIRECTORY
To view the quota settings, run:
cephuser@mds > getfattr -n ceph.quota.max_bytes /SOME/DIRECTORY
cephuser@mds > getfattr -n ceph.quota.max_files /SOME/DIRECTORY
If the value of the extended attribute is '0', the quota is not set.
To remove a quota, run:
cephuser@mds > setfattr -n ceph.quota.max_bytes -v 0 /SOME/DIRECTORY
cephuser@mds > setfattr -n ceph.quota.max_files -v 0 /SOME/DIRECTORY
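Because ceph.quota.max_bytes takes a plain byte count, a small conversion helper can reduce typing mistakes with long numbers. The sketch below is an illustration under stated assumptions: size_to_bytes and quota_bytes_cmd are hypothetical helpers, the suffixes are interpreted as decimal (SI) multiples to match the 100 MB example above, and the setfattr command is printed rather than executed.

```shell
#!/bin/sh
# Hypothetical helper: convert a size with an optional decimal (SI)
# suffix K, M, G, or T to bytes, for example 100M -> 100000000.
size_to_bytes() {
    # Strip a single trailing suffix letter, if present.
    number=${1%[KMGT]}
    case $1 in
        *K) factor=1000 ;;
        *M) factor=1000000 ;;
        *G) factor=1000000000 ;;
        *T) factor=1000000000000 ;;
        *)  factor=1 ;;
    esac
    echo $((number * factor))
}

# Print (do not run) the setfattr command that sets a byte quota.
# Usage: quota_bytes_cmd SIZE DIRECTORY
quota_bytes_cmd() {
    printf 'setfattr -n ceph.quota.max_bytes -v %s %s\n' \
        "$(size_to_bytes "$1")" "$2"
}

# Example: "quota_bytes_cmd 100M /SOME/DIRECTORY" prints
#   setfattr -n ceph.quota.max_bytes -v 100000000 /SOME/DIRECTORY
```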
23.7 Managing CephFS snapshots #
CephFS snapshots create a read-only view of the file system at the point in time they are taken. You can create a snapshot in any directory. The snapshot will cover all data in the file system under the specified directory. After creating a snapshot, the buffered data is flushed out asynchronously from various clients. As a result, creating a snapshot is very fast.
If you have multiple CephFS file systems sharing a single pool (via name spaces), their snapshots will collide, and deleting one snapshot will result in missing file data for other snapshots sharing the same pool.
23.7.1 Creating snapshots #
The CephFS snapshot feature is enabled by default on new file systems. To enable it on existing file systems, run:
cephuser@adm > ceph fs set CEPHFS_NAME allow_new_snaps true
    After you enable snapshots, all directories in the CephFS will have a
    special .snap subdirectory.
   
     This is a virtual subdirectory. It does not appear in
     the directory listing of the parent directory, but the name
     .snap cannot be used as a file or directory name. To
     access the .snap directory one needs to explicitly
     access it, for example:
    
> ls -la /CEPHFS_MOUNT/.snap/
CephFS kernel clients have a limitation: they cannot handle more than 400 snapshots in a file system. The number of snapshots should always be kept below this limit, regardless of which client you are using. If using older CephFS clients, such as SLE12-SP3, keep in mind that going above 400 snapshots is harmful to operations as the client will crash.
     You may configure a different name for the snapshots subdirectory
     with the client snapdir setting.
    
    To create a snapshot, create a subdirectory under the
    .snap directory with a custom name. For example, to
    create a snapshot of the directory
    /CEPHFS_MOUNT/2/3/, run:
   
> mkdir /CEPHFS_MOUNT/2/3/.snap/CUSTOM_SNAPSHOT_NAME
23.7.2 Deleting snapshots #
    To delete a snapshot, remove its subdirectory inside the
    .snap directory:
   
> rmdir /CEPHFS_MOUNT/2/3/.snap/CUSTOM_SNAPSHOT_NAME
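Because snapshots are created and deleted with ordinary directory operations, they are easy to script. The following sketch assumes the default .snap directory name. The function names and the timestamp-based naming scheme are hypothetical choices for illustration.

```shell
#!/bin/sh
# Hypothetical helpers; DIRECTORY is any directory inside a mounted
# CephFS that uses the default ".snap" snapshot directory name.

# Create a snapshot named after the current UTC time and print its name.
# Usage: snapshot_create DIRECTORY
snapshot_create() {
    dir=$1
    name=snap-$(date -u +%Y%m%dT%H%M%SZ)
    mkdir "$dir/.snap/$name" && echo "$name"
}

# Delete the named snapshot of the directory.
# Usage: snapshot_delete DIRECTORY SNAPSHOT_NAME
snapshot_delete() {
    rmdir "$1/.snap/$2"
}
```

A caller can keep the printed name, for example name=$(snapshot_create /CEPHFS_MOUNT/2/3), and later pass it to snapshot_delete. Any such automation should keep the total number of snapshots in the file system below the limit described above.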