15 Clustered File System #
This chapter describes administration tasks that are normally performed after the cluster is set up and CephFS exported. If you need more information on setting up CephFS, refer to Book “Deployment Guide”, Chapter 11 “Installation of CephFS”.
15.1 Mounting CephFS #
When the file system is created and the MDS is active, you are ready to mount the file system from a client host.
15.1.1 Client Preparation #
If the client host is running SUSE Linux Enterprise 12 SP2 or SP3, you can skip this section as the system is ready to mount CephFS 'out of the box'.
If the client host is running SUSE Linux Enterprise 12 SP1, you need to apply all the latest patches before mounting CephFS.
In any case, everything needed to mount CephFS is included in SUSE Linux Enterprise. The SUSE Enterprise Storage 5.5 product is not needed.
To support the full mount syntax, the
ceph-common package (which is shipped with SUSE Linux Enterprise) should
be installed before trying to mount CephFS.
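If the package is not yet installed, one straightforward way to add it on SUSE Linux Enterprise is via zypper:
root # zypper install ceph-common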
15.1.2 Create a Secret File #
The Ceph cluster runs with authentication turned on by default. You should create a file that stores your secret key (not the keyring itself). To obtain the secret key for a particular user and then create the file, do the following:
Procedure 15.1: Creating a Secret Key #
View the key for the particular user in a keyring file:
cephadm > cat /etc/ceph/ceph.client.admin.keyring
Copy the key of the user who will be using the mounted CephFS file system. Usually, the key looks similar to the following:
AQCj2YpRiAe6CxAA7/ETt7Hcl9IyxyYciVs47w==
Create a file with the user name as part of the file name, for example /etc/ceph/admin.secret for the user admin.
Paste the key value into the file created in the previous step.
Set proper access rights on the file. The user should be the only one who can read the file; all other users should have no access rights at all.
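As a sketch, the above steps can also be performed non-interactively. Assuming the client.admin user and the /etc/ceph/admin.secret path from this example, the ceph auth get-key command prints only the secret key of the given user:
root # ceph auth get-key client.admin > /etc/ceph/admin.secret
root # chmod 600 /etc/ceph/admin.secret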
15.1.3 Mount CephFS #
You can mount CephFS with the mount command. You need to specify the monitor host name or IP address. Because cephx authentication is enabled by default in SUSE Enterprise Storage 5.5, you need to specify a user name and the related secret as well:
root # mount -t ceph ceph_mon1:6789:/ /mnt/cephfs \
-o name=admin,secret=AQATSKdNGBnwLhAAnNDKnH65FmVKpXZJVasUeQ==
As the previous command remains in the shell history, a more secure approach is to read the secret from a file:
root # mount -t ceph ceph_mon1:6789:/ /mnt/cephfs \
-o name=admin,secretfile=/etc/ceph/admin.secret
Note that the secret file should only contain the actual keyring secret. In our example, the file will then contain only the following line:
AQATSKdNGBnwLhAAnNDKnH65FmVKpXZJVasUeQ==
Tip: Specify Multiple Monitors
It is a good idea to specify multiple monitors separated by commas on the
mount command line in case one monitor happens to be
down at the time of mount. Each monitor address takes the form
host[:port]. If the port is not specified, it defaults
to 6789.
Create the mount point on the local host:
root # mkdir /mnt/cephfs
Mount the CephFS:
root # mount -t ceph ceph_mon1:6789:/ /mnt/cephfs \
-o name=admin,secretfile=/etc/ceph/admin.secret
A subdirectory subdir may be specified if a subset of
the file system is to be mounted:
root # mount -t ceph ceph_mon1:6789:/subdir /mnt/cephfs \
-o name=admin,secretfile=/etc/ceph/admin.secret
You can specify more than one monitor host in the mount
command:
root # mount -t ceph ceph_mon1,ceph_mon2,ceph_mon3:6789:/ /mnt/cephfs \
-o name=admin,secretfile=/etc/ceph/admin.secret
Important: Read Access to the Root Directory
If clients with path restriction are used, the MDS capabilities need to include read access to the root directory. For example, a keyring may look as follows:
client.bar
 key: supersecretkey
 caps: [mds] allow rw path=/barjail, allow r path=/
 caps: [mon] allow r
 caps: [osd] allow rwx
The allow r path=/ part means that path-restricted
clients are able to see the root volume, but cannot write to it. This may
be an issue for use cases where complete isolation is a requirement.
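As a sketch of how a keyring with such capabilities might be created in the first place, you can pass the capability strings to ceph auth get-or-create. The client name bar and the path /barjail below are simply the values from the example above:
cephadm > ceph auth get-or-create client.bar \
 mds 'allow rw path=/barjail, allow r path=/' \
 mon 'allow r' \
 osd 'allow rwx'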
15.2 Unmounting CephFS #
To unmount the CephFS, use the umount command:
root # umount /mnt/cephfs
15.3 CephFS in /etc/fstab #
To mount CephFS automatically upon client start-up, insert the
corresponding line in its file systems table
/etc/fstab:
mon1:6790,mon2:/subdir /mnt/cephfs ceph name=admin,secretfile=/etc/ceph/secret.key,noatime,_netdev 0 2
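To verify the new entry without rebooting, you can ask mount to process all file systems listed in /etc/fstab that are not yet mounted:
root # mount -a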
15.4 Multiple Active MDS Daemons (Active-Active MDS) #
CephFS is configured for a single active MDS daemon by default. To scale metadata performance for large-scale systems, you can enable multiple active MDS daemons, which will share the metadata workload with one another.
15.4.1 When to Use Active-Active MDS #
Consider using multiple active MDS daemons when your metadata performance is bottlenecked on the default single MDS.
Adding more daemons does not increase performance on all workload types. For example, a single application running on a single client will not benefit from an increased number of MDS daemons unless the application is doing a lot of metadata operations in parallel.
Workloads that typically benefit from a larger number of active MDS daemons are those with many clients, perhaps working on many separate directories.
15.4.2 Increasing the MDS Active Cluster Size #
Each CephFS file system has a max_mds setting, which
controls how many ranks will be created. The actual number of ranks in the
file system will only be increased if a spare daemon is available to take
on the new rank. For example, if there is only one MDS daemon running and
max_mds is set to two, no second rank will be created.
In the following example, we set the max_mds option to 2
to create a new rank apart from the default one. To see the changes, run
ceph status before and after you set
max_mds, and watch the line containing
fsmap:
cephadm > ceph status
  [...]
  services:
    [...]
    mds: cephfs-1/1/1 up {0=node2=up:active}, 1 up:standby
  [...]
cephadm > ceph mds set max_mds 2
cephadm > ceph status
  [...]
  services:
    [...]
    mds: cephfs-2/2/2 up {0=node2=up:active,1=node1=up:active}
  [...]
The newly created rank (1) passes through the 'creating' state and then enters the 'active' state.
Important: Standby Daemons
Even with multiple active MDS daemons, a highly available system still requires standby daemons to take over if any of the servers running an active daemon fail.
Consequently, the practical maximum of max_mds for highly
available systems is one less than the total number of MDS servers in your
system. To remain available in the event of multiple server failures,
increase the number of standby daemons in the system to match the number
of server failures you need to survive.
15.4.3 Decreasing the Number of Ranks #
All ranks—including the ranks to be removed—must first be
active. This means that you need to have at least max_mds
MDS daemons available.
First, set max_mds to a lower number. For example, go back
to having a single active MDS:
cephadm > ceph status
  [...]
  services:
    [...]
    mds: cephfs-2/2/2 up {0=node2=up:active,1=node1=up:active}
  [...]
cephadm > ceph mds set max_mds 1
cephadm > ceph status
  [...]
  services:
    [...]
    mds: cephfs-2/2/1 up {0=node2=up:active,1=node1=up:active}
  [...]
Note that we still have two active MDSs. The ranks still exist even though
we have decreased max_mds, because
max_mds only restricts the creation of new ranks.
Next, use the ceph mds deactivate
rank command to remove the unneeded
rank:
cephadm > ceph status
  [...]
  services:
    [...]
    mds: cephfs-2/2/1 up {0=node2=up:active,1=node1=up:active}
cephadm > ceph mds deactivate 1
telling mds.1:1 192.168.58.101:6805/2799214375 to deactivate
cephadm > ceph status
  [...]
  services:
    [...]
    mds: cephfs-2/2/1 up {0=node2=up:active,1=node1=up:stopping}
cephadm > ceph status
  [...]
  services:
    [...]
    mds: cephfs-1/1/1 up {0=node2=up:active}, 1 up:standby
The deactivated rank will first enter the 'stopping' state for a period of time, while it hands off its share of the metadata to the remaining active daemons. This phase can take from seconds to minutes. If the MDS appears to be stuck in the 'stopping' state, this should be investigated as a possible bug.
If an MDS daemon crashes or is terminated while in the 'stopping' state, a standby will take over and the rank will go back to 'active'. You can try to deactivate it again when it has come back up.
When a daemon finishes stopping, it will start again and go back to being a standby.
15.4.4 Manually Pinning Directory Trees to a Rank #
In multiple active metadata server configurations, a balancer runs, which works to spread metadata load evenly across the cluster. This usually works well enough for most users, but sometimes it is desirable to override the dynamic balancer with explicit mappings of metadata to particular ranks. This can allow the administrator or users to evenly spread application load or limit impact of users' metadata requests on the entire cluster.
The mechanism provided for this purpose is called an 'export pin'. It is an
extended attribute of directories. The name of this extended attribute is
ceph.dir.pin. Users can set this attribute using
standard commands:
root # setfattr -n ceph.dir.pin -v 2 /path/to/dir
The value (-v) of the extended attribute is the rank to
assign the directory sub-tree to. A default value of -1 indicates that the
directory is not pinned.
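Consequently, an existing pin can be removed by setting the attribute back to -1 (the path below is only a placeholder):
root # setfattr -n ceph.dir.pin -v -1 /path/to/dir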
A directory export pin is inherited from its closest parent with a set export pin. Therefore, setting the export pin on a directory affects all of its children. However, the parent's pin can be overridden by setting the child directory export pin. For example:
root # mkdir -p a/b # "a" and "a/b" start with no export pin set.
setfattr -n ceph.dir.pin -v 1 a/ # "a" and "b" are now pinned to rank 1.
setfattr -n ceph.dir.pin -v 0 a/b  # "a/b" is now pinned to rank 0
                                   # and "a/" and the rest of its children
                                   # are still pinned to rank 1.
15.5 Managing Failover #
If an MDS daemon stops communicating with the monitor, the monitor will wait
mds_beacon_grace seconds (default 15 seconds) before
marking the daemon as laggy. You can configure one or
more 'standby' daemons that will take over during the MDS daemon failover.
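As a sketch, the grace period can be adjusted in ceph.conf if the default is too short for your environment. The value of 30 seconds below is only an illustrative choice:
[global]
mds beacon grace = 30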
15.5.1 Configuring Standby Daemons #
There are several configuration settings that control how a daemon will
behave while in standby. You can specify them in the
ceph.conf on the host where the MDS daemon runs. The
daemon loads these settings when it starts, and sends them to the monitor.
By default, if none of these settings are used, all MDS daemons which do not hold a rank will be used as 'standbys' for any rank.
The settings which associate a standby daemon with a particular name or rank do not guarantee that the daemon will only be used for that rank. They mean that when several standbys are available, the associated standby daemon will be used. If a rank is failed, and a standby is available, it will be used even if it is associated with a different rank or named daemon.
- mds_standby_replay
If set to true, then the standby daemon will continuously read the metadata journal of an up rank. This will give it a warm metadata cache, and speed up the process of failing over if the daemon serving the rank fails.
An up rank may only have one standby replay daemon assigned to it. If two daemons are both set to be standby replay, then one of them will arbitrarily win, and the other will become a normal non-replay standby.
When a daemon has entered the standby replay state, it will only be used as a standby for the rank that it is following. If another rank fails, this standby replay daemon will not be used as a replacement, even if no other standbys are available.
- mds_standby_for_name
Set this to make the standby daemon only take over a failed rank if the last daemon to hold it matches this name.
- mds_standby_for_rank
Set this to make the standby daemon only take over the specified rank. If another rank fails, this daemon will not be used to replace it.
Use in conjunction with mds_standby_for_fscid to be specific about which file system's rank you are targeting in case of multiple file systems.
- mds_standby_for_fscid
If mds_standby_for_rank is set, this is simply a qualifier to say which file system's rank is being referred to.
If mds_standby_for_rank is not set, then setting FSCID will cause this daemon to target any rank in the specified FSCID. Use this if you have a daemon that you want to use for any rank, but only within a particular file system.
- mon_force_standby_active
This setting is used on monitor hosts. It defaults to true.
If it is false, then daemons configured with standby_replay=true will only become active if the rank/name that they have been configured to follow fails. On the other hand, if this setting is true, then a daemon configured with standby_replay=true may be assigned some other rank.
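As a sketch, this option would typically be set in the [mon] section of ceph.conf on the monitor hosts, for example to disable the default behavior:
[mon]
mon force standby active = false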
15.5.2 Examples #
Several example ceph.conf configurations follow. You
can either copy a ceph.conf with the configuration of
all daemons to all your servers, or you can have a different file on each
server that contains that server's daemon configuration.
15.5.2.1 Simple Pair #
Two MDS daemons 'a' and 'b' acting as a pair. Whichever one is not currently assigned a rank will be the standby replay follower of the other.
[mds.a]
mds standby replay = true
mds standby for rank = 0

[mds.b]
mds standby replay = true
mds standby for rank = 0
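As another sketch, a standby daemon can be tied to a specific daemon name instead of a rank by using mds_standby_for_name. The daemon names 'a' and 'c' below are only placeholders: daemon 'c' follows daemon 'a' in standby replay mode and only takes over a rank that was last held by 'a':
[mds.c]
mds standby replay = true
mds standby for name = a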