15 Clustered File System #
This chapter describes administration tasks that are normally performed after the cluster is set up and CephFS exported. If you need more information on setting up CephFS, refer to Book “Deployment Guide”, Chapter 11 “Installation of CephFS”.
15.1 Mounting CephFS #
When the file system is created and the MDS is active, you are ready to mount the file system from a client host.
15.1.1 Client Preparation #
If the client host is running SUSE Linux Enterprise 12 SP2 or SP3, you can skip this section as the system is ready to mount CephFS 'out of the box'.
If the client host is running SUSE Linux Enterprise 12 SP1, you need to apply all the latest patches before mounting CephFS.
In any case, everything needed to mount CephFS is included in SUSE Linux Enterprise. The SUSE Enterprise Storage 5.5 product is not needed.
To support the full mount syntax, the
ceph-common package (which is shipped with SUSE Linux Enterprise) should
be installed before trying to mount CephFS.
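If the package is not yet installed, one straightforward way to add it on SUSE Linux Enterprise is via zypper:
root # zypper install ceph-common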
15.1.2 Create a Secret File #
The Ceph cluster runs with authentication turned on by default. You should create a file that stores your secret key (not the keyring itself). To obtain the secret key for a particular user and then create the file, do the following:
Procedure 15.1: Creating a Secret Key #
View the key for the particular user in a keyring file:
cephadm > cat /etc/ceph/ceph.client.admin.keyring
Copy the key of the user who will be using the mounted CephFS file system. Usually, the key looks similar to the following:
AQCj2YpRiAe6CxAA7/ETt7Hcl9IyxyYciVs47w==
Create a file with the user name as part of the file name, for example /etc/ceph/admin.secret for the user admin.
Paste the key value into the file created in the previous step.
Set proper access rights on the file. The user should be the only one who can read the file; all other users should have no access rights at all.
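As a sketch, the above steps can also be performed non-interactively. Assuming the client.admin user and the /etc/ceph/admin.secret path from this example, the ceph auth get-key command prints only the secret key of the given user:
root # ceph auth get-key client.admin > /etc/ceph/admin.secret
root # chmod 600 /etc/ceph/admin.secret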
15.1.3 Mount CephFS #
You can mount CephFS with the mount command. You need to specify the monitor host name or IP address. Because cephx authentication is enabled by default in SUSE Enterprise Storage 5.5, you need to specify a user name and the related secret as well:
root # mount -t ceph ceph_mon1:6789:/ /mnt/cephfs \
-o name=admin,secret=AQATSKdNGBnwLhAAnNDKnH65FmVKpXZJVasUeQ==
As the previous command remains in the shell history, a more secure approach is to read the secret from a file:
root # mount -t ceph ceph_mon1:6789:/ /mnt/cephfs \
-o name=admin,secretfile=/etc/ceph/admin.secret
Note that the secret file should only contain the actual keyring secret. In our example, the file will then contain only the following line:
AQATSKdNGBnwLhAAnNDKnH65FmVKpXZJVasUeQ==
Tip: Specify Multiple Monitors
It is a good idea to specify multiple monitors separated by commas on the
mount command line in case one monitor happens to be
down at the time of mount. Each monitor address takes the form
host[:port]. If the port is not specified, it defaults
to 6789.
Create the mount point on the local host:
root # mkdir /mnt/cephfs
Mount the CephFS:
root # mount -t ceph ceph_mon1:6789:/ /mnt/cephfs \
-o name=admin,secretfile=/etc/ceph/admin.secret
A subdirectory subdir may be specified if a subset of
the file system is to be mounted:
root # mount -t ceph ceph_mon1:6789:/subdir /mnt/cephfs \
-o name=admin,secretfile=/etc/ceph/admin.secret
You can specify more than one monitor host in the mount
command:
root # mount -t ceph ceph_mon1,ceph_mon2,ceph_mon3:6789:/ /mnt/cephfs \
-o name=admin,secretfile=/etc/ceph/admin.secret
Important: Read Access to the Root Directory
If clients with path restriction are used, the MDS capabilities need to include read access to the root directory. For example, a keyring may look as follows:
client.bar
 key: supersecretkey
 caps: [mds] allow rw path=/barjail, allow r path=/
 caps: [mon] allow r
 caps: [osd] allow rwx
The allow r path=/ part means that path-restricted
clients are able to see the root volume, but cannot write to it. This may
be an issue for use cases where complete isolation is a requirement.
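As a sketch of how a keyring with such capabilities might be created in the first place, you can pass the capability strings to ceph auth get-or-create. The client name bar and the path /barjail below are simply the values from the example above:
cephadm > ceph auth get-or-create client.bar \
 mds 'allow rw path=/barjail, allow r path=/' \
 mon 'allow r' \
 osd 'allow rwx'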
15.2 Unmounting CephFS #
To unmount the CephFS, use the umount command:
root # umount /mnt/cephfs
15.3 CephFS in /etc/fstab #
To mount CephFS automatically upon client start-up, insert the
corresponding line in its file systems table
/etc/fstab:
mon1:6790,mon2:/subdir /mnt/cephfs ceph name=admin,secretfile=/etc/ceph/secret.key,noatime,_netdev 0 2
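To verify the new entry without rebooting, you can ask mount to process all file systems listed in /etc/fstab that are not yet mounted:
root # mount -a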
15.4 Multiple Active MDS Daemons (Active-Active MDS) #
CephFS is configured for a single active MDS daemon by default. To scale metadata performance for large-scale systems, you can enable multiple active MDS daemons, which will share the metadata workload with one another.
15.4.1 When to Use Active-Active MDS #
Consider using multiple active MDS daemons when your metadata performance is bottlenecked on the default single MDS.
Adding more daemons does not increase performance on all workload types. For example, a single application running on a single client will not benefit from an increased number of MDS daemons unless the application is doing a lot of metadata operations in parallel.
Workloads that typically benefit from a larger number of active MDS daemons are those with many clients, perhaps working on many separate directories.
15.4.2 Increasing the MDS Active Cluster Size #
Each CephFS file system has a max_mds setting, which
controls how many ranks will be created. The actual number of ranks in the
file system will only be increased if a spare daemon is available to take
on the new rank. For example, if there is only one MDS daemon running and
max_mds is set to two, no second rank will be created.
In the following example, we set the max_mds option to 2
to create a new rank apart from the default one. To see the changes, run
ceph status before and after you set
max_mds, and watch the line containing
fsmap:
cephadm > ceph status
  [...]
  services:
    [...]
    mds: cephfs-1/1/1 up {0=node2=up:active}, 1 up:standby
  [...]
cephadm > ceph mds set max_mds 2
cephadm > ceph status
  [...]
  services:
    [...]
    mds: cephfs-2/2/2 up {0=node2=up:active,1=node1=up:active}
  [...]
The newly created rank (1) passes through the 'creating' state and then enters the 'active' state.
Important: Standby Daemons
Even with multiple active MDS daemons, a highly available system still requires standby daemons to take over if any of the servers running an active daemon fail.
Consequently, the practical maximum of max_mds for highly
available systems is one less than the total number of MDS servers in your
system. To remain available in the event of multiple server failures,
increase the number of standby daemons in the system to match the number
of server failures you need to survive.
15.4.3 Decreasing the Number of Ranks #
All ranks—including the ranks to be removed—must first be
active. This means that you need to have at least max_mds
MDS daemons available.
First, set max_mds to a lower number. For example, go back
to having a single active MDS:
cephadm > ceph status
  [...]
  services:
    [...]
    mds: cephfs-2/2/2 up {0=node2=up:active,1=node1=up:active}
  [...]
cephadm > ceph mds set max_mds 1
cephadm > ceph status
  [...]
  services:
    [...]
    mds: cephfs-2/2/1 up {0=node2=up:active,1=node1=up:active}
  [...]
Note that we still have two active MDSs. The ranks still exist even though
we have decreased max_mds, because
max_mds only restricts the creation of new ranks.
Next, use the ceph mds deactivate
rank command to remove the unneeded
rank:
cephadm > ceph status
  [...]
  services:
    [...]
    mds: cephfs-2/2/1 up {0=node2=up:active,1=node1=up:active}
cephadm > ceph mds deactivate 1
telling mds.1:1 192.168.58.101:6805/2799214375 to deactivate
cephadm > ceph status
  [...]
  services:
    [...]
    mds: cephfs-2/2/1 up {0=node2=up:active,1=node1=up:stopping}
cephadm > ceph status
  [...]
  services:
    [...]
    mds: cephfs-1/1/1 up {0=node2=up:active}, 1 up:standby
The deactivated rank will first enter the 'stopping' state for a period of time, while it hands off its share of the metadata to the remaining active daemons. This phase can take from seconds to minutes. If the MDS appears to be stuck in the 'stopping' state, this should be investigated as a possible bug.
If an MDS daemon crashes or is terminated while in the 'stopping' state, a standby will take over and the rank will go back to 'active'. You can try to deactivate it again when it has come back up.
When a daemon finishes stopping, it will start again and go back to being a standby.
15.4.4 Manually Pinning Directory Trees to a Rank #
In multiple active metadata server configurations, a balancer runs, which works to spread metadata load evenly across the cluster. This usually works well enough for most users, but sometimes it is desirable to override the dynamic balancer with explicit mappings of metadata to particular ranks. This can allow the administrator or users to evenly spread application load or limit impact of users' metadata requests on the entire cluster.
The mechanism provided for this purpose is called an 'export pin'. It is an
extended attribute of directories. The name of this extended attribute is
ceph.dir.pin. Users can set this attribute using
standard commands:
root # setfattr -n ceph.dir.pin -v 2 /path/to/dir
The value (-v) of the extended attribute is the rank to
assign the directory sub-tree to. A default value of -1 indicates that the
directory is not pinned.
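Consequently, an existing pin can be removed by setting the attribute back to -1 (the path below is only a placeholder):
root # setfattr -n ceph.dir.pin -v -1 /path/to/dir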
A directory export pin is inherited from its closest parent with a set export pin. Therefore, setting the export pin on a directory affects all of its children. However, the parent's pin can be overridden by setting the child directory export pin. For example:
root # mkdir -p a/b # "a" and "a/b" start with no export pin set.
setfattr -n ceph.dir.pin -v 1 a/ # "a" and "b" are now pinned to rank 1.
setfattr -n ceph.dir.pin -v 0 a/b  # "a/b" is now pinned to rank 0
                                   # and "a/" and the rest of its children
                                   # are still pinned to rank 1.
15.5 Managing Failover #
If an MDS daemon stops communicating with the monitor, the monitor will wait
mds_beacon_grace seconds (default 15 seconds) before
marking the daemon as laggy. You can configure one or
more 'standby' daemons that will take over during the MDS daemon failover.
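As a sketch, the grace period can be adjusted in ceph.conf if the default is too short for your environment. The value of 30 seconds below is only an illustrative choice:
[global]
mds beacon grace = 30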
15.5.1 Configuring Standby Daemons #
There are several configuration settings that control how a daemon will
behave while in standby. You can specify them in the
ceph.conf on the host where the MDS daemon runs. The
daemon loads these settings when it starts, and sends them to the monitor.
By default, if none of these settings are used, all MDS daemons which do not hold a rank will be used as 'standbys' for any rank.
The settings which associate a standby daemon with a particular name or rank do not guarantee that the daemon will only be used for that rank. They mean that when several standbys are available, the associated standby daemon will be used. If a rank is failed, and a standby is available, it will be used even if it is associated with a different rank or named daemon.
- mds_standby_replay
If set to true, then the standby daemon will continuously read the metadata journal of an up rank. This will give it a warm metadata cache, and speed up the process of failing over if the daemon serving the rank fails.
An up rank may only have one standby replay daemon assigned to it. If two daemons are both set to be standby replay, then one of them will arbitrarily win, and the other will become a normal non-replay standby.
When a daemon has entered the standby replay state, it will only be used as a standby for the rank that it is following. If another rank fails, this standby replay daemon will not be used as a replacement, even if no other standbys are available.
- mds_standby_for_name
Set this to make the standby daemon only take over a failed rank if the last daemon to hold it matches this name.
- mds_standby_for_rank
Set this to make the standby daemon only take over the specified rank. If another rank fails, this daemon will not be used to replace it.
Use in conjunction with mds_standby_for_fscid to be specific about which file system's rank you are targeting in case of multiple file systems.
- mds_standby_for_fscid
If mds_standby_for_rank is set, this is simply a qualifier to say which file system's rank is being referred to.
If mds_standby_for_rank is not set, then setting FSCID will cause this daemon to target any rank in the specified FSCID. Use this if you have a daemon that you want to use for any rank, but only within a particular file system.
- mon_force_standby_active
This setting is used on monitor hosts. It defaults to true.
If it is false, then daemons configured with standby_replay=true will only become active if the rank/name that they have been configured to follow fails. On the other hand, if this setting is true, then a daemon configured with standby_replay=true may be assigned some other rank.
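As a sketch, this option would typically be set in the [mon] section of ceph.conf on the monitor hosts, for example to disable the default behavior:
[mon]
mon force standby active = false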
15.5.2 Examples #
Several example ceph.conf configurations follow. You
can either copy a ceph.conf with the configuration of
all daemons to all your servers, or you can have a different file on each
server that contains that server's daemon configuration.
15.5.2.1 Simple Pair #
Two MDS daemons 'a' and 'b' acting as a pair. Whichever one is not currently assigned a rank will be the standby replay follower of the other.
[mds.a]
mds standby replay = true
mds standby for rank = 0

[mds.b]
mds standby replay = true
mds standby for rank = 0
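As another sketch, a standby daemon can be tied to a specific daemon name instead of a rank by using mds_standby_for_name. The daemon names 'a' and 'c' below are only placeholders: daemon 'c' follows daemon 'a' in standby replay mode and only takes over a rank that was last held by 'a':
[mds.c]
mds standby replay = true
mds standby for name = a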