5 Slurm — utility for HPC workload management #
Slurm is a workload manager for scheduling and managing compute jobs on High Performance Computing (HPC) clusters. It can start multiple jobs on a single node, or a single job on multiple nodes. Additional components can be used for advanced scheduling and accounting.
The mandatory components of Slurm are the control daemon
slurmctld, which handles job scheduling, and the
Slurm daemon slurmd, responsible for launching compute
jobs. Nodes running slurmctld are called
management servers and nodes running
slurmd are called compute nodes.
Additional components are a secondary slurmctld that acts as a standby server for failover, and the Slurm database daemon slurmdbd, which stores the job history and user hierarchy.
For further documentation, see the Quick Start Administrator Guide and Quick Start User Guide. There is further in-depth documentation on the Slurm documentation page.
5.1 Installing Slurm #
These instructions describe a minimal installation of Slurm with one management server and multiple compute nodes.
5.1.1 Minimal installation #
For security reasons, Slurm does not run as the user
root, but under its own
user. It is important that the user
slurm has the same UID/GID
across all nodes of the cluster.
If this user/group does not exist, the package slurm creates this user and group when it is installed. However, this does not guarantee that the generated UIDs/GIDs will be identical on all systems.
Therefore, we strongly advise you to create the user/group
slurm before installing
slurm. If you are using a network directory service
such as LDAP for user and group management, you can use it to provide the
slurm user/group as well.
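For example, a minimal way to create a matching account locally on every node is shown below. The UID/GID value 450, home directory and shell are example values only; pick an ID that is unused and identical on all nodes:
   management# groupadd -r -g 450 slurm
   management# useradd -r -u 450 -g slurm -c "SLURM workload manager" \
     -d /var/lib/slurm -s /bin/false slurm
Run the same commands with the same IDs on the management server and on every compute node, or provide the account through LDAP as described above.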
It is strongly recommended that all compute nodes share common user home directories. These should be provided through network storage.
1. On the management server, install the slurm package:
   management# zypper in slurm
2. On the compute nodes, install the slurm-node package:
   node1# zypper in slurm-node
3. On the management server and the compute nodes, the package munge is installed automatically. Configure, enable and start MUNGE on the management server and compute nodes as described in Section 3.4, “MUNGE authentication”. Ensure that the same MUNGE key is shared across all nodes.
4. On the management server, edit the main configuration file /etc/slurm/slurm.conf:
   a. Configure the parameter SlurmctldHost=SLURMCTLD_HOST with the host name of the management server. To find the correct host name, run hostname -s on the management server.
   b. Under the COMPUTE NODES section, add the following lines to define the compute nodes:
      NodeName=NODE_LIST State=UNKNOWN
      PartitionName=normal Nodes=NODE_LIST Default=YES MaxTime=24:00:00 State=UP
      Replace NODE_LIST with the host names of the compute nodes, either comma-separated or as a range (for example: node[1-100]).
      The NodeName line also allows specifying additional parameters for the nodes, such as Boards, SocketsPerBoard, CoresPerSocket, ThreadsPerCore, or CPUs. The actual values of these can be obtained by running the following command on the compute nodes (see the example after this procedure):
      node1# slurmd -C
5. Copy the modified configuration file /etc/slurm/slurm.conf from the management server to all compute nodes:
   management# scp /etc/slurm/slurm.conf COMPUTE_NODE:/etc/slurm/
6. On the management server, start slurmctld and enable it so that it starts on every boot:
   management# systemctl enable --now slurmctld.service
7. On each compute node, start slurmd and enable it so that it starts on every boot:
   node1# systemctl enable --now slurmd.service
8. Check the status and availability of the compute nodes by running the sinfo command. You should see output similar to the following:
   PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
   normal*      up 1-00:00:00      2   idle node[01-02]
   If the node state is not idle, see Section 5.4, “Frequently asked questions”.
9. Test the Slurm installation by running the following command:
   management# srun sleep 30
   This runs the sleep command on a free compute node for 30 seconds.
10. In another shell, run the squeue command during the 30 seconds that the compute node is asleep. You should see output similar to the following:
   JOBID PARTITION     NAME     USER ST  TIME  NODES NODELIST(REASON)
       1    normal    sleep     root  R  0:05      1 node02
11. Create the following shell script and save it as sleeper.sh:
   #!/bin/bash
   echo "started at $(date)"
   sleep 30
   echo "finished at $(date)"
12. Run the shell script in the queue:
   management# sbatch sleeper.sh
   The shell script is executed when enough resources are available, and the output is stored in the file slurm-${JOBNR}.out.
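As referenced in Step 4 above, the values reported by slurmd -C can be copied into the NodeName line. For example, if slurmd -C reports one board with two sockets, eight cores per socket and two threads per core (hypothetical values), the node definition could look like this:
   NodeName=node[1-100] Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
Keeping these values in sync with the real hardware is important: if a node registers with fewer resources than configured, slurmctld may set it to a drained state.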
5.1.2 Installing the Slurm database #
In a minimal installation, Slurm only stores information about pending and running jobs. To store data about finished and failed jobs, the accounting storage plugin must be installed and enabled. You can also enable fair-share scheduling, which replaces FIFO (first in, first out) scheduling with algorithms that calculate a job's priority based on the jobs a user has run in the past.
The Slurm database has two components: the slurmdbd
daemon itself, and an SQL database. MariaDB is recommended. The
database can be installed on the same node that runs slurmdbd,
or on a separate node. For a minimal setup, all these services run on the
management server.
Before you begin, make sure Slurm is installed as described in Section 5.1.1, “Minimal installation”.
If you want to use an external SQL database (or you already have a database installed on the management server), you can skip Step 1 and Step 2.
1. Install the MariaDB SQL database:
   management# zypper install mariadb
2. Start and enable MariaDB:
   management# systemctl enable --now mariadb
3. Secure the database:
   management# mysql_secure_installation
4. Connect to the SQL database:
   management# mysql -u root -p
5. Create the Slurm database user and grant it permissions for the Slurm database, which will be created later:
   mysql> create user 'slurm'@'localhost' identified by 'PASSWORD';
   mysql> grant all on slurm_acct_db.* TO 'slurm'@'localhost';
   You can choose to use a different user name or database name. In this case, you must also change the corresponding values in the /etc/slurm/slurmdbd.conf file later.
6. Exit the database:
   mysql> exit
7. Install the slurmdbd package:
   management# zypper in slurm-slurmdbd
8. Edit the /etc/slurm/slurmdbd.conf file so that the daemon can access the database. Change the following line to the password that you used in Step 5:
   StoragePass=password
If the database is on a different node, or if you chose a different user name or database name, you must also modify the following lines:
   StorageUser=slurm
   StorageLoc=slurm_acct_db
   DbdAddr=localhost
   DbdHost=localhost
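   For example, if the database runs on a separate host named db1 and you created the user and database with the names from Step 5, the storage-related settings in /etc/slurm/slurmdbd.conf would look similar to the following (db1 is an example host name; StorageHost selects the host that runs the SQL database):
      StorageHost=db1
      StorageUser=slurm
      StoragePass=PASSWORD
      StorageLoc=slurm_acct_db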
9. Start and enable slurmdbd:
   management# systemctl enable --now slurmdbd
   The first start of slurmdbd might take some time.
10. To enable accounting, edit the /etc/slurm/slurm.conf file to add the connection between slurmctld and the slurmdbd daemon. Ensure that the following lines appear as shown:
   JobAcctGatherType=jobacct_gather/linux
   JobAcctGatherFrequency=30
   AccountingStorageType=accounting_storage/slurmdbd
   AccountingStorageHost=localhost
   This example assumes that slurmdbd is running on the same node as slurmctld. If not, change localhost to the host name or IP address of the node where slurmdbd is running.
11. Make sure slurmdbd is running before you continue:
   management# systemctl status slurmdbd
   If you restart slurmctld before slurmdbd is running, slurmctld fails because it cannot connect to the database.
12. Restart slurmctld:
   management# systemctl restart slurmctld
   This creates the Slurm database and adds the cluster to the database (using the ClusterName from /etc/slurm/slurm.conf).
13. (Optional) By default, Slurm does not take any group membership into account, and the system groups cannot be mapped to Slurm. However, you can mimic system groups with accounts. In Slurm, accounts are usually entities billed for cluster usage, while users identify individual cluster users. Multiple users can be associated with a single account.
   The following example creates an umbrella group bavaria for two subgroups called nuremberg and munich:
   management# sacctmgr add account bavaria \
     Description="umbrella group for subgroups" Organization=bavaria
   management# sacctmgr add account nuremberg,munich parent=bavaria \
     Description="subgroup" Organization=bavaria
   The following example adds a user called tux to the subgroup nuremberg:
   management# sacctmgr add user tux Account=nuremberg
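   To verify the resulting hierarchy, you can list the associations as a tree. The format columns below are only a suggestion:
   management# sacctmgr show associations tree format=Account,User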
5.2 Slurm administration commands #
This section lists some useful options for common Slurm commands. For more
information and a full list of options, see the man page
for each command. For more Slurm commands, see
https://slurm.schedmd.com/man_index.html.
5.2.1 scontrol #
The command scontrol is used to show and update the
entities of Slurm, such as the state of the compute nodes or compute jobs.
It can also be used to reboot or to propagate configuration changes to the
compute nodes.
Useful options for this command are --details, which
prints more verbose output, and --oneliner, which forces
the output onto a single line, making it easier to use in shell scripts.
For more information, see man scontrol.
scontrol show ENTITY
   Displays the state of the specified ENTITY.
scontrol update SPECIFICATION
   Updates the SPECIFICATION, such as a compute node or compute node state. Useful SPECIFICATION states that can be set for compute nodes include:
   nodename=NODE state=down reason=REASON
      Removes all jobs from the compute node, and aborts any jobs already running on the node.
   nodename=NODE state=drain reason=REASON
      Drains the compute node so that no new jobs can be scheduled on it, but does not end compute jobs already running on the compute node. REASON can be any string. The compute node stays in the drained state and must be returned to the idle state manually.
   nodename=NODE state=resume
      Marks the compute node as ready to return to the idle state.
   jobid=JOBID REQUIREMENT=VALUE
      Updates the given requirement, such as NumNodes, with a new value. This command can also be run as a non-privileged user.
scontrol reconfigure
   Triggers a reload of the configuration file slurm.conf on all compute nodes.
scontrol reboot NODELIST
   Reboots a compute node, or a group of compute nodes, when the jobs on them finish. To use this command, the option RebootProgram="/sbin/reboot" must be set in slurm.conf. If the reboot of a compute node takes more than 60 seconds, you can set a higher value in slurm.conf, such as ResumeTimeout=300.
5.2.2 sinfo #
The command sinfo retrieves information about the state
of the compute nodes, and can be used for a fast overview of the cluster
health. For more information, see man sinfo.
--dead
   Displays information about unresponsive nodes.
--long
   Shows more detailed information.
--reservation
   Prints information about advanced reservations.
-R
   Displays the reason a node is in the down, drained, or failing state.
--state=STATE
   Limits the output only to nodes with the specified STATE.
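For example, to combine the options above and show detailed information about all idle nodes:
   management# sinfo --long --state=idle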
5.2.3 sacctmgr and sacct #
These commands are used for managing accounting. For more information, see
man sacctmgr and man sacct.
sacctmgr
   Used for job accounting in Slurm. To use this command, the service slurmdbd must be set up. See Section 5.1.2, “Installing the Slurm database”.
sacct
   Displays the accounting data if accounting is enabled. Useful options include:
   --allusers
      Shows accounting data for all users.
   --accounts=NAME
      Shows only jobs that belong to the specified account(s).
   --starttime=MM/DD[/YY]-HH:MM[:SS]
      Shows only jobs that started after the specified time. You can use just MM/DD or HH:MM. If no time is given, the command defaults to 00:00, which means that only jobs from today are shown.
   --endtime=MM/DD[/YY]-HH:MM[:SS]
      Accepts the same format as --starttime. If no time is given, the time when the command was issued is used.
   --name=NAME
      Limits output to jobs with the given NAME.
   --partition=PARTITION
      Shows only jobs that ran in the specified PARTITION.
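For example, to show the accounting data of all users' jobs in the normal partition since June 1 of the current year (example date and partition name):
   management# sacct --allusers --partition=normal --starttime=06/01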
5.2.4 sbatch, salloc, and srun #
These commands are used to schedule compute jobs:
batch scripts with the sbatch command,
interactive sessions with the salloc command, and
binaries with the srun command. If a job cannot be
scheduled immediately, only sbatch places it into the queue.
For more information, see man sbatch,
man salloc, and man srun.
-n COUNT_TASKS
   Specifies the number of tasks (processes) needed by the job. With the default of one CPU per task, this corresponds to the number of CPUs or threads used. The tasks can be allocated on different nodes.
-N MINCOUNT_NODES[-MAXCOUNT_NODES]
   Sets the number of compute nodes required for a job. The MAXCOUNT_NODES number can be omitted.
--time TIME
   Specifies the maximum wall clock time (runtime) after which a job is terminated. The format of TIME is either minutes or [HH:]MM:SS. Not to be confused with CPU time, which is the wall clock time multiplied by the number of threads used.
--signal [B:]NUMBER[@TIME]
   Sends the signal specified by NUMBER 60 seconds before the end of the job, unless TIME is specified. The signal is sent to every process on every node. If the signal should only be sent to the controlling batch job, you must specify the B: flag.
--job-name NAME
   Sets the name of the job to NAME in the queue.
--array=RANGEINDEX
   Executes the given script via sbatch for each index in RANGEINDEX with the same parameters.
--dependency=STATE:JOBID
   Defers the job until the specified STATE of the job JOBID is reached.
--gres=GRES
   Runs the job only on nodes with the specified generic resource (GRes), for example a GPU, specified by the value of GRES.
--licenses=NAME[:COUNT]
   The job must have the specified number (COUNT) of licenses with the name NAME. A license is the opposite of a generic resource: it is not tied to a node, but is a cluster-wide variable.
--mem=MEMORY
   Sets the real MEMORY required by the job per node. To use this option, memory control must be enabled. The default unit for the MEMORY value is megabytes, but you can also use K for kilobytes, M for megabytes, G for gigabytes, or T for terabytes.
--mem-per-cpu=MEMORY
   Takes the same values as --mem, but defines memory on a per-CPU basis rather than a per-node basis.
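The sbatch command also reads these options from #SBATCH lines at the top of a batch script. A short example script combining several of the options above (all values are illustrative):
   #!/bin/bash
   #SBATCH --job-name=sleep_test
   #SBATCH -N 2
   #SBATCH -n 16
   #SBATCH --time=00:30:00
   #SBATCH --mem=2G
   srun hostname
Submit the script with sbatch and monitor it with squeue, as shown in Section 5.1.1, “Minimal installation”.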
5.3 Upgrading Slurm #
For existing products under general support, version upgrades of Slurm are
provided regularly. Unlike maintenance updates, these upgrades are not
installed automatically using zypper patch but require
you to request their installation explicitly. This ensures that these
upgrades are not installed unintentionally and gives you the opportunity
to plan version upgrades beforehand.
zypper up is not recommended
On systems running Slurm, updating packages with zypper up
is not recommended. zypper up attempts to update all installed
packages to the latest version, so might install a new major version of Slurm
outside of planned Slurm upgrades.
Use zypper patch instead, which only updates packages to the
latest bug fix version.
5.3.1 Slurm upgrade workflow #
Interoperability is guaranteed between three consecutive versions of Slurm, with the following restrictions:
- The version of slurmdbd must be identical to or higher than the version of slurmctld.
- The version of slurmctld must be identical to or higher than the version of slurmd.
- The version of slurmd must be identical to or higher than the version of the Slurm user applications.
Or in short:
version(slurmdbd) >= version(slurmctld) >= version(slurmd) >= version(Slurm user CLIs).
Slurm uses a segmented version number: the first two segments denote the
major version, and the final segment denotes the patch level.
Upgrade packages (that is, packages that were not initially supplied with
the module or service pack) have their major version encoded in the package
name (with periods . replaced by underscores
_). For example, for version 23.02, this would be
slurm_23_02-*.rpm. To find out the latest version of Slurm,
check the SUSE Customer Center, or run zypper search -v slurm on a node.
With each version, configuration options for
slurmctld, slurmd, or
slurmdbd might be deprecated. While deprecated, they
remain valid for this version and the two consecutive versions, but they might
be removed later. Therefore, it is advisable to update the configuration files
after the upgrade and replace deprecated configuration options before the
final restart of a service.
A new major version of Slurm introduces a new version of
libslurm. Older versions of this library might not work
with an upgraded Slurm. An upgrade is provided for all SUSE Linux Enterprise software that
depends on libslurm. It is strongly recommended to rebuild
local applications using libslurm, such as MPI libraries
with Slurm support, as early as possible. This might require updating the
user applications, as new arguments might be introduced to existing functions.
Important: Upgrade slurmdbd databases before other Slurm components
If slurmdbd is used, always upgrade the
slurmdbd database before starting
the upgrade of any other Slurm component. The same database can be connected
to multiple clusters and must be upgraded before all of them.
Upgrading other Slurm components before the database can lead to data loss.
5.3.2 Upgrading the slurmdbd database daemon #
When upgrading slurmdbd,
the database is converted when the new version of
slurmdbd starts for the first time. If the
database is big, the conversion could take several tens of minutes. During
this time, the database is inaccessible.
It is highly recommended to create a backup of the database in case an
error occurs during or after the upgrade process. Without a backup,
all accounting data collected in the database might be lost if an error
occurs or the upgrade is rolled back. A database
converted to a newer version cannot be converted back to an older version,
and older versions of slurmdbd do not recognize the
newer formats.
Note: Convert the primary slurmdbd first
If you are using a backup slurmdbd, the conversion must
be performed on the primary slurmdbd first. The backup
slurmdbd only starts after the conversion is complete.
Procedure: Upgrading the slurmdbd database daemon #
1. Stop the slurmdbd service:
   DBnode# rcslurmdbd stop
2. Ensure that slurmdbd is no longer running:
   DBnode# rcslurmdbd status
   slurmctld might remain running while the database daemon is down. During this time, requests intended for slurmdbd are queued internally. The DBD Agent Queue size is limited, however, and should therefore be monitored with sdiag.
3. Create a backup of the slurm_acct_db database:
   DBnode# mysqldump -p slurm_acct_db > slurm_acct_db.sql
   If needed, this can be restored with the following command:
   DBnode# mysql -p slurm_acct_db < slurm_acct_db.sql
4. During the database conversion, the variable innodb_buffer_pool_size must be set to a value of 128 MB or more. Check the current size:
   DBnode# echo 'SELECT @@innodb_buffer_pool_size/1024/1024;' | \
     mysql --password --batch
   If the value of innodb_buffer_pool_size is less than 128 MB, you can change it for the duration of the current session (on mariadb):
   DBnode# echo 'set GLOBAL innodb_buffer_pool_size = 134217728;' | \
     mysql --password --batch
   Alternatively, to permanently change the size, edit the /etc/my.cnf file, set innodb_buffer_pool_size to 128 MB, then restart the database:
   DBnode# rcmysql restart
5. If you need to update MariaDB, run the following command:
   DBnode# zypper update mariadb
6. Convert the database tables to the new version of MariaDB:
   DBnode# mysql_upgrade --user=root --password=ROOT_DB_PASSWORD
7. Install the new version of slurmdbd:
   DBnode# zypper install --force-resolution slurm_VERSION-slurmdbd
8. Rebuild the database. If you are using a backup slurmdbd, perform this step on the primary slurmdbd first.
   Because the conversion might take a considerable amount of time, the systemd service might time out during the conversion. Therefore, we recommend performing the migration manually by running slurmdbd from the command line in the foreground:
   DBnode# /usr/sbin/slurmdbd -D -v
   When you see the following message, you can shut down slurmdbd by pressing Ctrl-C:
   Conversion done: success!
9. Before restarting the service, remove or replace any deprecated configuration options. Check the deprecated options in the Release Notes.
10. Restart slurmdbd:
   DBnode# systemctl start slurmdbd
Note: No daemonization during rebuild
During the rebuild of the Slurm database, the database daemon does not daemonize.
5.3.3 Upgrading slurmctld and slurmd #
After the Slurm database is upgraded, the slurmctld and
slurmd instances can be upgraded. It is recommended to
update the management servers and compute nodes all at once.
If this is not feasible, the compute nodes (slurmd) can
be updated on a node-by-node basis. However, the management servers
(slurmctld) must be updated first.
If slurmdbd is used, first upgrade the Slurm database as described in Section 5.3.2, “Upgrading the slurmdbd database daemon”. Upgrading other Slurm components before the database can lead to data loss.
This procedure assumes that MUNGE authentication is used and that pdsh, the pdsh Slurm plugin, and mrsh can access all of the machines in the cluster. If this is not the case, install pdsh by running zypper in pdsh-slurm.
If mrsh is not used in the cluster, the ssh back-end for pdsh can be used instead. Replace the option -R mrsh with -R ssh in the pdsh commands below. This is less scalable and you might run out of usable ports.
Procedure: Upgrading slurmctld and slurmd #
1. Back up the configuration file /etc/slurm/slurm.conf. Because this file should be identical across the entire cluster, it is sufficient to do so only on the main management server.
2. On the main management server, edit /etc/slurm/slurm.conf and set SlurmdTimeout and SlurmctldTimeout to sufficiently high values to avoid timeouts while slurmctld and slurmd are down:
   SlurmctldTimeout=3600
   SlurmdTimeout=3600
   We recommend at least 60 minutes (3600), and more for larger clusters.
3. Copy the updated /etc/slurm/slurm.conf from the management server to all nodes:
   a. Obtain the list of partitions in /etc/slurm/slurm.conf.
   b. Copy the updated configuration to the compute nodes:
      management# cp /etc/slurm/slurm.conf /etc/slurm/slurm.conf.update
      management# sudo -u slurm /bin/bash -c 'cat /etc/slurm/slurm.conf.update | \
        pdsh -R mrsh -P PARTITIONS "cat > /etc/slurm/slurm.conf"'
      management# rm /etc/slurm/slurm.conf.update
   c. Reload the configuration file on all compute nodes:
      management# scontrol reconfigure
   d. Verify that the reconfiguration took effect:
      management# scontrol show config | grep Timeout
4. Shut down all running slurmctld instances, first on any backup management servers, and then on the main management server:
   management# systemctl stop slurmctld
5. Back up the slurmctld state files.
   slurmctld maintains persistent state information. Almost every major version involves changes to the slurmctld state files. This state information is upgraded if the upgrade remains within the supported version range, and no data is lost.
   However, if a downgrade is necessary, state information from newer versions is not recognized by an older version of slurmctld and is discarded, resulting in a loss of all running and pending jobs. Therefore, back up the old state in case an update needs to be rolled back.
   a. Determine the StateSaveLocation directory:
      management# scontrol show config | grep StateSaveLocation
   b. Create a backup of the content of this directory. If a downgrade is required, restore the content of the StateSaveLocation directory from this backup.
6. Shut down slurmd on the compute nodes:
   management# pdsh -R ssh -P PARTITIONS systemctl stop slurmd
7. Upgrade slurmctld on the main and backup management servers:
   management# zypper install --force-resolution slurm_VERSION
   Important: Upgrade all Slurm packages at the same time
   If any additional Slurm packages are installed, you must upgrade those as well. This includes:
slurm-pam_slurm
slurm-sview
perl-slurm
slurm-lua
slurm-torque
slurm-config-man
slurm-doc
slurm-webdoc
slurm-auth-none
pdsh-slurm
   All Slurm packages must be upgraded at the same time to avoid conflicts between packages of different versions. This can be done by adding them to the zypper install command line described above.
8. Upgrade slurmd on the compute nodes:
   management# pdsh -R ssh -P PARTITIONS \
     zypper install --force-resolution slurm_VERSION-node
   Note: Memory size seen by slurmd might change on update
   Under certain circumstances, the amount of memory seen by slurmd might change after an update. If this happens, slurmctld puts the nodes into the drained state. To check whether the amount of memory seen by slurmd changed after the update, run the following command on a single compute node:
   node1# slurmd -C
   Compare the output with the settings in slurm.conf. If required, correct the setting.
9. Before restarting the services, remove or replace any deprecated configuration options. Check the deprecated options in the Release Notes.
If you replace deprecated options in the configuration files, these configuration files can be distributed to all management servers and compute nodes in the cluster by using the method described in Step 3.
10. Restart slurmd on all compute nodes:
   management# pdsh -R ssh -P PARTITIONS systemctl start slurmd
11. Restart slurmctld on the main and backup management servers:
   management# systemctl start slurmctld
12. Check the status of the management servers. On the main and backup management servers, run the following command:
   management# systemctl status slurmctld
   Verify that the services are running without errors. Run the following command to check whether there are any down, drained, failing, or failed nodes:
   management# sinfo -R
13. Restore the original values of SlurmdTimeout and SlurmctldTimeout in /etc/slurm/slurm.conf, then copy the restored configuration to all nodes by using the method described in Step 3.
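As an optional cross-check after the upgrade, you can confirm that the controller and the compute nodes report the expected Slurm version:
   management# scontrol show config | grep SLURM_VERSION
   node1# slurmd -V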
5.4 Frequently asked questions #
1. How do I change the state of a node from down to up?
   When the slurmd daemon on a node does not reboot within the time specified in the ResumeTimeout parameter, or the ReturnToService parameter was not changed in the configuration file slurm.conf, compute nodes stay in the down state and must be set back to the up state manually. This can be done for the NODE with the following command:
   management# scontrol update state=resume NodeName=NODE
2. What is the difference between the states down and down*?
   A * shown after a state code means that the node is not responding.
   When a node is marked as down*, the node is not reachable because of network issues, or slurmd is not running on that node.
   In the down state, the node is reachable, but either the node was rebooted unexpectedly, the hardware does not match the description in slurm.conf, or a health check configured with the HealthCheckProgram has set the node down.
3. How do I get the exact core count, socket number, and number of CPUs for a node?
   To find the node values that go into the configuration file slurm.conf, run the following command:
   node1# slurmd -C