6 Monitoring and logging
Maintaining an overview of the status and health of a cluster's compute nodes helps ensure smooth operation. This chapter describes tools that give an administrator an overview of the current cluster status, collect system logs, and gather information on certain system failure conditions.
6.1 ConMan — the console manager
ConMan is a serial console management program designed to support many console devices and simultaneous users. It supports:
- local serial devices 
- remote terminal servers (via the telnet protocol) 
- IPMI Serial-Over-LAN (via FreeIPMI) 
- Unix domain sockets 
- external processes (for example, using expect scripts for telnet, ssh, or ipmi-sol connections)
ConMan can be used for monitoring, logging, and optionally timestamping console device output.
   To install ConMan, run zypper in conman.
  
conmand sends unencrypted data
    The daemon conmand sends unencrypted data over the network, and its connections are not authenticated. Therefore, use it only locally, with the daemon listening on localhost. The IPMI console, however, does offer encryption, which makes conman a good tool for monitoring many such consoles.
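
The recommendation to listen on localhost only can be expressed in the server configuration. The following is a minimal sketch, not a complete configuration: it writes an example file to the current directory instead of touching the real configuration file. The console name node1, the BMC host name node1-bmc, and the log path are hypothetical placeholders. The SERVER, GLOBAL, and CONSOLE directives are described in the conman.conf man page, which should be checked against the installed version.

```shell
# Write an example ConMan configuration to a local file for review.
# Assumptions: console name "node1", BMC host "node1-bmc", and the
# log directory are placeholders, not values from this guide.
cat > conman.conf.example <<'EOF'
# Accept client connections on the loopback address only.
SERVER loopback=ON
SERVER port=7890
# Log each console to its own file; %N expands to the console name.
GLOBAL log="/var/log/conman/%N.log"
# One IPMI Serial-Over-LAN console.
CONSOLE name="node1" dev="ipmi:node1-bmc"
EOF
echo "wrote conman.conf.example"
```

After review, the content could be merged into the ConMan configuration file. A client on the same host could then list the configured consoles with conman -q.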
   
   ConMan provides expect-scripts in the
   directory /usr/lib/conman/exec.
  
   Input to conman is not echoed in interactive mode.
   This can be changed by entering the escape sequence
   &E.
  
When pressing Enter in interactive mode, no line feed is generated. To generate a line feed, press Ctrl–L.
For more information about options, see the ConMan man page.
6.2 Monitoring HPC clusters with Prometheus and Grafana
Monitor the performance of HPC clusters using Prometheus and Grafana.
Prometheus collects metrics from exporters running on cluster nodes and stores the data in a time series database. Grafana provides data visualization dashboards for the metrics collected by Prometheus. Preconfigured dashboards are available on the Grafana website.
The following Prometheus exporters are useful for High Performance Computing:
- Slurm exporter: Extracts job and job queue status metrics from the Slurm workload manager. Install this exporter on a node that has access to the Slurm command line interface.
- Node exporter: Extracts hardware and kernel performance metrics directly from each compute node. Install this exporter on every compute node you want to monitor.
It is recommended that the monitoring data only be accessible from within a trusted environment (for example, using a login node or VPN). It should not be accessible from the internet without additional security hardening measures for access restriction, access control, and encryption.
- Grafana: https://grafana.com/docs/grafana/latest/getting-started/ 
- Grafana dashboards: https://grafana.com/grafana/dashboards 
- Prometheus: https://prometheus.io/docs/introduction/overview/ 
- Prometheus exporters: https://prometheus.io/docs/instrumenting/exporters/ 
- Slurm exporter: https://github.com/vpenso/prometheus-slurm-exporter 
- Node exporter: https://github.com/prometheus/node_exporter 
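
One way to follow the recommendation above about keeping monitoring data inside a trusted environment is to reach Grafana through an SSH tunnel via a login node. The following is only a sketch under stated assumptions: LOGINNODE and the user name tux are hypothetical, MNTRNODE stands for the monitoring server, and 3000 is the Grafana port used later in this chapter. The function prints the command instead of running it.

```shell
# Print an ssh command that forwards local port 3000 to Grafana on the
# monitoring server through a login node.
# Arguments: user name, login node, monitoring node (all placeholders).
grafana_tunnel_cmd() {
    user=$1
    login_node=$2
    monitor_node=$3
    printf 'ssh -N -L 3000:%s:3000 %s@%s\n' "$monitor_node" "$user" "$login_node"
}

grafana_tunnel_cmd tux LOGINNODE MNTRNODE
```

While such a tunnel is open, Grafana would be reachable at localhost:3000 on the workstation, and the monitoring ports would not need to be exposed beyond the cluster network.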
6.2.1 Installing Prometheus and Grafana
Install Prometheus and Grafana on a management server, or on a separate monitoring node.
- You have an installation source for Prometheus and Grafana:
  - The packages are available from SUSE Package Hub. To install SUSE Package Hub, see https://packagehub.suse.com/how-to-use/.
  - If you have a subscription for SUSE Manager, the packages are available from the SUSE Manager Client Tools repository.
 
In this procedure, replace MNTRNODE with the host name or IP address of the server where Prometheus and Grafana are installed.
- Install the Prometheus and Grafana packages:
    monitor# zypper in golang-github-prometheus-prometheus grafana
- Enable and start Prometheus:
    monitor# systemctl enable --now prometheus
- Verify that Prometheus works:
  - In a browser, navigate to MNTRNODE:9090/config, or:
  - In a terminal, run the following command:
      > wget MNTRNODE:9090/config --output-document=-
  Either of these methods should show the default contents of the /etc/prometheus/prometheus.yml file.
- Enable and start Grafana:
    monitor# systemctl enable --now grafana-server
- Log in to the Grafana web server at MNTRNODE:3000.
  Use admin for both the user name and password, then change the password when prompted.
- Open the page for adding a data source.
- Find Prometheus and select it.
- In the URL field, enter http://localhost:9090. The default settings for the other fields can remain unchanged.
  If Prometheus and Grafana are installed on different servers, replace localhost with the host name or IP address of the server where Prometheus is installed.
- Save the data source.
You can now configure Prometheus to collect metrics from the cluster, and add dashboards to Grafana to visualize those metrics.
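
Besides the /config page used in the procedure above, Prometheus serves a health endpoint (/-/healthy) and an HTTP query API (/api/v1/query). The following sketch only builds the corresponding URLs for a given server, so it can be tried without a running installation. MNTRNODE is the placeholder from the procedure, and the function names are made up for this example.

```shell
# Build Prometheus URLs for a given server (port 9090 as used above).
prom_health_url() {
    printf 'http://%s:9090/-/healthy\n' "$1"
}

prom_query_url() {
    # $2 must already be URL-safe; the query "up" needs no escaping.
    printf 'http://%s:9090/api/v1/query?query=%s\n' "$1" "$2"
}

prom_health_url MNTRNODE
prom_query_url MNTRNODE up
```

The URLs can be passed to wget in the same way as in the verification step, for example: wget "$(prom_query_url MNTRNODE up)" --output-document=-. The up query returns one sample per scrape target, with the value 1 for targets that were reachable during the last scrape.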
6.2.2 Monitoring cluster workloads
To monitor the status of the nodes and jobs in an HPC cluster, install the Prometheus Slurm exporter to collect workload data, then import a custom Slurm dashboard from the Grafana website to visualize the data. For more information about this dashboard, see https://grafana.com/grafana/dashboards/4323.
You must install the Slurm exporter on a node that has access to the Slurm command line interface. In the following procedure, the Slurm exporter will be installed on a management server.
- Section 6.2.1, “Installing Prometheus and Grafana” is complete. 
- The Slurm workload manager is fully configured. 
- You have internet access and policies that allow you to download the dashboard from the Grafana website. 
In this procedure, replace MGMTSERVER with the host name or IP address of the server where the Slurm exporter is installed, and replace MNTRNODE with the host name or IP address of the server where Grafana is installed.
- Install the Slurm exporter:
    management# zypper in golang-github-vpenso-prometheus_slurm_exporter
- Enable and start the Slurm exporter:
    management# systemctl enable --now prometheus-slurm_exporter
  Important: Slurm exporter fails when GPU monitoring is enabled
  In Slurm 20.11, the Slurm exporter fails when GPU monitoring is enabled. This feature is disabled by default. Do not enable it for this version of Slurm.
- Verify that the Slurm exporter works:
  - In a browser, navigate to MGMTSERVER:8080/metrics, or:
  - In a terminal, run the following command:
      > wget MGMTSERVER:8080/metrics --output-document=-
  Either of these methods should show output similar to the following:
    # HELP go_gc_duration_seconds A summary of the GC invocation durations.
    # TYPE go_gc_duration_seconds summary
    go_gc_duration_seconds{quantile="0"} 1.9521e-05
    go_gc_duration_seconds{quantile="0.25"} 4.5717e-05
    go_gc_duration_seconds{quantile="0.5"} 7.8573e-05
    ...
- On the server where Prometheus is installed, edit the scrape_configs section of the /etc/prometheus/prometheus.yml file to add a job for the Slurm exporter:
    - job_name: slurm-exporter
      scrape_interval: 30s
      scrape_timeout: 30s
      static_configs:
        - targets: ['MGMTSERVER:8080']
  Set the scrape_interval and scrape_timeout to 30s to avoid overloading the server.
- Restart the Prometheus service:
    monitor# systemctl restart prometheus
- Log in to the Grafana web server at MNTRNODE:3000.
- In the dashboard import field, enter the dashboard ID 4323, then load the dashboard.
- From the drop-down box, select the Prometheus data source added in Procedure 6.1, “Installing Prometheus and Grafana”, then confirm the import.
- Review the Slurm dashboard. The data might take some time to appear. 
- If you made any changes, save the dashboard when prompted, optionally describing your changes.
The Slurm dashboard is now available in Grafana.
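
The verification output shown in the procedure above starts with Go runtime metrics, which makes the Slurm values easy to miss. The following sketch filters metrics in the Prometheus text format so that only Slurm samples remain. It assumes that the exporter's metric names start with slurm_; the metric name in the sample input is made up for illustration.

```shell
# Keep only Slurm samples from Prometheus text-format metrics on stdin:
# drop "# HELP" and "# TYPE" comment lines, then keep names starting
# with "slurm_" (assumed prefix).
slurm_samples() {
    grep -v '^#' | grep '^slurm_'
}

# Made-up sample input; real input comes from the exporter.
printf '%s\n' \
    '# HELP slurm_example_metric Hypothetical metric for this sketch.' \
    'go_goroutines 8' \
    'slurm_example_metric 3' | slurm_samples
```

On a live system, the function could be fed from the exporter, for example: wget MGMTSERVER:8080/metrics --quiet --output-document=- | slurm_samples.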
6.2.3 Monitoring compute node performance
To monitor the performance and health of each compute node, install the Prometheus node exporter to collect performance data, then import a custom node dashboard from the Grafana website to visualize the data. For more information about this dashboard, see https://grafana.com/grafana/dashboards/405.
- Section 6.2.1, “Installing Prometheus and Grafana” is complete. 
- You have internet access and policies that allow you to download the dashboard from the Grafana website. 
- To run commands on multiple nodes at once, pdsh must be installed on the system your shell is running on, and SSH key authentication must be configured for all of the nodes. For more information, see Section 3.2, “pdsh — parallel remote shell program”.
In this procedure, replace the example node names with the host names or IP addresses of the nodes, and replace MNTRNODE with the host name or IP address of the server where Grafana is installed.
- Install the node exporter on each compute node. You can do this on multiple nodes at once by running the following command:
    management# pdsh -R ssh -l root -w "NODE1,NODE2" \
      "zypper in -y golang-github-prometheus-node_exporter"
- Enable and start the node exporter. You can do this on multiple nodes at once by running the following command:
    management# pdsh -R ssh -l root -w "NODE1,NODE2" \
      "systemctl enable --now prometheus-node_exporter"
- Verify that the node exporter works:
  - In a browser, navigate to NODE1:9100/metrics, or:
  - In a terminal, run the following command:
      > wget NODE1:9100/metrics --output-document=-
  Either of these methods should show output similar to the following:
    # HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
    # TYPE go_gc_duration_seconds summary
    go_gc_duration_seconds{quantile="0"} 2.3937e-05
    go_gc_duration_seconds{quantile="0.25"} 3.5456e-05
    go_gc_duration_seconds{quantile="0.5"} 8.1436e-05
    ...
- On the server where Prometheus is installed, edit the scrape_configs section of the /etc/prometheus/prometheus.yml file to add a job for the node exporter:
    - job_name: node-exporter
      static_configs:
        - targets: ['NODE1:9100']
        - targets: ['NODE2:9100']
  Add a target for every node that has the node exporter installed.
- Restart the Prometheus service:
    monitor# systemctl restart prometheus
- Log in to the Grafana web server at MNTRNODE:3000.
- In the dashboard import field, enter the dashboard ID 405, then load the dashboard.
- From the drop-down box, select the Prometheus data source added in Procedure 6.1, “Installing Prometheus and Grafana”, then confirm the import.
- Review the node dashboard. Click the drop-down box to select the nodes you want to view. The data might take some time to appear. 
- If you made any changes, save the dashboard when prompted. To keep the currently selected nodes next time you access the dashboard, activate the option that saves the current variable values. Optionally describe your changes, then confirm.
The node dashboard is now available in Grafana.
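
On larger clusters, writing one target per node by hand in the scrape configuration is error-prone. The following sketch prints a node exporter job with one target for every node name passed to it. The function name is made up for this example, and the output lists all targets under a single static_configs entry, which Prometheus treats as one target group. Review the output before pasting it into the scrape_configs section.

```shell
# Print a scrape job for the node exporter with one target per node name
# given as an argument. Port 9100 is the node exporter port used above.
node_exporter_job() {
    printf '  - job_name: node-exporter\n'
    printf '    static_configs:\n'
    printf '      - targets:\n'
    for node in "$@"; do
        printf "          - '%s:9100'\n" "$node"
    done
}

node_exporter_job NODE1 NODE2
```

The indentation matches a job placed directly under scrape_configs. If promtool is installed alongside Prometheus, promtool check config /etc/prometheus/prometheus.yml can validate the edited file before the service is restarted.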
6.3 Ganglia — system monitoring
Ganglia is a scalable, distributed monitoring system for high-performance computing systems, such as clusters and grids. It is based on a hierarchical design targeted at federations of clusters.
6.3.1 Using Ganglia
To use Ganglia, install ganglia-gmetad on the management server, then start the Ganglia meta-daemon: rcgmetad start. To make sure the service is started after a reboot, run: systemctl enable gmetad. On each cluster node that you want to monitor, install ganglia-gmond, start the service with rcgmond start, and make sure it is enabled to start automatically after a reboot: systemctl enable gmond. To test whether the gmond daemon has connected to the meta-daemon, run gstat -a and check that each node to be monitored is present in the output.
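
Checking the gstat -a output by eye becomes tedious with many nodes. The following sketch prints every expected node name that does not occur in a saved copy of the output. It uses a plain whole-word text match and makes no assumption about the column layout of gstat. The function name, the file name, and the sample text are made up for illustration.

```shell
# Print each expected node name that is missing from a saved output file.
# $1: file containing the output of "gstat -a"; remaining args: node names.
missing_nodes() {
    output_file=$1
    shift
    for node in "$@"; do
        grep -q -w -F -- "$node" "$output_file" || printf '%s\n' "$node"
    done
}

# Made-up stand-in for real gstat output, for illustration only.
printf 'node1 up\nnode3 up\n' > gstat.example
missing_nodes gstat.example node1 node2 node3
```

On a live system, the file would be created with gstat -a > gstat.out before calling the function. An empty result means that all listed nodes appear in the output.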
   
6.3.2 Ganglia on Btrfs
    When using the Btrfs file system, the monitoring data will be lost after
    a rollback of the service gmetad.
    To fix this issue, either install the package
    ganglia-gmetad-skip-bcheck or create the file
    /etc/ganglia/no_btrfs_check.
   
6.3.3 Using the Ganglia Web interface
    Install ganglia-web on the management server.
    Enable PHP in Apache2: a2enmod php7.
    Then start Apache2 on this machine: rcapache2
    start and make sure it is started automatically after a
    reboot: systemctl enable apache2. The Ganglia Web
    interface is accessible from
    http://MANAGEMENT_SERVER/ganglia.
   
6.4 rasdaemon — utility to log RAS error tracings
   rasdaemon is an RAS
   (Reliability, Availability and Serviceability) logging tool. It records
   memory errors using EDAC (Error Detection and Correction) tracing events.
   EDAC drivers in the Linux kernel handle detection of ECC (Error Correction
   Code) errors from memory controllers.
  
   rasdaemon can be used on large
   memory systems to track, record, and localize memory errors and how they
   evolve over time to detect hardware degradation. Furthermore, it can be used
   to localize a faulty DIMM on the mainboard.
  
To check whether the EDAC drivers are loaded, run the following command:
# ras-mc-ctl --status
   The command should return ras-mc-ctl: drivers are
   loaded. If it indicates that the drivers are not loaded, EDAC
   may not be supported on your board.
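
In a provisioning script, the status check above can gate further steps. The following sketch tests text on standard input for the message quoted above. The function name is made up, and the example feeds it the expected message instead of calling ras-mc-ctl, so it can be tried on any system.

```shell
# Succeed if stdin contains the message that ras-mc-ctl prints when the
# EDAC drivers are loaded (wording taken from this section).
edac_drivers_loaded() {
    grep -q 'drivers are loaded'
}

# Illustration with the expected message instead of a live system call.
if printf 'ras-mc-ctl: drivers are loaded.\n' | edac_drivers_loaded; then
    echo "EDAC drivers loaded"
else
    echo "EDAC drivers not loaded"
fi
```

On a real node, the input would come from the tool itself, for example: ras-mc-ctl --status | edac_drivers_loaded.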
  
   To start rasdaemon, run
   systemctl start rasdaemon.service.
   To start rasdaemon
   automatically at boot time, run systemctl enable
   rasdaemon.service. The daemon logs information to
   /var/log/messages and to an internal database. A
   summary of the stored errors can be obtained with the following command:
  
# ras-mc-ctl --summary
The errors stored in the database can be viewed with:
# ras-mc-ctl --errors
   Optionally, you can load the DIMM labels silk-screened on the system
   board to more easily identify the faulty DIMM. To do so, before starting
   rasdaemon, run:
  
# systemctl start ras-mc-ctl
   For this to work, you need to set up a layout description for the board.
   There are no descriptions supplied by default. To add a layout
   description, create a file with an arbitrary name in the directory
   /etc/ras/dimm_labels.d/. The format is:
  
Vendor: MOTHERBOARD-VENDOR-NAME
Model: MOTHERBOARD-MODEL-NAME
LABEL: MC.TOP.MID.LOW
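
As an illustration of the format above, the following sketch writes an example layout description to a local file for review. The vendor, model, and label names are hypothetical, and the numbers merely follow the MC.TOP.MID.LOW pattern. The real values depend on the board and on how its EDAC driver numbers the memory controllers and slots.

```shell
# Write an example DIMM label layout to a local file for review.
# Vendor, model, and label names are hypothetical; the four numbers
# follow the MC.TOP.MID.LOW pattern given above.
cat > dimm_labels.example <<'EOF'
Vendor: EXAMPLE-VENDOR
Model: EXAMPLE-MODEL
DIMM_A1: 0.0.0.0
DIMM_B1: 0.1.0.0
EOF
echo "wrote dimm_labels.example"
```

A reviewed file would then be copied to /etc/ras/dimm_labels.d/ under an arbitrary name. The ras-mc-ctl man page describes how the vendor and model strings are matched against the board.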

