
Recovering from Compute Node Failure

The following procedure assumes that there is at least one Compute Node already running. Otherwise, see the section called “Bootstrapping the Compute Plane”.

Procedure 1.1. Procedure for Recovering from Compute Node Failure

  1. If the Compute Node failed, it should have been fenced. Verify that this is the case. Otherwise, check /var/log/pacemaker.log on the Designated Coordinator to determine why the Compute Node was not fenced. The most likely reason is a problem with STONITH devices.
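
    One way to verify this (assuming the standard Pacemaker command line tools are available on the Designated Coordinator) is to query the fencing history for the node; COMPUTE_HOSTNAME is the name of the failed Compute Node:

    tux > stonith_admin --history COMPUTE_HOSTNAME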

  2. Determine the cause of the Compute Node's failure.

  3. Rectify the root cause.

  4. Boot the Compute Node again.

  5. Check whether the crowbar_join script ran successfully on the Compute Node. If this is not the case, check the log files to find out the reason. Refer to the section called “On All Other Crowbar Nodes” to find the exact location of the log file.

  6. If the chef-client agent triggered by crowbar_join succeeded, confirm that the pacemaker_remote service is up and running.
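
    For example, on the Compute Node itself (assuming a systemd-based system):

    tux > systemctl status pacemaker_remote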

  7. Check whether the remote node is registered and considered healthy by the core cluster. If this is not the case, check /var/log/pacemaker.log on the Designated Coordinator to determine the cause. There should be a remote primitive running on the core cluster (active/passive). This primitive is responsible for establishing a TCP connection to the pacemaker_remote service on port 3121 of the Compute Node. Ensure that nothing is preventing this particular TCP connection from being established (for example, problems with NICs, switches, firewalls, etc.). One way to do this is to run the following commands:

    tux > lsof -i tcp:3121
    tux > tcpdump tcp port 3121
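
    In addition, the cluster's view of the remote node can be checked from the Designated Coordinator with a status overview, for example with crm_mon (the exact output format depends on the Pacemaker version in use):

    tux > crm_mon -1
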
  8. If Pacemaker can communicate with the remote node, it should start the nova-compute service on it as part of the cloned group cl-g-nova-compute using the NovaCompute OCF resource agent. This cloned group will block startup of nova-evacuate until at least one clone is started.
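
    To check where instances of the cloned group are currently running, crm_resource (part of Pacemaker) can be used as one possible check; the resource name cl-g-nova-compute is the one mentioned above:

    tux > crm_resource --resource cl-g-nova-compute --locate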

    A related but different procedure, needed when no Compute Node is running at all, is described in the section called “Bootstrapping the Compute Plane”.

  9. It may happen that NovaCompute has been launched correctly on the Compute Node by the local resource manager daemon (lrmd), but the openstack-nova-compute service is still not running. This usually happens when nova-evacuate did not run correctly.

    If nova-evacuate is not running on one of the core cluster nodes, make sure that the service is marked as started (target-role="Started"). If it is marked as started but still not running, then your cloud does not have any Compute Nodes already running, contrary to what this procedure assumes.

    If nova-evacuate is started but it is failing, check the Pacemaker logs to determine the cause.

    If nova-evacuate is started and functioning correctly, it should call nova's evacuate API to release resources used by the Compute Node and resurrect elsewhere any VMs that died when it failed.
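
    To find out whether nova-evacuate is running, where it is running, and how its target role is set, checks along the following lines can help (crm_resource is part of Pacemaker; the resource name nova-evacuate is the one used in this step):

    tux > crm_resource --resource nova-evacuate --locate
    tux > crm_resource --resource nova-evacuate --meta --get-parameter target-role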

  10. If openstack-nova-compute is running, but VMs are not booted on the node, check that the service is not disabled or forced down using the openstack compute service list command. If the service is disabled, run the openstack compute service set --enable SERVICE_ID command. If the service is forced down, run the following commands:

    tux > fence_nova_param () {
            key="$1"
            cibadmin -Q -A "//primitive[@id='fence-nova']//nvpair[@name='$key']" | \
              sed -n '/.*value="/{s///;s/".*//;p}'
          }
    tux > fence_compute \
            --auth-url=`fence_nova_param auth-url` \
            --endpoint-type=`fence_nova_param endpoint-type` \
            --tenant-name=`fence_nova_param tenant-name` \
            --domain=`fence_nova_param domain` \
            --username=`fence_nova_param login` \
            --password=`fence_nova_param passwd` \
            -n COMPUTE_HOSTNAME \
            --action=on
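
    The disabled or forced-down state referred to above can be checked for the specific node with the standard filters of the openstack client (--service and --host are regular options of openstack compute service list; COMPUTE_HOSTNAME is the node in question):

    tux > openstack compute service list --service nova-compute --host COMPUTE_HOSTNAME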

The above steps should be performed automatically after the node is booted. If that does not happen, try the following debugging techniques.

Check the evacuate attribute for the Compute Node in the Pacemaker cluster's attrd service using the command:

tux > attrd_updater -p -n evacuate -N NODE

Possible results are the following: the attribute is not set at all for the node, it is set to no (the node is not marked as needing recovery), or it is set to yes (the node has been fenced and is awaiting recovery).

If the attribute is stuck with the wrong value, it can be set to no using the command:

tux > attrd_updater -n evacuate -U no -N NODE

After standard fencing has been performed, the fence agent fence_compute should activate the secondary fencing device (fence-nova). It does this by setting the evacuate attribute to yes to mark the node as needing recovery. The agent also calls nova's force_down API to notify it that the host is down. You should be able to see this in /var/log/nova/fence_compute.log on the node in the core cluster that was running the fence-nova agent at the time of fencing. During recovery, fence_compute tells nova that the host is up and running again.
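
To find out which node in the core cluster is currently running the fence-nova agent, and to inspect its log there, commands like the following can be used (crm_resource is part of Pacemaker; the log path is the one mentioned above):

tux > crm_resource --resource fence-nova --locate
tux > tail /var/log/nova/fence_compute.log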