The following procedure assumes that there is at least one Compute Node already running. Otherwise, see the section called “Bootstrapping the Compute Plane”.
Procedure 1.1. Procedure for Recovering from Compute Node Failure
If the Compute Node failed, it should have been fenced. Verify that this is
the case. If the node was not fenced, check /var/log/pacemaker.log on
the Designated Coordinator to determine why. The most likely reason is a
problem with the STONITH devices.
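One way to verify that fencing actually took place is to query the fencing history for the node (a sketch using the standard Pacemaker stonith_admin tool on the Designated Coordinator; NODE stands for the name of the Compute Node):
tux > stonith_admin --history NODE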
Determine the cause of the Compute Node's failure.
Rectify the root cause.
Boot the Compute Node again.
Check whether the crowbar_join script ran
successfully on the Compute Node. If it did not, check the log
files to find out why. Refer to
the section called “On All Other Crowbar Nodes” for the exact
location of the log file.
If the chef-client agent triggered by
crowbar_join succeeded, confirm that the
pacemaker_remote service is up and running.
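For example, you can query the service status directly on the Compute Node (assuming the service is managed by systemd):
tux > systemctl status pacemaker_remote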
Check whether the remote node is registered and considered healthy by the
core cluster. If this is not the case, check
/var/log/pacemaker.log on the Designated Coordinator
to determine the cause. There should be a remote primitive running on the
core cluster (active/passive). This primitive is responsible for
establishing a TCP connection to the
pacemaker_remote service on port 3121 of the
Compute Node. Ensure that nothing is preventing this particular TCP
connection from being established (for example, problems with NICs,
switches, or firewalls). One way to do this is to run the following
commands:
tux > lsof -i tcp:3121
tux > tcpdump tcp port 3121
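You can also check whether the core cluster currently considers the remote node online by looking at the cluster status from a core cluster node (a sketch; the Compute Node should appear as an online remote node in the output):
tux > crm_mon -1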
If Pacemaker can communicate with the remote node, it should start the
nova-compute service on it as part of the cloned
group cl-g-nova-compute using the NovaCompute OCF
resource agent. This cloned group will block startup of
nova-evacuate until at least one clone is
started.
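To check whether a clone instance has started on the recovered node, you can locate the cloned group from a core cluster node (a sketch using crm_resource; the resource name follows the one used above):
tux > crm_resource --resource cl-g-nova-compute --locate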
A necessary, related, but distinct procedure is described in the section called “Bootstrapping the Compute Plane”.
It may happen that NovaCompute has been launched
correctly on the Compute Node by lrmd, but the
openstack-nova-compute service is still not
running. This usually happens when nova-evacuate
did not run correctly.
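To confirm the state of the service on the Compute Node itself (assuming it is managed by systemd):
tux > systemctl status openstack-nova-compute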
If nova-evacuate is not
running on any of the core cluster nodes, make sure that the service is
marked as started (target-role="Started"). If it is
marked as started but still not running, then your cloud does not have any
Compute Nodes already running, contrary to the assumption of this procedure.
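You can verify the resource's configuration and current placement from a core cluster node, for example (a sketch using the crmsh and Pacemaker command line tools; the resource name follows the one used above):
tux > crm configure show nova-evacuate
tux > crm_resource --resource nova-evacuate --locate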
If nova-evacuate is started but it is
failing, check the Pacemaker logs to determine the cause.
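Failure counts for the resource can also be inspected directly from the cluster status (a sketch; run this on a core cluster node):
tux > crm_mon -1 --failcounts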
If nova-evacuate is started and
functioning correctly, it should call nova's
evacuate API to release resources used by the
Compute Node and resurrect elsewhere any VMs that died when it failed.
If openstack-nova-compute is running, but VMs are
not booted on the node, check that the service is not disabled or forced
down using the openstack compute service list
command. If the service is disabled, run the openstack
compute service set --enable
SERVICE_ID command. If the service is
forced down, run the following commands:
tux > fence_nova_param () {
        key="$1"
        cibadmin -Q -A "//primitive[@id='fence-nova']//nvpair[@name='$key']" | \
          sed -n '/.*value="/{s///;s/".*//;p}'
      }
tux > fence_compute \
        --auth-url=`fence_nova_param auth-url` \
        --endpoint-type=`fence_nova_param endpoint-type` \
        --tenant-name=`fence_nova_param tenant-name` \
        --domain=`fence_nova_param domain` \
        --username=`fence_nova_param login` \
        --password=`fence_nova_param passwd` \
        -n COMPUTE_HOSTNAME \
        --action=on
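Afterwards, you can confirm that the service is no longer reported as disabled or down by checking the service listing again (the --service filter is optional and only narrows the output to the nova-compute binary):
tux > openstack compute service list --service nova-compute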
The above steps should be performed automatically after the node is booted. If that does not happen, try the following debugging techniques.
Check the evacuate attribute for the Compute Node in the
Pacemaker cluster's attrd service using the
command:
tux > attrd_updater -p -n evacuate -N NODE
Possible results are the following:
The attribute is not set. Refer to Step 1 in Procedure 1.1, “Procedure for Recovering from Compute Node Failure”.
The attribute is set to yes. This means that the
Compute Node was fenced, but nova-evacuate never
initiated the recovery procedure by calling nova's evacuate API.
The attribute contains a time stamp, in which case the recovery procedure was initiated at the time indicated by the time stamp, but has not completed yet.
If the attribute is set to no, the recovery procedure
completed successfully and the cloud is ready for the Compute Node to
rejoin.
If the attribute is stuck with the wrong value, it can be set to
no using the command:
tux > attrd_updater -n evacuate -U no -N NODE
After standard fencing has been performed, the fence_compute
fence agent should activate the secondary
fencing device (fence-nova). It does this by setting
the attribute to yes to mark the node as needing
recovery. The agent also calls nova's
force_down API to notify it that the host is down.
You should be able to see this in
/var/log/nova/fence_compute.log on the node in the core
cluster that was running the fence-nova agent at
the time of fencing. During the recovery, fence_compute
tells nova that the host is up and running again.
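To find out which core cluster node was running the fence-nova agent, and therefore which node holds the log file mentioned above, you can locate the resource (a sketch using crm_resource; the resource name follows the one used above):
tux > crm_resource --resource fence-nova --locate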