13 System Maintenance #
Information about managing and configuring your cloud as well as procedures for performing node maintenance.
This section contains the following subsections to help you manage, configure, and maintain your SUSE OpenStack Cloud environment.
13.1 Planned System Maintenance #
Planned maintenance tasks for your cloud. See sections below for:
13.1.1 Whole Cloud Maintenance #
Planned maintenance procedures for your whole cloud.
13.1.1.1 Bringing Down Your Cloud: Services Down Method #
If you have planned maintenance and need to bring down your entire cloud, update and reboot all nodes in the cloud one by one. Start with the deployer node, then follow the order recommended in Section 13.1.1.2, “Rolling Reboot of the Cloud”. This method will bring down all of your services.
If you wish to use a method utilizing rolling reboots, in which your cloud services continue running, see Section 13.1.1.2, “Rolling Reboot of the Cloud”.
To perform backups prior to these steps, see Chapter 14, Backup and Restore, first.
13.1.1.1.1 Gracefully Bringing Down and Restarting Your Cloud Environment #
You will do the following steps from your Cloud Lifecycle Manager.
Log in to your Cloud Lifecycle Manager.
Gracefully shut down your cloud by running the ardana-stop.yml playbook:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-stop.yml
Shut down your nodes. You should shut down your controller nodes last and bring them up first after the maintenance.
There are multiple ways you can do this:
You can SSH to each node and use sudo reboot -f to reboot the node.
From the Cloud Lifecycle Manager, you can use the bm-power-down.yml and bm-power-up.yml playbooks.
You can shut down the nodes and then physically restart them, either via a power button or the IPMI.
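As an illustration only, the SSH-based option above can be sketched as a small shell loop. The node names and the echo-based dry run are assumptions; substitute your real hostnames and remove the echo to actually issue the reboots.

```shell
# Dry-run sketch (the node names are illustrative assumptions).
# Prints the ssh reboot command for each node instead of running it.
# List compute/Swift nodes first and controllers last, matching the
# shutdown order described above.
reboot_nodes() {
    for node in "$@"; do
        echo "ssh $node sudo reboot -f"
    done
}

reboot_nodes comp0001 comp0002 controller1
```

Keeping controllers at the end of the list means they are shut down last, matching the guidance to bring them up first after maintenance.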
Perform the necessary maintenance.
After the maintenance is complete, power your Cloud Lifecycle Manager back up and then SSH to it.
Determine the current power status of the nodes in your environment:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts bm-power-status.yml
If necessary, power up any nodes that are not already powered up, ensuring that you power up your controller nodes first. You can target specific nodes with the -e nodelist=<node_name> switch.
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts bm-power-up.yml [-e nodelist=<node_name>]
Note: Obtain the <node_name> by using the sudo cobbler system list command from the Cloud Lifecycle Manager.
Bring the databases back up:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts galera-bootstrap.yml
Gracefully bring up your cloud services by running the ardana-start.yml playbook:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-start.yml
Pause for a few minutes and give the cloud environment time to come up completely and then verify the status of the individual services using this playbook:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-status.yml
If any services did not start properly, you can run playbooks for the specific services having issues.
For example:
If RabbitMQ fails, run:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts rabbitmq-start.yml
You can check the status of RabbitMQ afterwards with this:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts rabbitmq-status.yml
If the recovery fails, you can run:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts rabbitmq-disaster-recovery.yml
Each of the other services has a playbook in the ~/scratch/ansible/next/ardana/ansible directory in the format <service>-start.yml that you can run. One example, for the compute service, is nova-start.yml.
Continue checking the status of your SUSE OpenStack Cloud 8 cloud services until there are no more failed or unreachable nodes:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-status.yml
13.1.1.2 Rolling Reboot of the Cloud #
If you have a planned maintenance and need to bring down your entire cloud and restart services while minimizing downtime, follow the steps here to safely restart your cloud. If you do not mind your services being down, then another option for planned maintenance can be found at Section 13.1.1.1, “Bringing Down Your Cloud: Services Down Method”.
13.1.1.2.1 Recommended node reboot order #
To ensure that rebooted nodes reintegrate into the cluster, the key is having enough time between controller reboots.
The recommended way to achieve this is as follows:
Reboot the controller nodes one by one, with a suitable interval in between. If you alternate between controllers and compute nodes, you will gain more time between the controller reboots.
Reboot the compute nodes (if present in your cloud).
Reboot the Swift nodes (if present in your cloud).
Reboot the ESX nodes (if present in your cloud).
13.1.1.2.2 Rebooting controller nodes #
Turn off the Keystone Fernet Token-Signing Key Rotation
Before rebooting any controller node, you need to ensure that the Keystone Fernet token-signing key rotation is turned off. Run the following command:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts keystone-stop-fernet-auto-rotation.yml
Migrate singleton services first
If you have previously rebooted your Cloud Lifecycle Manager for any reason, ensure that the apache2 service is running before continuing. To start the apache2 service, use this command:
sudo systemctl start apache2
The first consideration before rebooting any controller nodes is that there are a few services that run as singletons (non-HA), thus they will be unavailable while the controller they run on is down. Typically this is a very small window, but if you want to retain the service during the reboot of that server you should take special action to maintain service, such as migrating the service.
For these steps, if your singleton services are running on controller1 and you move them to controller2, then ensure you move them back to controller1 before proceeding to reboot controller2.
For the cinder-volume singleton service:
Execute the following command on each controller node to determine which node is hosting the cinder-volume singleton. It should be running on only one node:
ps auxww | grep cinder-volume | grep -v grep
Run the cinder-migrate-volume.yml playbook. Details about the Cinder volume and backup migration instructions can be found in Section 7.1.3, “Managing Cinder Volume and Backup Services”.
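The process check above can also be scripted. The helper below is a sketch, not part of the product tooling: it filters piped-in `ps auxww` output for a cinder-volume process, so you can run it against each controller (for example over SSH) to find the singleton's host.

```shell
# Sketch: answer "yes"/"no" depending on whether the piped-in
# `ps auxww` output contains a cinder-volume process. The second
# grep drops the grep process itself, as in the command above.
has_cinder_volume() {
    if grep cinder-volume | grep -v grep >/dev/null; then
        echo yes
    else
        echo no
    fi
}

# Example (assumption: controller hostnames are illustrative):
# for h in controller1 controller2 controller3; do
#     echo "$h: $(ssh "$h" ps auxww | has_cinder_volume)"
# done
```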
For the nova-consoleauth singleton service:
The nova-consoleauth component runs by default on the first controller node, that is, the host with consoleauth_host_index=0. To move it to another controller node before rebooting controller 0, run the ansible playbook nova-start.yml and pass it the index of the next controller node. For example, to move it to controller 2 (index of 1), run:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts nova-start.yml --extra-vars "consoleauth_host_index=1"
After you run this command, you may see two instances of the nova-consoleauth service when you run the nova service-list command; the duplicate will show as being in the disabled state. You can then delete the service using these steps.
Obtain the service ID for the duplicated nova-consoleauth service:
nova service-list
Example:
$ nova service-list
+----+------------------+---------------------------+----------+----------+-------+----------------------------+-----------------+
| Id | Binary           | Host                      | Zone     | Status   | State | Updated_at                 | Disabled Reason |
+----+------------------+---------------------------+----------+----------+-------+----------------------------+-----------------+
| 1  | nova-conductor   | ...a-cp1-c1-m1-mgmt       | internal | enabled  | up    | 2016-08-25T12:11:48.000000 | -               |
| 10 | nova-conductor   | ...a-cp1-c1-m3-mgmt       | internal | enabled  | up    | 2016-08-25T12:11:47.000000 | -               |
| 13 | nova-conductor   | ...a-cp1-c1-m2-mgmt       | internal | enabled  | up    | 2016-08-25T12:11:48.000000 | -               |
| 16 | nova-scheduler   | ...a-cp1-c1-m1-mgmt       | internal | enabled  | up    | 2016-08-25T12:11:39.000000 | -               |
| 19 | nova-scheduler   | ...a-cp1-c1-m2-mgmt       | internal | enabled  | up    | 2016-08-25T12:11:41.000000 | -               |
| 22 | nova-scheduler   | ...a-cp1-c1-m3-mgmt       | internal | enabled  | up    | 2016-08-25T12:11:44.000000 | -               |
| 25 | nova-consoleauth | ...a-cp1-c1-m1-mgmt       | internal | enabled  | up    | 2016-08-25T12:11:45.000000 | -               |
| 49 | nova-compute     | ...a-cp1-comp0001-mgmt    | nova     | enabled  | up    | 2016-08-25T12:11:48.000000 | -               |
| 52 | nova-compute     | ...a-cp1-comp0002-mgmt    | nova     | enabled  | up    | 2016-08-25T12:11:41.000000 | -               |
| 55 | nova-compute     | ...a-cp1-comp0003-mgmt    | nova     | enabled  | up    | 2016-08-25T12:11:43.000000 | -               |
| 70 | nova-consoleauth | ...a-cp1-c1-m3-mgmt       | internal | disabled | down  | 2016-08-25T12:10:40.000000 | -               |
+----+------------------+---------------------------+----------+----------+-------+----------------------------+-----------------+
Delete the disabled duplicate service with this command:
nova service-delete <service_ID>
Given the example in the previous step, the command could be:
nova service-delete 70
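If you prefer not to read the ID out of the table by hand, a small awk filter can pull it out of the nova service-list output. This is a sketch that assumes the standard table layout shown in the example above (ID in the second pipe-delimited field, binary in the third, status in the sixth).

```shell
# Sketch: print the Id of any disabled nova-consoleauth row from
# `nova service-list` output piped on stdin. Assumes the table
# layout shown in the example above.
find_disabled_consoleauth() {
    awk -F'|' '$3 ~ /nova-consoleauth/ && $6 ~ /disabled/ { gsub(/ /, "", $2); print $2 }'
}

# Usage: nova service-list | find_disabled_consoleauth
# then:  nova service-delete <printed_id>
```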
For the SNAT namespace singleton service:
If you reboot the controller node hosting the SNAT namespace service on it, Compute instances without floating IPs will lose network connectivity when that controller is rebooted. To prevent this from happening, you can use these steps to determine which controller node is hosting the SNAT namespace service and migrate it to one of the other controller nodes while that node is rebooted.
Locate the SNAT node where the router is providing the active snat_service.
From the Cloud Lifecycle Manager, list your ports to determine which port is serving as the router gateway:
source ~/service.osrc
neutron port-list --device_owner network:router_gateway
Example:
$ neutron port-list --device_owner network:router_gateway
+--------------------------------------+------+-------------------+-------------------------------------------------------------------------------------+
| id                                   | name | mac_address       | fixed_ips                                                                           |
+--------------------------------------+------+-------------------+-------------------------------------------------------------------------------------+
| 287746e6-7d82-4b2c-914c-191954eba342 |      | fa:16:3e:2e:26:ac | {"subnet_id": "f4152001-2500-4ebe-ba9d-a8d6149a50df", "ip_address": "10.247.96.29"} |
+--------------------------------------+------+-------------------+-------------------------------------------------------------------------------------+
Look at the details of this port to determine the binding:host_id value, which points to the host to which the port is bound:
neutron port-show <port_id>
Example; the value you need is in the binding:host_id row:
$ neutron port-show 287746e6-7d82-4b2c-914c-191954eba342
+-----------------------+--------------------------------------------------------------------------------------------------------------+
| Field                 | Value                                                                                                        |
+-----------------------+--------------------------------------------------------------------------------------------------------------+
| admin_state_up        | True                                                                                                         |
| allowed_address_pairs |                                                                                                              |
| binding:host_id       | ardana-cp1-c1-m2-mgmt                                                                                        |
| binding:profile       | {}                                                                                                           |
| binding:vif_details   | {"port_filter": true, "ovs_hybrid_plug": true}                                                               |
| binding:vif_type      | ovs                                                                                                          |
| binding:vnic_type     | normal                                                                                                       |
| device_id             | e122ea3f-90c5-4662-bf4a-3889f677aacf                                                                         |
| device_owner          | network:router_gateway                                                                                       |
| dns_assignment        | {"hostname": "host-10-247-96-29", "ip_address": "10.247.96.29", "fqdn": "host-10-247-96-29.openstacklocal."} |
| dns_name              |                                                                                                              |
| extra_dhcp_opts       |                                                                                                              |
| fixed_ips             | {"subnet_id": "f4152001-2500-4ebe-ba9d-a8d6149a50df", "ip_address": "10.247.96.29"}                          |
| id                    | 287746e6-7d82-4b2c-914c-191954eba342                                                                         |
| mac_address           | fa:16:3e:2e:26:ac                                                                                            |
| name                  |                                                                                                              |
| network_id            | d3cb12a6-a000-4e3e-82c4-ee04aa169291                                                                         |
| security_groups       |                                                                                                              |
| status                | DOWN                                                                                                         |
| tenant_id             |                                                                                                              |
+-----------------------+--------------------------------------------------------------------------------------------------------------+
In this example, ardana-cp1-c1-m2-mgmt is the node hosting the SNAT namespace service.
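The lookup above can be scripted. The sketch below extracts the binding:host_id value from neutron port-show output, assuming the standard two-column table layout shown in the example.

```shell
# Sketch: print the binding:host_id value from `neutron port-show`
# output piped on stdin (field 2 is the name, field 3 the value).
snat_host() {
    awk -F'|' '$2 ~ /binding:host_id/ { gsub(/ /, "", $3); print $3 }'
}

# Usage: neutron port-show <port_id> | snat_host
```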
SSH to the node hosting the SNAT namespace service and check the SNAT namespace, specifying the router_id that has the interface with the subnet that you are interested in:
ssh <IP_of_SNAT_namespace_host>
sudo ip netns exec snat-<router_ID> bash
Example:
sudo ip netns exec snat-e122ea3f-90c5-4662-bf4a-3889f677aacf bash
Obtain the ID for the L3 Agent for the node hosting your SNAT namespace:
source ~/service.osrc
neutron agent-list
Example, with the entry you need given the examples above:
$ neutron agent-list
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
| id                                   | agent_type           | host                     | alive | admin_state_up | binary                    |
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
| 0126bbbf-5758-4fd0-84a8-7af4d93614b8 | DHCP agent           | ardana-cp1-c1-m3-mgmt    | :-)   | True           | neutron-dhcp-agent        |
| 33dec174-3602-41d5-b7f8-a25fd8ff6341 | Metadata agent       | ardana-cp1-c1-m2-mgmt    | :-)   | True           | neutron-metadata-agent    |
| 3bc28451-c895-437b-999d-fdcff259b016 | L3 agent             | ardana-cp1-c1-m1-mgmt    | :-)   | True           | neutron-vpn-agent         |
| 4af1a941-61c1-4e74-9ec1-961cebd6097b | L3 agent             | ardana-cp1-comp0001-mgmt | :-)   | True           | neutron-l3-agent          |
| 58f01f34-b6ca-4186-ac38-b56ee376ffeb | Loadbalancerv2 agent | ardana-cp1-comp0001-mgmt | :-)   | True           | neutron-lbaasv2-agent     |
| 65bcb3a0-4039-4d9d-911c-5bb790953297 | Open vSwitch agent   | ardana-cp1-c1-m1-mgmt    | :-)   | True           | neutron-openvswitch-agent |
| 6981c0e5-5314-4ccd-bbad-98ace7db7784 | L3 agent             | ardana-cp1-c1-m3-mgmt    | :-)   | True           | neutron-vpn-agent         |
| 7df9fa0b-5f41-411f-a532-591e6db04ff1 | Metadata agent       | ardana-cp1-comp0001-mgmt | :-)   | True           | neutron-metadata-agent    |
| 92880ab4-b47c-436c-976a-a605daa8779a | Metadata agent       | ardana-cp1-c1-m1-mgmt    | :-)   | True           | neutron-metadata-agent    |
| a209c67d-c00f-4a00-b31c-0db30e9ec661 | L3 agent             | ardana-cp1-c1-m2-mgmt    | :-)   | True           | neutron-vpn-agent         |
| a9467f7e-ec62-4134-826f-366292c1f2d0 | DHCP agent           | ardana-cp1-c1-m1-mgmt    | :-)   | True           | neutron-dhcp-agent        |
| b13350df-f61d-40ec-b0a3-c7c647e60f75 | Open vSwitch agent   | ardana-cp1-c1-m2-mgmt    | :-)   | True           | neutron-openvswitch-agent |
| d4c07683-e8b0-4a2b-9d31-b5b0107b0b31 | Open vSwitch agent   | ardana-cp1-comp0001-mgmt | :-)   | True           | neutron-openvswitch-agent |
| e91d7f3f-147f-4ad2-8751-837b936801e3 | Open vSwitch agent   | ardana-cp1-c1-m3-mgmt    | :-)   | True           | neutron-openvswitch-agent |
| f33015c8-f4e4-4505-b19b-5a1915b6e22a | DHCP agent           | ardana-cp1-c1-m2-mgmt    | :-)   | True           | neutron-dhcp-agent        |
| fe43c0e9-f1db-4b67-a474-77936f7acebf | Metadata agent       | ardana-cp1-c1-m3-mgmt    | :-)   | True           | neutron-metadata-agent    |
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
Also obtain the ID for the L3 agent of the node you are going to move the SNAT namespace service to, using the same commands as in the previous step.
Use these commands to move the SNAT namespace service, with the router_id being the same value as the ID of the router.
Remove the L3 agent for the old host:
neutron l3-agent-router-remove <agent_id_of_snat_namespace_host> <qrouter_uuid>
Example:
$ neutron l3-agent-router-remove a209c67d-c00f-4a00-b31c-0db30e9ec661 e122ea3f-90c5-4662-bf4a-3889f677aacf Removed router e122ea3f-90c5-4662-bf4a-3889f677aacf from L3 agent
Remove the SNAT namespace:
sudo ip netns delete snat-<router_id>
Example:
$ sudo ip netns delete snat-e122ea3f-90c5-4662-bf4a-3889f677aacf
Create a new L3 Agent for the new host:
neutron l3-agent-router-add <agent_id_of_new_snat_namespace_host> <qrouter_uuid>
Example:
$ neutron l3-agent-router-add 3bc28451-c895-437b-999d-fdcff259b016 e122ea3f-90c5-4662-bf4a-3889f677aacf Added router e122ea3f-90c5-4662-bf4a-3889f677aacf to L3 agent
Confirm that it has been moved by listing the details of your port from step 1b above, noting the value of binding:host_id, which should be updated to the host you moved your SNAT namespace to:
neutron port-show <port_ID>
Example:
$ neutron port-show 287746e6-7d82-4b2c-914c-191954eba342
+-----------------------+--------------------------------------------------------------------------------------------------------------+
| Field                 | Value                                                                                                        |
+-----------------------+--------------------------------------------------------------------------------------------------------------+
| admin_state_up        | True                                                                                                         |
| allowed_address_pairs |                                                                                                              |
| binding:host_id       | ardana-cp1-c1-m1-mgmt                                                                                        |
| binding:profile       | {}                                                                                                           |
| binding:vif_details   | {"port_filter": true, "ovs_hybrid_plug": true}                                                               |
| binding:vif_type      | ovs                                                                                                          |
| binding:vnic_type     | normal                                                                                                       |
| device_id             | e122ea3f-90c5-4662-bf4a-3889f677aacf                                                                         |
| device_owner          | network:router_gateway                                                                                       |
| dns_assignment        | {"hostname": "host-10-247-96-29", "ip_address": "10.247.96.29", "fqdn": "host-10-247-96-29.openstacklocal."} |
| dns_name              |                                                                                                              |
| extra_dhcp_opts       |                                                                                                              |
| fixed_ips             | {"subnet_id": "f4152001-2500-4ebe-ba9d-a8d6149a50df", "ip_address": "10.247.96.29"}                          |
| id                    | 287746e6-7d82-4b2c-914c-191954eba342                                                                         |
| mac_address           | fa:16:3e:2e:26:ac                                                                                            |
| name                  |                                                                                                              |
| network_id            | d3cb12a6-a000-4e3e-82c4-ee04aa169291                                                                         |
| security_groups       |                                                                                                              |
| status                | DOWN                                                                                                         |
| tenant_id             |                                                                                                              |
+-----------------------+--------------------------------------------------------------------------------------------------------------+
Reboot the controllers
In order to reboot the controller nodes, you must first retrieve a list of nodes in your cloud running control plane services.
for i in $(grep -w cluster-prefix ~/openstack/my_cloud/definition/data/control_plane.yml | awk '{print $2}'); do grep $i ~/scratch/ansible/next/ardana/ansible/hosts/verb_hosts | grep ansible_ssh_host | awk '{print $1}'; done
Then perform the following steps from your Cloud Lifecycle Manager for each of your controller nodes:
If any singleton services are active on this node, they will be unavailable while the node is down. If you want to retain the service during the reboot, you should take special action to maintain the service, such as migrating the service as appropriate as noted above.
Stop all services on the controller node that you are rebooting first:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-stop.yml --limit <controller node>
Reboot the controller node, e.g. run the following command on the controller itself:
sudo reboot
Note that the current node being rebooted could be hosting the lifecycle manager.
Wait for the controller node to become ssh-able and allow an additional minimum of five minutes for the controller node to settle. Start all services on the controller node:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit <controller node>
Verify that the status of all services on the controller node is OK:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-status.yml --limit <controller node>
When the above start operation has completed successfully, you may proceed to the next controller node. Ensure that you migrate your singleton services off that node first.
It is important that you not begin the reboot procedure for a new controller node until the reboot of the previous controller node has been completed successfully (that is, the ardana-status playbook has completed without error).
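The per-controller sequence above can be summarized as a dry-run loop. This is an illustration only: it echoes the playbook commands in the required order rather than executing them, and it does not cover singleton migration or the wait for the node to become reachable.

```shell
# Dry-run sketch of the rolling controller reboot order:
# stop services, reboot, start services, verify status, and only
# then move on to the next controller.
rolling_reboot() {
    for node in "$@"; do
        echo "ansible-playbook -i hosts/verb_hosts ardana-stop.yml --limit $node"
        echo "ssh $node sudo reboot"
        # (wait here for the node to become ssh-able, plus at least 5 minutes)
        echo "ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit $node"
        echo "ansible-playbook -i hosts/verb_hosts ardana-status.yml --limit $node"
    done
}

rolling_reboot controller1 controller2 controller3
```

The controller names are illustrative assumptions; the point of the sketch is the strict per-node ordering, which matches the requirement that each controller's status check completes without error before the next reboot begins.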
Reenable the Keystone Fernet Token-Signing Key Rotation
After all the controller nodes are successfully updated and back online, you need to re-enable the Keystone Fernet token-signing key rotation job by running the keystone-reconfigure.yml playbook. On the deployer, run:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts keystone-reconfigure.yml
13.1.1.2.3 Rebooting compute nodes #
To reboot a compute node, the following operations need to be performed:
Disable provisioning of the node to take the node offline to prevent further instances being scheduled to the node during the reboot.
Identify instances that exist on the compute node, and then either:
Live migrate the instances off the node before actioning the reboot. OR
Stop the instances
Reboot the node
Restart the Nova services
Disable provisioning:
nova service-disable --reason "<describe reason>" <node name> nova-compute
If the node has existing instances running on it, these instances will need to be migrated or stopped prior to rebooting the node.
Live migrate existing instances. Identify the instances on the compute node. Note: The following command must be run with nova admin credentials.
nova list --host <hostname> --all-tenants
Migrate or Stop the instances on the compute node.
Migrate the instances off the node by running one of the following commands for each of the instances:
If your instance is booted from a volume and has any number of Cinder volumes attached, use the nova live-migration command:
nova live-migration <instance uuid> [<target compute host>]
If your instance has local (ephemeral) disk(s) only, you can use the --block-migrate option:
nova live-migration --block-migrate <instance uuid> [<target compute host>]
Note: The [<target compute host>] argument is optional. If you do not specify a target host, the nova scheduler will choose a node for you.
OR
Stop the instances on the node by running the following command for each of the instances:
nova stop <instance-uuid>
Stop all services on the Compute node:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-stop.yml --limit <compute node>
SSH to your Compute nodes and reboot them:
sudo reboot
The operating system cleanly shuts down services and then automatically reboots. If you want to be very thorough, run your backup jobs just before you reboot.
Run the ardana-start.yml playbook from the Cloud Lifecycle Manager. If needed, use the bm-power-up.yml playbook to restart the node. Specify just the node(s) you want to start in the 'nodelist' parameter arguments, that is, nodelist=<node1>[,<node2>][,<node3>].
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost bm-power-up.yml -e nodelist=<compute node>
Execute the ardana-start.yml playbook, specifying the node(s) you want to start in the 'limit' parameter arguments. This parameter accepts wildcard arguments and also '@<filename>' to process all hosts listed in the file.
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit <compute node>
Re-enable provisioning on the node:
nova service-enable <node-name> nova-compute
Restart any instances you stopped.
nova start <instance-uuid>
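For reference, the whole compute-node sequence can be sketched as one dry-run function. It only echoes the commands in order (the per-instance migrate/stop handling is elided), and the node name is an illustrative assumption.

```shell
# Dry-run sketch of compute node maintenance: disable scheduling,
# inspect instances, stop services, reboot, restart services,
# re-enable scheduling.
compute_maintenance() {
    node=$1
    echo "nova service-disable --reason maintenance $node nova-compute"
    echo "nova list --host $node --all-tenants"  # then migrate or stop each instance
    echo "ansible-playbook -i hosts/verb_hosts ardana-stop.yml --limit $node"
    echo "ssh $node sudo reboot"
    echo "ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit $node"
    echo "nova service-enable $node nova-compute"
}

compute_maintenance comp0001
```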
13.1.1.2.4 Rebooting Swift nodes #
If your Swift services are on a controller node, follow the controller node reboot instructions above.
For a dedicated Swift PAC cluster or Swift object resource node, perform the following steps for each Swift host.
Stop all services on the Swift node:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-stop.yml --limit <Swift node>
Reboot the Swift node by running the following command on the Swift node itself:
sudo reboot
Wait for the node to become ssh-able and then start all services on the Swift node:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit <Swift node>
13.1.1.2.5 Get list of status playbooks #
Running the following command will yield a list of status playbooks:
cd ~/scratch/ansible/next/ardana/ansible
ls *status*
Here is the list:
ls *status*
FND-AP2-status.yml           ardana-status.yml       bm-power-status.yml
FND-CLU-status.yml           ceilometer-status.yml   cinder-status.yml
cmc-status.yml               freezer-status.yml      galera-status.yml
glance-status.yml            heat-status.yml         horizon-status.yml
ironic-status.yml            keystone-status.yml     logging-producer-status.yml
logging-server-status.yml    logging-status.yml      memcached-status.yml
monasca-agent-status.yml     monasca-status.yml      neutron-status.yml
nova-status.yml              ops-console-status.yml  rabbitmq-status.yml
swift-status.yml             zookeeper-status.yml
13.1.2 Planned Control Plane Maintenance #
Planned maintenance tasks for controller nodes such as full cloud reboots and replacing controller nodes.
13.1.2.1 Replacing a Controller Node #
This section outlines steps for replacing a controller node in your environment.
For SUSE OpenStack Cloud, you must have three controller nodes. Therefore, adding or removing nodes is not an option. However, if you need to repair or replace a controller node, you may do so by following the steps outlined here. Note that all playbooks for cloud maintenance are run from the Cloud Lifecycle Manager.
These steps will depend on whether you need to replace a shared lifecycle manager/controller node or whether this is a standalone controller node.
Keep in mind while performing the following tasks:
Do not add entries for a new server. Instead, update the entries for the broken one.
Be aware that all management commands are run on the node where the Cloud Lifecycle Manager is running.
13.1.2.1.2 Replacing a Standalone Controller Node #
If the controller node you need to replace is not also being used as the Cloud Lifecycle Manager, follow the steps below.
Log in to the Cloud Lifecycle Manager.
Update your cloud model, specifically the servers.yml file, with the new mac-addr, ilo-ip, ilo-password, and ilo-user fields where these have changed. Do not change the id, ip-addr, role, or server-group settings.
Commit your configuration (see Book “Installing with Cloud Lifecycle Manager”, Chapter 10 “Using Git for Configuration Management”), as follows:
cd ~/openstack/ardana/ansible
git add -A
git commit -m "My config or other commit message"
Run the configuration processor:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost config-processor-run.yml
Update your deployment directory:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost ready-deployment.yml
Remove the old controller node(s) from Cobbler. You can list the systems currently in Cobbler with this command:
sudo cobbler system list
and then remove the old controller nodes with this command:
sudo cobbler system remove --name <node>
Remove the SSH key of the old controller node from the known hosts file, specifying the ip-addr value:
ssh-keygen -f "~/.ssh/known_hosts" -R <ip_addr>
You should see a response similar to this one:
ardana@ardana-cp1-c1-m1-mgmt:~/openstack/ardana/ansible$ ssh-keygen -f "~/.ssh/known_hosts" -R 10.13.111.135
# Host 10.13.111.135 found: line 6 type ECDSA
~/.ssh/known_hosts updated.
Original contents retained as ~/.ssh/known_hosts.old
Run the cobbler-deploy playbook to add the new controller node:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost cobbler-deploy.yml
Image the new node(s) by using the bm-reimage playbook. You will specify the name for the node in Cobbler in the command:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<node-name>
Important: You must ensure that the old controller node is powered off before completing this step, because the new controller node will re-use the original IP address.
Configure the necessary keys used for the database and other services:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts monasca-rebuild-pretasks.yml
Run osconfig on the replacement controller node. For example:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts osconfig-run.yml -e rebuild=True --limit=<controller-hostname>
If the controller being replaced is the Swift ring builder (see Section 15.6.2.4, “Identifying the Swift Ring Building Server”), you need to restore the Swift ring builder files to the /etc/swiftlm/CLOUD-NAME/CONTROL-PLANE-NAME/builder_dir directory. See Section 15.6.2.7, “Recovering Swift Builder Files” for details.
Run the ardana-deploy playbook on the replacement controller.
If the node being replaced is the Swift ring builder server, then you only need to use the --limit switch for that node; otherwise, you need to specify both the hostname of your Swift ring builder server and the hostname of the node being replaced.
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-deploy.yml -e rebuild=True --limit=<controller-hostname>,<swift-ring-builder-hostname>
Important: If you receive a Keystone failure when running this playbook, it is likely due to Fernet keys being out of sync. This problem can be corrected by running the keystone-reconfigure.yml playbook to re-sync the Fernet keys. In this situation, do not use the --limit option when running keystone-reconfigure.yml; in order to re-sync the Fernet keys, all the controller nodes must be in the play.
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts keystone-reconfigure.yml
Important: If you receive a RabbitMQ failure when running this playbook, review Section 15.2.1, “Understanding and Recovering RabbitMQ after Failure” for how to resolve the issue, and then re-run the ardana-deploy playbook.
During the replacement of the node, alarms will show up. If they do not clear after the node is back up and healthy, restart the threshold engine by running the following playbooks:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts monasca-stop.yml --tags thresh
ansible-playbook -i hosts/verb_hosts monasca-start.yml --tags thresh
13.1.3 Planned Compute Maintenance #
Planned maintenance tasks for compute nodes.
13.1.3.1 Planned Maintenance for a Compute Node #
If one or more of your compute nodes needs hardware maintenance and you can schedule planned maintenance, follow this procedure.
13.1.3.1.1 Performing planned maintenance on a compute node #
If you have planned maintenance to perform on a compute node, you have to take it offline, repair it, and restart it. To do so, follow these steps:
Log in to the Cloud Lifecycle Manager.
Source the administrator credentials:
source ~/service.osrc
Obtain the hostname for your compute node, which you will use in subsequent commands when <hostname> is requested:
nova host-list | grep compute
The following example shows two compute nodes:
$ nova host-list | grep compute
| ardana-cp1-comp0001-mgmt | compute | AZ1 |
| ardana-cp1-comp0002-mgmt | compute | AZ2 |
Disable provisioning on the compute node, which will prevent additional instances from being spawned on it:
nova service-disable --reason "Maintenance mode" <hostname> nova-compute
Note: Make sure you re-enable provisioning after the maintenance is complete if you want to continue to be able to spawn instances on the node. You can do this with the command:
nova service-enable <hostname> nova-compute
At this point you have two choices:
Live migration: This option enables you to migrate the instances off the compute node with minimal downtime so you can perform the maintenance without risk of losing data.
Stop/start the instances: Issuing nova stop commands to each of the instances will halt them. This option lets you do maintenance and then start the instances back up, as long as no disk failures occur on the compute node data disks. This method involves downtime for the length of the maintenance.
If you choose the live migration route, see Section 13.1.3.3, “Live Migration of Instances” for more details, and skip to step 6 after you finish the live migration.
If you choose the stop/start method, continue on.
List all of the instances on the node so you can issue stop commands to them:
nova list --host <hostname> --all-tenants
Issue the nova stop command against each of the instances:
nova stop <instance uuid>
Confirm that the instances are stopped. If the stop was successful, you should see the instances in a SHUTOFF state, as shown here:

$ nova list --host ardana-cp1-comp0002-mgmt --all-tenants
+--------------------------------------+-----------+----------------------------------+---------+------------+-------------+-----------------------+
| ID                                   | Name      | Tenant ID                        | Status  | Task State | Power State | Networks              |
+--------------------------------------+-----------+----------------------------------+---------+------------+-------------+-----------------------+
| ef31c453-f046-4355-9bd3-11e774b1772f | instance1 | 4365472e025c407c8d751fc578b7e368 | SHUTOFF | -          | Shutdown    | demo_network=10.0.0.5 |
+--------------------------------------+-----------+----------------------------------+---------+------------+-------------+-----------------------+

Do your required maintenance. If this maintenance does not take down the disks completely, you should be able to list the instances again after the repair and confirm that they are still in their SHUTOFF state:

nova list --host <hostname> --all-tenants
Start the instances back up using this command:
nova start <instance uuid>
Example:
$ nova start ef31c453-f046-4355-9bd3-11e774b1772f
Request to start server ef31c453-f046-4355-9bd3-11e774b1772f has been accepted.
Confirm that the instances started back up. If the restart is successful, you should see the instances in an ACTIVE state, as shown here:

$ nova list --host ardana-cp1-comp0002-mgmt --all-tenants
+--------------------------------------+-----------+----------------------------------+--------+------------+-------------+-----------------------+
| ID                                   | Name      | Tenant ID                        | Status | Task State | Power State | Networks              |
+--------------------------------------+-----------+----------------------------------+--------+------------+-------------+-----------------------+
| ef31c453-f046-4355-9bd3-11e774b1772f | instance1 | 4365472e025c407c8d751fc578b7e368 | ACTIVE | -          | Running     | demo_network=10.0.0.5 |
+--------------------------------------+-----------+----------------------------------+--------+------------+-------------+-----------------------+

If the nova start fails, you can try doing a hard reboot:

nova reboot --hard <instance uuid>
If this does not resolve the issue you may want to contact support.
Re-enable provisioning when the node is fixed:
nova service-enable <hostname> nova-compute
13.1.3.2 Rebooting a Compute Node #
If you only need to reboot a Compute node, use the following steps.
You can choose to live migrate all Compute instances off the node prior to the reboot. Any instances that remain will be restarted when the node is rebooted. This playbook will ensure that all services on the Compute node are restarted properly.
Log in to the Cloud Lifecycle Manager.
Reboot the Compute node(s) with the following playbook.
You can specify either a single Compute node or multiple Compute nodes using the --limit switch. An optional reboot wait time can also be specified; if no reboot wait time is specified, it defaults to 300 seconds.

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts nova-compute-reboot.yml --limit [compute_node_or_list] [-e nova_reboot_wait_timeout=(seconds)]
Note: If the Compute node fails to reboot, you should troubleshoot this issue separately as this playbook will not attempt to recover after a failed reboot.
13.1.3.3 Live Migration of Instances #
Live migration allows you to move active compute instances between compute nodes, reducing downtime during maintenance.
SUSE OpenStack Cloud Nova offers a set of commands that allow you to move compute instances between compute hosts. Which command you use will depend on the state of the host, what operating system is on the host, what type of storage the instances are using, and whether you want to migrate a single instance or all of the instances off of the host. We will describe these options on this page as well as give you step-by-step instructions for performing them.
13.1.3.3.1 Migration Options #
If your compute node has failed
A compute host failure could be caused by hardware failure, such as the data disk needing to be replaced, power has been lost, or any other type of failure which requires that you replace the baremetal host. In this scenario, the instances on the compute node are unrecoverable and any data on the local ephemeral storage is lost. If you are utilizing block storage volumes, either as a boot device or as additional storage, they should be unaffected.
In these cases you will want to use one of the Nova evacuate commands, which will cause Nova to rebuild the instances on other hosts.
This table describes each of the evacuate options for failed compute nodes:
| Command | Description |
|---|---|
| nova evacuate | This command is used to evacuate a single instance from a failed host. You specify the compute instance UUID and the target host you want to evacuate it to. If no host is specified then the Nova scheduler will choose one for you. See nova help evacuate for more details. |
| nova host-evacuate | This command is used to evacuate all instances from a failed host. You specify the hostname of the compute host you want to evacuate. Optionally you can specify a target host. If no target host is specified then the Nova scheduler will choose a target host for each instance. See nova help host-evacuate for more details. |
If your compute host is active, powered on, and the data disks are in working order, you can use the migration commands to move your compute instances. There are two migration features: "cold" migration (also referred to simply as "migration") and live migration. Migration and live migration are two different functions.
Cold migration is used to copy an instance's data, while it is in a SHUTOFF status, from one compute host to another. It does this using passwordless SSH access, which has security concerns associated with it. For this reason, the nova migrate function is disabled by default, but you can enable this feature if you would like. Details on how to do this can be found in Section 5.4, “Enabling the Nova Resize and Migrate Features”.
Live migration can be performed on instances in either an ACTIVE or PAUSED state. It uses the QEMU hypervisor to copy the running processes and associated resources to the destination compute host using the hypervisor's own protocol, which makes it a more secure method and allows for less downtime. There may be a short network outage during a live migration, usually a few milliseconds, but it could last up to a few seconds if your compute instances are busy. There may also be some performance degradation during the process.
The compute host must remain powered on during the migration process.
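While a live migration is in progress, the instance's task state reflects it in nova show output. The sketch below is a hedged example of watching a single migration from the CLI: the UUID is a placeholder taken from this section's examples, and the function exits early with a message when the nova CLI is unavailable.

```shell
# Sketch: start a live migration and poll it until it completes.
watch_migration() {
    uuid=$1
    if ! command -v nova >/dev/null 2>&1; then
        echo "nova CLI not found; run this on the Cloud Lifecycle Manager"
        return 0
    fi
    nova live-migration "$uuid"
    # While the copy is in progress the OS-EXT-STS:task_state field reads
    # "migrating"; when it clears, report which host the instance landed on.
    while nova show "$uuid" | grep 'OS-EXT-STS:task_state' | grep -q migrating; do
        sleep 5
    done
    nova show "$uuid" | grep 'OS-EXT-SRV-ATTR:host'
}

watch_migration ef31c453-f046-4355-9bd3-11e774b1772f
```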
Both the cold migration and live migration options will honor Nova group policies, which includes affinity settings. There is a limitation to keep in mind if you use group policies; it is discussed in Section 13.1.3.3.2, “Limitations of these Features”.
This table describes each of the migration options for active compute nodes:
| Command | Description | SLES |
|---|---|---|
| nova migrate | Used to cold migrate a single instance from a compute host. The nova-scheduler will choose the target host. This command will work against instances in an ACTIVE or SHUTOFF state. See the difference between cold migration and live migration at the start of this section. | |
| nova host-servers-migrate | Used to cold migrate all instances off a specified host to other available hosts, chosen by the nova-scheduler. This command will work against instances in an ACTIVE or SHUTOFF state. See the difference between cold migration and live migration at the start of this section. | |
| nova live-migration | Used to migrate a single instance between two compute hosts. You can optionally specify a target host or you can allow the nova scheduler to choose a host for you. If you choose to specify a target host, ensure that the target host has enough resources to host the instance prior to live migration. Works for instances that are booted from a block storage volume or that have any number of block storage volumes attached. This command works against instances in an ACTIVE or PAUSED state. | X |
| nova live-migration --block-migrate | Used to migrate a single instance between two compute hosts. You can optionally specify a target host or you can allow the nova scheduler to choose a host for you. If you choose to specify a target host, ensure that the target host has enough resources to host the instance prior to live migration. Works for instances that are booted from local (ephemeral) disk(s) only or if your instance has a mix of ephemeral disk(s) and block storage volume(s) but are not booted from a block storage volume. This command works against instances in an ACTIVE or PAUSED state. | X |
| nova host-evacuate-live | Used to live migrate all instances off of a compute host. You can optionally specify a target host or you can allow the nova scheduler to choose a host for you. If you choose to specify a target host, ensure that the target host has enough resources to host the instances prior to live migration. Works for instances that are booted from a block storage volume or that have any number of block storage volumes attached. This command works against instances in an ACTIVE or PAUSED state. | X |
| nova host-evacuate-live --block-migrate | Used to live migrate all instances off of a compute host. You can optionally specify a target host or you can allow the nova scheduler to choose a host for you. If you choose to specify a target host, ensure that the target host has enough resources to host the instances prior to live migration. Works for instances that are booted from local (ephemeral) disk(s) only or if your instance has a mix of ephemeral disk(s) and block storage volume(s) but are not booted from a block storage volume. This command works against instances in an ACTIVE or PAUSED state. | X |
13.1.3.3.2 Limitations of these Features #
There are limitations that may impact your use of this feature:
To use live migration, your compute instances must be in either an ACTIVE or PAUSED state on the compute host. If you have instances in a SHUTOFF state, cold migration should be used.

Instances in a PAUSED state cannot be live migrated using the Horizon dashboard. You will need to use the NovaClient CLI to perform these.

Both cold migration and live migration honor an instance's group policies. If you are utilizing an affinity policy and are migrating multiple instances, you may run into an error stating no hosts are available to migrate to. To work around this issue, specify a target host when migrating these instances, which will bypass the nova-scheduler. Ensure that the target host you choose has the resources available to host the instances.

The nova host-evacuate-live command will produce an error if you have a compute host with a mix of instances that use local ephemeral storage and instances that are booted from a block storage volume or have any number of block storage volumes attached. If you have a mix of these instance types, you may need to run the command twice, utilizing the --block-migrate option. This is described in further detail in Section 13.1.3.3.6, “Troubleshooting migration or host evacuate issues”.
The migration options described in this document are not available on ESX compute hosts.
Ensure that you read and take into account any other limitations that exist in the release notes.
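The mixed-storage limitation described above can be worked around by running the evacuation in two passes. The following is a minimal sketch of that workaround (the host name is the example used in this section; the function exits early with a message when the nova CLI is unavailable):

```shell
# Sketch: drain a host carrying both volume-backed and ephemeral instances.
# Pass 1 (no flag) moves the volume-backed instances; the ephemeral ones are
# rejected with a "not on shared storage" error and stay put. Pass 2, with
# --block-migrate, then moves the remaining ephemeral instances.
evacuate_mixed_host() {
    host=$1
    if ! command -v nova >/dev/null 2>&1; then
        echo "nova CLI not found; run this on the Cloud Lifecycle Manager"
        return 0
    fi
    nova host-evacuate-live "$host"
    nova host-evacuate-live --block-migrate "$host"
}

evacuate_mixed_host ardana-cp1-comp0001-mgmt
```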
13.1.3.3.3 Performing a Live Migration #
Cloud administrators can perform a migration on an instance using the Horizon dashboard, the API, or the CLI. Instances in a PAUSED state cannot be live migrated using the Horizon GUI; you will need to use the CLI to perform these.
We have documented different scenarios:
13.1.3.3.4 Migrating instances off of a failed compute host #
Log in to the Cloud Lifecycle Manager.
If the compute node is not already powered off, do so with this playbook:
cd ~/openstack/ardana/ansible ansible-playbook -i hosts/localhost bm-power-down.yml -e nodelist=<node_name>
Note: The value for <node_name> will be the name that Cobbler has when you run sudo cobbler system list from the Cloud Lifecycle Manager.

Source the admin credentials necessary to run administrative commands against the Nova API:
source ~/service.osrc
Force the nova-compute service to go down on the compute node:

nova service-force-down HOSTNAME nova-compute

Note: The value for HOSTNAME can be obtained by using nova host-list from the Cloud Lifecycle Manager.

Evacuate the instances off of the failed compute node. This will cause the nova-scheduler to rebuild the instances on other valid hosts. Any local ephemeral data on the instances is lost.
For single instances on a failed host:
nova evacuate <instance_uuid> <target_hostname>
For all instances on a failed host:
nova host-evacuate <hostname> [--target_host <target_hostname>]
When you have repaired the failed node and powered it back on, the nova-compute process will clean up the evacuated instances as it starts.
13.1.3.3.5 Migrating instances off of an active compute host #
Migrating instances using the Horizon dashboard
The Horizon dashboard offers a GUI method for performing live migrations.
Instances in a PAUSED state will not provide you the live migration option in Horizon, so you will need to use the CLI instructions in the next section to perform these.
Log into the Horizon dashboard with admin credentials.
Navigate to the Instances page under the Admin menu.

Next to the instance you want to migrate, select the drop-down menu and choose the Live Migrate option.
In the Live Migrate wizard you will see the compute host the instance currently resides on and a drop-down menu that allows you to choose the compute host you want to migrate the instance to. Select a destination host from that menu. You also have two checkboxes for additional options, which are described below:

Disk Over Commit - If this is not checked, the value will be False. If you check this box, it will allow you to override the check that occurs to ensure the destination host has the available disk space to host the instance.

Block Migration - If this is not checked, the value will be False. If you check this box, it will migrate the local disks by using block migration. Use this option if you are only using ephemeral storage on your instances. If you are using block storage for your instance, ensure this box is not checked.

To begin the live migration, click Submit.
Migrating instances using the NovaClient CLI
To perform migrations from the command-line, use the NovaClient.
The Cloud Lifecycle Manager node in your cloud environment should have
the NovaClient already installed. If you will be accessing your environment
through a different method, ensure that the NovaClient is
installed. You can do so using Python's pip package
manager.
To run the commands in the steps below, you need administrator
credentials. From the Cloud Lifecycle Manager, you can source the
service.osrc file which is provided that has the
necessary credentials:
source ~/service.osrc
Here are the steps to perform:
Log in to the Cloud Lifecycle Manager.
Identify the instances on the compute node you wish to migrate:
nova list --all-tenants --host <hostname>
Example showing a host with a single compute instance on it:
ardana > nova list --host ardana-cp1-comp0001-mgmt --all-tenants
+--------------------------------------+------+----------------------------------+--------+------------+-------------+-----------------------+
| ID                                   | Name | Tenant ID                        | Status | Task State | Power State | Networks              |
+--------------------------------------+------+----------------------------------+--------+------------+-------------+-----------------------+
| 553ba508-2d75-4513-b69a-f6a2a08d04e3 | test | 193548a949c146dfa1f051088e141f0b | ACTIVE | -          | Running     | adminnetwork=10.0.0.5 |
+--------------------------------------+------+----------------------------------+--------+------------+-------------+-----------------------+

When using live migration you can either specify a target host that the instance will be migrated to or you can omit the target to allow the nova-scheduler to choose a node for you. If you want to get a list of available hosts you can use this command:
nova host-list
Migrate the instance(s) on the compute node using the notes below.
If your instance is booted from a block storage volume or has any number of block storage volumes attached, use the nova live-migration command with this syntax:

nova live-migration <instance uuid> [<target compute host>]

If your instance has local (ephemeral) disk(s) only, or if your instance has a mix of ephemeral disk(s) and block storage volume(s), you should use the --block-migrate option:

nova live-migration --block-migrate <instance uuid> [<target compute host>]

Note: The [<target compute host>] argument is optional. If you do not specify a target host, the nova-scheduler will choose a node for you.

Multiple instances
If you want to live migrate all of the instances off a single compute host, you can use the nova host-evacuate-live command. Issuing the host-evacuate-live command will begin the live migration process.
If all of the instances on the host are using at least one local (ephemeral) disk, you should use this syntax:
nova host-evacuate-live --block-migrate <hostname>
Alternatively, if all of the instances are only using block storage volumes, omit the --block-migrate option:

nova host-evacuate-live <hostname>

Note: You can either let the nova-scheduler choose a suitable target host or you can specify one using the --target-host <hostname> switch. See nova help host-evacuate-live for details.
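After issuing the evacuation, you may want to wait for the source host to empty before taking it down. This is a hedged sketch, not part of the official procedure: the host name is the example used in this section, and the function exits early with a message when the nova CLI is unavailable.

```shell
# Sketch: poll until no instances remain on the evacuated host.
wait_until_drained() {
    host=$1
    if ! command -v nova >/dev/null 2>&1; then
        echo "nova CLI not found; run this on the Cloud Lifecycle Manager"
        return 0
    fi
    # Count the data rows of the "nova list" table (borders and the header
    # are skipped by the awk filter); loop until none remain.
    while :; do
        left=$(nova list --host "$host" --all-tenants \
                   | awk -F'|' 'NR > 3 && NF > 2' | wc -l)
        [ "$left" -eq 0 ] && break
        echo "$left instance(s) still on $host; waiting..."
        sleep 10
    done
    echo "$host drained"
}

wait_until_drained ardana-cp1-comp0001-mgmt
```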
13.1.3.3.6 Troubleshooting migration or host evacuate issues #
Issue: When attempting to use nova
host-evacuate-live against a node, you receive the error below:
$ nova host-evacuate-live ardana-cp1-comp0001-mgmt --target-host ardana-cp1-comp0003-mgmt
| Server UUID                          | Live Migration Accepted | Error Message |
| 95a7ded8-ebfc-4848-9090-2df378c88a4c | False                   | Error while live migrating instance: ardana-cp1-comp0001-mgmt is not on shared storage: Live migration can not be used without shared storage except a booted from volume VM which does not have a local disk. (HTTP 400) (Request-ID: req-9fd79670-a780-40ed-a515-c14e28e0a0a7) |
| 13ab4ef7-0623-4d00-bb5a-5bb2f1214be4 | False                   | Error while live migrating instance: ardana-cp1-comp0001-mgmt is not on shared storage: Live migration can not be used without shared storage except a booted from volume VM which does not have a local disk. (HTTP 400) (Request-ID: req-26834267-c3ec-4f8b-83cc-5193d6a394d6) |
Fix: This occurs when you are attempting to
live evacuate a host that contains instances booted from local storage and
you are not specifying --block-migrate in your command.
Re-attempt the live evacuation with this syntax:
nova host-evacuate-live --block-migrate <hostname> [--target-host <target_hostname>]
Issue: When attempting to use nova
host-evacuate-live against a node, you receive the error below:
$ nova host-evacuate-live --block-migrate ardana-cp1-comp0001-mgmt --target-host ardana-cp1-comp0003-mgmt
| Server UUID                          | Live Migration Accepted | Error Message |
| e9874122-c5dc-406f-9039-217d9258c020 | False                   | Error while live migrating instance: ardana-cp1-comp0001-mgmt is not on local storage: Block migration can not be used with shared storage. (HTTP 400) (Request-ID: req-60b1196e-84a0-4b71-9e49-96d6f1358e1a) |
| 84a02b42-9527-47ac-bed9-8fde1f98e3fe | False                   | Error while live migrating instance: ardana-cp1-comp0001-mgmt is not on local storage: Block migration can not be used with shared storage. (HTTP 400) (Request-ID: req-0cdf1198-5dbd-40f4-9e0c-e94aa1065112) |
Fix: This occurs when you are attempting to
live evacuate a host that contains instances booted from a block storage
volume and you are specifying --block-migrate in your
command. Re-attempt the live evacuation with this syntax:
nova host-evacuate-live <hostname> [--target-host <target_hostname>]
Issue: When attempting to use nova
live-migration against an instance, you receive the error below:
$ nova live-migration 2a13ffe6-e269-4d75-8e46-624fec7a5da0 ardana-cp1-comp0002-mgmt
ERROR (BadRequest): ardana-cp1-comp0001-mgmt is not on shared storage: Live migration can not be used without shared storage except a booted from volume VM which does not have a local disk. (HTTP 400) (Request-ID: req-158dd415-0bb7-4613-8529-6689265387e7)
Fix: This occurs when you are attempting to
live migrate an instance that was booted from local storage and you are not
specifying --block-migrate in your command. Re-attempt
the live migration with this syntax:
nova live-migration --block-migrate <instance_uuid> <target_hostname>
Issue: When attempting to use nova
live-migration against an instance, you receive the error below:
$ nova live-migration --block-migrate 84a02b42-9527-47ac-bed9-8fde1f98e3fe ardana-cp1-comp0001-mgmt
ERROR (BadRequest): ardana-cp1-comp0002-mgmt is not on local storage: Block migration can not be used with shared storage. (HTTP 400) (Request-ID: req-51fee8d6-6561-4afc-b0c9-7afa7dc43a5b)
Fix: This occurs when you are attempting to
live migrate an instance that was booted from a block storage volume and you
are specifying --block-migrate in your command.
Re-attempt the live migration with this syntax:
nova live-migration <instance_uuid> <target_hostname>
13.1.3.4 Adding Compute Node #
Adding a Compute Node allows you to add capacity.
13.1.3.4.1 Adding a SLES Compute Node #
Adding a SLES compute node allows you to add additional capacity for more virtual machines.
You may need to add additional SLES compute hosts for more virtual machine capacity or for another purpose; these steps will help you achieve this.
There are two methods you can use to add SLES compute hosts to your environment:
Adding SLES pre-installed compute hosts. This method does not require that the SLES ISO be on the Cloud Lifecycle Manager to complete.
Using the provided Ansible playbooks and Cobbler, SLES will be installed on your new compute hosts. This method requires that you provided a SUSE Linux Enterprise Server 12 SP3 ISO during the initial installation of your cloud, following the instructions at Book “Installing with Cloud Lifecycle Manager”, Chapter 19 “Installing SLES Compute”, Section 19.1 “SLES Compute Node Installation Overview”.
If you want to use the provided Ansible playbooks and Cobbler to setup and configure your SLES hosts and you did not have the SUSE Linux Enterprise Server 12 SP3 ISO on your Cloud Lifecycle Manager during your initial installation then ensure you look at the note at the top of that section before proceeding.
13.1.3.4.1.1 Prerequisites #
You need to ensure your input model files are properly setup for SLES compute host clusters. This must be done during the installation process of your cloud and is discussed further at Book “Installing with Cloud Lifecycle Manager”, Chapter 19 “Installing SLES Compute”, Section 19.3 “Using the Cloud Lifecycle Manager to Deploy SLES Compute Nodes” and Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 10 “Modifying Example Configurations for Compute Nodes”, Section 10.1 “SLES Compute Nodes”.
13.1.3.4.1.2 Adding a SLES compute node #
Adding pre-installed SLES compute hosts
This method requires that you have SUSE Linux Enterprise Server 12 SP3 pre-installed on the baremetal host prior to beginning these steps.
Ensure you have SUSE Linux Enterprise Server 12 SP3 pre-installed on your baremetal host.
Log in to the Cloud Lifecycle Manager.
Edit your ~/openstack/my_cloud/definition/data/servers.yml file to include the details about your new compute host(s).

For example, if you already had a cluster of three SLES compute hosts using the SLES-COMPUTE-ROLE role and needed to add a fourth, you would add your details to the bottom of the file in this format. Note that the IPMI details are left out because they are not needed since you pre-installed the SLES OS on your host(s).

- id: compute4
  ip-addr: 192.168.102.70
  role: SLES-COMPUTE-ROLE
  server-group: RACK1
You can find detailed descriptions of these fields in Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 6 “Configuration Objects”, Section 6.5 “Servers”. Ensure that you use the same role for any new SLES hosts you are adding as you specified on your existing SLES hosts.
Important: You will need to verify that the ip-addr value you choose for this host does not conflict with any other IP address in your cloud environment. You can confirm this by checking the ~/openstack/my_cloud/info/address_info.yml file on your Cloud Lifecycle Manager.

In your ~/openstack/my_cloud/definition/data/control_plane.yml file you will need to check the values for member-count, min-count, and max-count. If you specified them, ensure that they match up with your new total node count. For example, if you had previously specified member-count: 3 and are adding a fourth compute node, you will need to change that value to member-count: 4.

See Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 6 “Configuration Objects”, Section 6.2 “Control Plane” for more details.
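The IP-conflict check from the Important note above can be scripted. The following is a minimal sketch under the assumptions of this procedure (the address is the example used above; the file path is passed as a parameter so you can point it at your own address_info.yml):

```shell
# Sketch: check whether a candidate ip-addr already appears in the
# generated address info file before committing servers.yml.
check_ip_free() {
    new_ip=$1
    info_file=$2   # normally ~/openstack/my_cloud/info/address_info.yml
    if [ -f "$info_file" ] && grep -qF "$new_ip" "$info_file"; then
        echo "conflict: $new_ip is already allocated"
        return 1
    fi
    echo "ok: $new_ip not found in $(basename "$info_file")"
}

check_ip_free 192.168.102.70 ~/openstack/my_cloud/info/address_info.yml
```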
Commit the changes to git:
git add -A
git commit -a -m "Add node <name>"
Run the configuration processor and resolve any errors that are indicated:
cd ~/openstack/ardana/ansible ansible-playbook -i hosts/localhost config-processor-run.yml
Update your deployment directory:
cd ~/openstack/ardana/ansible ansible-playbook -i hosts/localhost ready-deployment.yml
Before proceeding, you may want to take a look at info/server_info.yml to see if the assignment of the node you have added is what you expect. It may not be, as nodes will not be numbered consecutively if any have previously been removed. This is to prevent loss of data; the config processor retains data about removed nodes and keeps their ID numbers from being reallocated. See Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 7 “Other Topics”, Section 7.3 “Persisted Data”, Section 7.3.1 “Persisted Server Allocations” for information on how this works.
[OPTIONAL] - Run the wipe_disks.yml playbook to ensure all of the non-OS partitions on your nodes are completely wiped prior to continuing with the installation.

Note: The wipe_disks.yml playbook is only meant to be run on systems immediately after running bm-reimage.yml. If used for any other case, it may not wipe all of the expected partitions.

You can find the value for <hostname> in ~/scratch/ansible/next/ardana/ansible/hosts.

cd ~/scratch/ansible/next/ardana/ansible/
ansible-playbook -i hosts/verb_hosts wipe_disks.yml --limit <hostname>
Complete the compute host deployment with this playbook:
cd ~/scratch/ansible/next/ardana/ansible/
ansible-playbook -i hosts/verb_hosts site.yml --tag "generate_hosts_file"
ansible-playbook -i hosts/verb_hosts site.yml --limit <hostname>
Adding SLES compute hosts with Ansible playbooks and Cobbler
These steps will show you how to add the new SLES compute host to your
servers.yml file and then run the playbooks that update
your cloud configuration. You will run these playbooks from the lifecycle
manager.
If you did not have the SUSE Linux Enterprise Server 12 SP3 ISO available on your Cloud Lifecycle Manager during your initial installation, it must be installed before proceeding further. Instructions can be found in Book “Installing with Cloud Lifecycle Manager”, Chapter 19 “Installing SLES Compute”.
When you are prepared to continue, use these steps:
Log in to your Cloud Lifecycle Manager.
Check out the site branch of your local git repository so you can begin to make the necessary edits:

cd ~/openstack/my_cloud/definition/data
git checkout site
Edit your ~/openstack/my_cloud/definition/data/servers.yml file to include the details about your new compute host(s).

For example, if you already had a cluster of three SLES compute hosts using the SLES-COMPUTE-ROLE role and needed to add a fourth, you would add your details to the bottom of the file in this format:

- id: compute4
  ip-addr: 192.168.102.70
  role: SLES-COMPUTE-ROLE
  server-group: RACK1
  mac-addr: e8:39:35:21:32:4e
  ilo-ip: 10.1.192.36
  ilo-password: password
  ilo-user: admin
  distro-id: sles12sp3-x86_64
You can find detailed descriptions of these fields in Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 6 “Configuration Objects”, Section 6.5 “Servers”. Ensure that you use the same role for any new SLES hosts you are adding as you specified on your existing SLES hosts.
Important: You will need to verify that the ip-addr value you choose for this host does not conflict with any other IP address in your cloud environment. You can confirm this by checking the ~/openstack/my_cloud/info/address_info.yml file on your Cloud Lifecycle Manager.

In your ~/openstack/my_cloud/definition/data/control_plane.yml file you will need to check the values for member-count, min-count, and max-count. If you specified them, ensure that they match up with your new total node count. For example, if you had previously specified member-count: 3 and are adding a fourth compute node, you will need to change that value to member-count: 4.

See Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 6 “Configuration Objects”, Section 6.2 “Control Plane” for more details.
Commit the changes to git:
git add -A
git commit -a -m "Add node <name>"
Run the configuration processor and resolve any errors that are indicated:
cd ~/openstack/ardana/ansible ansible-playbook -i hosts/localhost config-processor-run.yml
The following playbook confirms that your servers are accessible over their IPMI ports.
cd ~/openstack/ardana/ansible ansible-playbook -i hosts/localhost bm-power-status.yml -e nodelist=compute4
Add the new node into Cobbler:
cd ~/openstack/ardana/ansible ansible-playbook -i hosts/localhost cobbler-deploy.yml
Run the following playbook, ensuring that you specify only your UEFI SLES nodes using the nodelist. This playbook will reconfigure Cobbler for the nodes listed.
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook prepare-sles-grub2.yml -e nodelist=node1[,node2,node3]

Then you can image the node:
cd ~/openstack/ardana/ansible ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<node name>
Note: If you do not know the <node name>, you can get it by using sudo cobbler system list.

Before proceeding, you may want to take a look at info/server_info.yml to see if the assignment of the node you have added is what you expect. It may not be, as nodes will not be numbered consecutively if any have previously been removed. This is to prevent loss of data; the config processor retains data about removed nodes and keeps their ID numbers from being reallocated. See Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 7 “Other Topics”, Section 7.3 “Persisted Data”, Section 7.3.1 “Persisted Server Allocations” for information on how this works.
Update your deployment directory:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost ready-deployment.yml
[OPTIONAL] - Run the wipe_disks.yml playbook to ensure all of your non-OS partitions on your hosts are completely wiped prior to continuing with the installation. The wipe_disks.yml playbook is only meant to be run on systems immediately after running bm-reimage.yml. If used for any other case, it may not wipe all of the expected partitions.
cd ~/scratch/ansible/next/ardana/ansible/
ansible-playbook -i hosts/verb_hosts wipe_disks.yml --limit <hostname>
Note: You can obtain the <hostname> from the file ~/scratch/ansible/next/ardana/ansible/hosts/verb_hosts.
You should verify that the netmask, bootproto, and other necessary settings are correct, and if they are not, then redo them. See Book “Installing with Cloud Lifecycle Manager”, Chapter 19 “Installing SLES Compute” for details.
Complete the compute host deployment with these playbooks. For the last one, ensure you specify the compute hosts you have added with the --limit switch:
cd ~/scratch/ansible/next/ardana/ansible/
ansible-playbook -i hosts/verb_hosts site.yml --tag "generate_hosts_file"
ansible-playbook -i hosts/verb_hosts site.yml --limit <hostname>
13.1.3.4.1.3 Adding a new SLES compute node to monitoring #
If you want to add a new Compute node to the monitoring service checks, there is an additional playbook that must be run to ensure this happens:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts monasca-deploy.yml --tags "active_ping_checks"
13.1.3.5 Removing a Compute Node #
Removing a Compute node allows you to reduce your cloud's capacity.
If you need to remove a Compute node, these steps will help you achieve this.
13.1.3.5.1 Disable Provisioning on the Compute Host #
Get a list of the running Nova services; this provides the details needed to disable provisioning on the Compute host you want to remove:
nova service-list
In the example below and those that follow, the Compute node being removed is ardana-cp1-comp0002-mgmt:
$ nova service-list
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
| Id | Binary           | Host                     | Zone     | Status  | State | Updated_at                 | Disabled Reason |
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
| 1  | nova-conductor   | ardana-cp1-c1-m1-mgmt    | internal | enabled | up    | 2015-11-22T22:50:43.000000 | -               |
| 10 | nova-scheduler   | ardana-cp1-c1-m1-mgmt    | internal | enabled | up    | 2015-11-22T22:50:34.000000 | -               |
| 13 | nova-conductor   | ardana-cp1-c1-m3-mgmt    | internal | enabled | up    | 2015-11-22T22:50:43.000000 | -               |
| 16 | nova-conductor   | ardana-cp1-c1-m2-mgmt    | internal | enabled | up    | 2015-11-22T22:50:43.000000 | -               |
| 25 | nova-consoleauth | ardana-cp1-c1-m1-mgmt    | internal | enabled | up    | 2015-11-22T22:50:38.000000 | -               |
| 28 | nova-scheduler   | ardana-cp1-c1-m2-mgmt    | internal | enabled | up    | 2015-11-22T22:50:38.000000 | -               |
| 31 | nova-scheduler   | ardana-cp1-c1-m3-mgmt    | internal | enabled | up    | 2015-11-22T22:50:42.000000 | -               |
| 34 | nova-compute     | ardana-cp1-comp0001-mgmt | AZ1      | enabled | up    | 2015-11-22T22:50:35.000000 | -               |
| 37 | nova-compute     | ardana-cp1-comp0002-mgmt | AZ2      | enabled | up    | 2015-11-22T22:50:44.000000 | -               |
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
Disable the Nova service on the Compute node you want to remove, which ensures it is taken out of the scheduling rotation:
nova service-disable --reason "<enter reason here>" <node hostname> nova-compute
For example, to remove the ardana-cp1-comp0002-mgmt node shown in the output above:
$ nova service-disable --reason "hardware reallocation" ardana-cp1-comp0002-mgmt nova-compute
+--------------------------+--------------+----------+-----------------------+
| Host                     | Binary       | Status   | Disabled Reason       |
+--------------------------+--------------+----------+-----------------------+
| ardana-cp1-comp0002-mgmt | nova-compute | disabled | hardware reallocation |
+--------------------------+--------------+----------+-----------------------+
13.1.3.5.2 Remove the Compute Host from its Availability Zone #
If you configured the Compute host to be part of an availability zone, these steps will show you how to remove it.
Get a list of the running Nova services, which provides the details needed to remove a Compute node:
nova service-list
Here is an example; again, the Compute node being removed is ardana-cp1-comp0002-mgmt:
$ nova service-list
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------------+
| Id | Binary           | Host                     | Zone     | Status  | State | Updated_at                 | Disabled Reason       |
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------------+
| 1  | nova-conductor   | ardana-cp1-c1-m1-mgmt    | internal | enabled | up    | 2015-11-22T22:50:43.000000 | -                     |
| 10 | nova-scheduler   | ardana-cp1-c1-m1-mgmt    | internal | enabled | up    | 2015-11-22T22:50:34.000000 | -                     |
| 13 | nova-conductor   | ardana-cp1-c1-m3-mgmt    | internal | enabled | up    | 2015-11-22T22:50:43.000000 | -                     |
| 16 | nova-conductor   | ardana-cp1-c1-m2-mgmt    | internal | enabled | up    | 2015-11-22T22:50:43.000000 | -                     |
| 25 | nova-consoleauth | ardana-cp1-c1-m1-mgmt    | internal | enabled | up    | 2015-11-22T22:50:38.000000 | -                     |
| 28 | nova-scheduler   | ardana-cp1-c1-m2-mgmt    | internal | enabled | up    | 2015-11-22T22:50:38.000000 | -                     |
| 31 | nova-scheduler   | ardana-cp1-c1-m3-mgmt    | internal | enabled | up    | 2015-11-22T22:50:42.000000 | -                     |
| 34 | nova-compute     | ardana-cp1-comp0001-mgmt | AZ1      | enabled | up    | 2015-11-22T22:50:35.000000 | -                     |
| 37 | nova-compute     | ardana-cp1-comp0002-mgmt | AZ2      | enabled | up    | 2015-11-22T22:50:44.000000 | hardware reallocation |
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------------+
You can remove the Compute host from the availability zone it was a part of with this command:
nova aggregate-remove-host <availability zone> <nova hostname>
For the same example as the previous step, the ardana-cp1-comp0002-mgmt host was in the AZ2 availability zone, so you would use this command to remove it:
$ nova aggregate-remove-host AZ2 ardana-cp1-comp0002-mgmt
Host ardana-cp1-comp0002-mgmt has been successfully removed from aggregate 4
+----+------+-------------------+-------+-------------------------+
| Id | Name | Availability Zone | Hosts | Metadata                |
+----+------+-------------------+-------+-------------------------+
| 4  | AZ2  | AZ2               |       | 'availability_zone=AZ2' |
+----+------+-------------------+-------+-------------------------+
You can confirm the last two steps completed successfully by running another nova service-list.
Here is an example which confirms that the node has been disabled and that it has been removed from the availability zone:
$ nova service-list
+----+------------------+--------------------------+----------+----------+-------+----------------------------+-----------------------+
| Id | Binary           | Host                     | Zone     | Status   | State | Updated_at                 | Disabled Reason       |
+----+------------------+--------------------------+----------+----------+-------+----------------------------+-----------------------+
| 1  | nova-conductor   | ardana-cp1-c1-m1-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:33.000000 | -                     |
| 10 | nova-scheduler   | ardana-cp1-c1-m1-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:34.000000 | -                     |
| 13 | nova-conductor   | ardana-cp1-c1-m3-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:33.000000 | -                     |
| 16 | nova-conductor   | ardana-cp1-c1-m2-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:33.000000 | -                     |
| 25 | nova-consoleauth | ardana-cp1-c1-m1-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:28.000000 | -                     |
| 28 | nova-scheduler   | ardana-cp1-c1-m2-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:28.000000 | -                     |
| 31 | nova-scheduler   | ardana-cp1-c1-m3-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:32.000000 | -                     |
| 34 | nova-compute     | ardana-cp1-comp0001-mgmt | AZ1      | enabled  | up    | 2015-11-22T23:04:25.000000 | -                     |
| 37 | nova-compute     | ardana-cp1-comp0002-mgmt | nova     | disabled | up    | 2015-11-22T23:04:34.000000 | hardware reallocation |
+----+------------------+--------------------------+----------+----------+-------+----------------------------+-----------------------+
13.1.3.5.3 Use Live Migration to Move Any Instances on this Host to Other Hosts #
Verify whether the Compute node is currently hosting any instances. You can do this with the command below:
nova list --host=<nova hostname> --all_tenants=1
Here is an example below which shows that we have a single running instance on this node currently:
$ nova list --host=ardana-cp1-comp0002-mgmt --all_tenants=1
+--------------------------------------+--------+----------------------------------+--------+------------+-------------+-----------------+
| ID                                   | Name   | Tenant ID                        | Status | Task State | Power State | Networks        |
+--------------------------------------+--------+----------------------------------+--------+------------+-------------+-----------------+
| 78fdb938-a89c-4a0c-a0d4-b88f1555c3b9 | paul4d | 5e9998f1b1824ea9a3b06ad142f09ca5 | ACTIVE | -          | Running     | paul=10.10.10.7 |
+--------------------------------------+--------+----------------------------------+--------+------------+-------------+-----------------+
You will likely want to migrate this instance off of this node before removing it. You can do this with the live migration functionality within Nova. The command will look like this:
nova live-migration --block-migrate <nova instance ID>
Here is an example using the instance in the previous step:
$ nova live-migration --block-migrate 78fdb938-a89c-4a0c-a0d4-b88f1555c3b9
You can check the status of the migration using the same command from the previous step:
$ nova list --host=ardana-cp1-comp0002-mgmt --all_tenants=1
+--------------------------------------+--------+----------------------------------+-----------+------------+-------------+-----------------+
| ID                                   | Name   | Tenant ID                        | Status    | Task State | Power State | Networks        |
+--------------------------------------+--------+----------------------------------+-----------+------------+-------------+-----------------+
| 78fdb938-a89c-4a0c-a0d4-b88f1555c3b9 | paul4d | 5e9998f1b1824ea9a3b06ad142f09ca5 | MIGRATING | migrating  | Running     | paul=10.10.10.7 |
+--------------------------------------+--------+----------------------------------+-----------+------------+-------------+-----------------+
Run nova list again to see that the running instance has been migrated:
$ nova list --host=ardana-cp1-comp0002-mgmt --all_tenants=1
+----+------+-----------+--------+------------+-------------+----------+
| ID | Name | Tenant ID | Status | Task State | Power State | Networks |
+----+------+-----------+--------+------------+-------------+----------+
+----+------+-----------+--------+------------+-------------+----------+
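If the node hosts several instances, you can script the migration loop rather than repeating the steps by hand. A minimal sketch, assuming the legacy novaclient table format shown above (the `extract_instance_ids` helper is an illustration, not part of the product):

```shell
# Extract instance UUIDs (first table column) from `nova list` output.
extract_instance_ids() {
  awk -F'|' '$2 ~ /[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+/ {
    gsub(/[[:space:]]/, "", $2); print $2 }'
}

# Sample row copied from the example output above:
sample='| 78fdb938-a89c-4a0c-a0d4-b88f1555c3b9 | paul4d | 5e9998f1b1824ea9a3b06ad142f09ca5 | ACTIVE | - | Running | paul=10.10.10.7 |'
echo "$sample" | extract_instance_ids
# On a real cloud, each ID would then be migrated, for example:
#   nova list --host=<nova hostname> --all_tenants=1 | extract_instance_ids \
#     | while read -r id; do nova live-migration --block-migrate "$id"; done
```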
13.1.3.5.4 Disable Neutron Agents on Node to be Removed #
You should also locate and disable or remove neutron agents. To see the neutron agents running:
$ neutron agent-list | grep NODE_NAME
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
| id                                   | agent_type           | host                     | alive | admin_state_up | binary                    |
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
| 08f16dbc-4ba2-4c1d-a4a3-a2ff2526ebe4 | L3 agent             | ardana-cp1-comp0002-mgmt | :-)   | True           | neutron-l3-agent          |
| dbe4fe11-8f08-4306-8244-cc68e98bb770 | Metadata agent       | ardana-cp1-comp0002-mgmt | :-)   | True           | neutron-metadata-agent    |
| f0d262d1-7139-40c7-bdc2-f227c6dee5c8 | Open vSwitch agent   | ardana-cp1-comp0002-mgmt | :-)   | True           | neutron-openvswitch-agent |
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
$ neutron agent-update --admin-state-down 08f16dbc-4ba2-4c1d-a4a3-a2ff2526ebe4
$ neutron agent-update --admin-state-down dbe4fe11-8f08-4306-8244-cc68e98bb770
$ neutron agent-update --admin-state-down f0d262d1-7139-40c7-bdc2-f227c6dee5c8
13.1.3.5.5 Shut down or Stop the Nova and Neutron Services on the Compute Host #
To perform this step you have a few options. You can SSH into the Compute host and run the following commands:
sudo systemctl stop nova-compute
sudo systemctl stop neutron-*
Because the Neutron agent self-registers against the Neutron server, you may want to prevent the following services from coming back online. Here is how you can get the list:
sudo systemctl list-units neutron-* --all
Here are the results:
UNIT LOAD ACTIVE SUB DESCRIPTION
neutron-common-rundir.service loaded inactive dead Create /var/run/neutron
•neutron-dhcp-agent.service not-found inactive dead neutron-dhcp-agent.service
neutron-l3-agent.service loaded inactive dead neutron-l3-agent Service
neutron-lbaasv2-agent.service loaded inactive dead neutron-lbaasv2-agent Service
neutron-metadata-agent.service loaded inactive dead neutron-metadata-agent Service
•neutron-openvswitch-agent.service loaded failed failed neutron-openvswitch-agent Service
neutron-ovs-cleanup.service loaded inactive dead Neutron OVS Cleanup Service
LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.
7 loaded units listed.
To show all installed unit files use 'systemctl list-unit-files'.
For each loaded service, issue the command:
sudo systemctl disable <service-name>
In the above example, that would be every service except neutron-dhcp-agent.service, which is not loaded.
For example:
sudo systemctl disable neutron-common-rundir neutron-l3-agent neutron-lbaasv2-agent neutron-metadata-agent neutron-openvswitch-agent
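The "loaded" filter can also be applied programmatically. A sketch using a trimmed copy of the sample listing above (verify against your own systemctl output before disabling anything):

```shell
# Keep only the units whose LOAD column reads "loaded".
loaded_units() {
  awk '$2 == "loaded" {print $1}'
}

# Trimmed sample of the `systemctl list-units neutron-* --all` output above:
listing='neutron-common-rundir.service loaded inactive dead Create /var/run/neutron
neutron-dhcp-agent.service not-found inactive dead neutron-dhcp-agent.service
neutron-l3-agent.service loaded inactive dead neutron-l3-agent Service'

echo "$listing" | loaded_units
# The resulting names could then be fed to:
#   ... | xargs -r sudo systemctl disable
```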
Now you can shut down the node:
sudo shutdown now
OR
From the Cloud Lifecycle Manager, you can use the bm-power-down.yml playbook to shut down the node:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost bm-power-down.yml -e nodelist=<node name>
The <node name> value is the name corresponding to this node in Cobbler. You can run sudo cobbler system list to retrieve these names.
13.1.3.5.6 Delete the Compute Host from Nova #
Retrieve the list of Nova services:
nova service-list
Here is an example highlighting the Compute host we're going to remove:
$ nova service-list
+----+------------------+--------------------------+----------+----------+-------+----------------------------+-----------------------+
| Id | Binary | Host | Zone | Status | State | Updated_at | Disabled Reason |
+----+------------------+--------------------------+----------+----------+-------+----------------------------+-----------------------+
| 1 | nova-conductor | ardana-cp1-c1-m1-mgmt | internal | enabled | up | 2015-11-22T23:04:33.000000 | - |
| 10 | nova-scheduler | ardana-cp1-c1-m1-mgmt | internal | enabled | up | 2015-11-22T23:04:34.000000 | - |
| 13 | nova-conductor | ardana-cp1-c1-m3-mgmt | internal | enabled | up | 2015-11-22T23:04:33.000000 | - |
| 16 | nova-conductor | ardana-cp1-c1-m2-mgmt | internal | enabled | up | 2015-11-22T23:04:33.000000 | - |
| 25 | nova-consoleauth | ardana-cp1-c1-m1-mgmt | internal | enabled | up | 2015-11-22T23:04:28.000000 | - |
| 28 | nova-scheduler | ardana-cp1-c1-m2-mgmt | internal | enabled | up | 2015-11-22T23:04:28.000000 | - |
| 31 | nova-scheduler | ardana-cp1-c1-m3-mgmt | internal | enabled | up | 2015-11-22T23:04:32.000000 | - |
| 34 | nova-compute | ardana-cp1-comp0001-mgmt | AZ1 | enabled | up | 2015-11-22T23:04:25.000000 | - |
| 37 | nova-compute | ardana-cp1-comp0002-mgmt | nova | disabled | up | 2015-11-22T23:04:34.000000 | hardware reallocation |
+----+------------------+--------------------------+----------+----------+-------+----------------------------+-----------------------+
Delete the host from Nova using the command below:
nova service-delete <service ID>
Following our example above, you would use:
nova service-delete 37
Use the command below to confirm that the Compute host has been completely removed from Nova:
nova hypervisor-list
13.1.3.5.7 Delete the Compute Host from Neutron #
Multiple Neutron agents are running on the Compute node, and you must remove all of them using the neutron agent-delete command. In the example below, the L3 agent, Open vSwitch agent, and metadata agent are running:
$ neutron agent-list | grep NODE_NAME
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
| id                                   | agent_type           | host                     | alive | admin_state_up | binary                    |
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
| 08f16dbc-4ba2-4c1d-a4a3-a2ff2526ebe4 | L3 agent             | ardana-cp1-comp0002-mgmt | :-)   | False          | neutron-l3-agent          |
| dbe4fe11-8f08-4306-8244-cc68e98bb770 | Metadata agent       | ardana-cp1-comp0002-mgmt | :-)   | False          | neutron-metadata-agent    |
| f0d262d1-7139-40c7-bdc2-f227c6dee5c8 | Open vSwitch agent   | ardana-cp1-comp0002-mgmt | :-)   | False          | neutron-openvswitch-agent |
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
$ neutron agent-delete AGENT_ID
$ neutron agent-delete 08f16dbc-4ba2-4c1d-a4a3-a2ff2526ebe4
$ neutron agent-delete dbe4fe11-8f08-4306-8244-cc68e98bb770
$ neutron agent-delete f0d262d1-7139-40c7-bdc2-f227c6dee5c8
13.1.3.5.8 Remove the Compute Host from the servers.yml File and Run the Configuration Processor #
Complete these steps from the Cloud Lifecycle Manager to remove the Compute node:
Log in to the Cloud Lifecycle Manager
Edit your servers.yml file in the location below to remove references to the Compute node(s) you want to remove:
~/openstack/my_cloud/definition/data/servers.yml
You may also need to edit your control_plane.yml file to update the values for member-count, min-count, and max-count, if you used those, to ensure they reflect the proper number of nodes you are using.
See Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 6 “Configuration Objects”, Section 6.2 “Control Plane” for more details.
Commit the changes to git:
git commit -a -m "Remove node <name>"
Run the configuration processor:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost config-processor-run.yml
To free up the resources when running the configuration processor, use the switches remove_deleted_servers and free_unused_addresses. For more information, see Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 7 “Other Topics”, Section 7.3 “Persisted Data”.
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost config-processor-run.yml -e remove_deleted_servers="y" -e free_unused_addresses="y"
Update your deployment directory:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost ready-deployment.yml
13.1.3.5.9 Remove the Compute Host from Cobbler #
Complete these steps to remove the node from Cobbler:
Confirm the system name in Cobbler with this command:
sudo cobbler system list
Remove the system from Cobbler using this command:
sudo cobbler system remove --name=<node>
Run the cobbler-deploy.yml playbook to complete the process:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost cobbler-deploy.yml
13.1.3.5.10 Remove the Compute Host from Monitoring #
Once you have removed the Compute nodes, the alarms against them will trigger, so there are additional steps to take to resolve this.
To find all Monasca API servers:
tux > sudo cat /etc/haproxy/haproxy.cfg | grep MON
listen ardana-cp1-vip-public-MON-API-extapi-8070
bind ardana-cp1-vip-public-MON-API-extapi:8070 ssl crt /etc/ssl/private//my-public-cert-entry-scale
server ardana-cp1-c1-m1-mgmt-MON_API-8070 ardana-cp1-c1-m1-mgmt:8070 check inter 5000 rise 2 fall 5
server ardana-cp1-c1-m2-mgmt-MON_API-8070 ardana-cp1-c1-m2-mgmt:8070 check inter 5000 rise 2 fall 5
server ardana-cp1-c1-m3-mgmt-MON_API-8070 ardana-cp1-c1-m3-mgmt:8070 check inter 5000 rise 2 fall 5
listen ardana-cp1-vip-MON-API-mgmt-8070
bind ardana-cp1-vip-MON-API-mgmt:8070 ssl crt /etc/ssl/private//ardana-internal-cert
server ardana-cp1-c1-m1-mgmt-MON_API-8070 ardana-cp1-c1-m1-mgmt:8070 check inter 5000 rise 2 fall 5
server ardana-cp1-c1-m2-mgmt-MON_API-8070 ardana-cp1-c1-m2-mgmt:8070 check inter 5000 rise 2 fall 5
server ardana-cp1-c1-m3-mgmt-MON_API-8070 ardana-cp1-c1-m3-mgmt:8070 check inter 5000 rise 2 fall 5
In the above example, ardana-cp1-c1-m1-mgmt, ardana-cp1-c1-m2-mgmt, and ardana-cp1-c1-m3-mgmt are the Monasca API servers.
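Rather than reading haproxy.cfg by eye, the Monasca API host names can be extracted with a short filter. A sketch based on the server-line format shown above (the `mon_api_hosts` helper is an illustration):

```shell
# Print the unique backend host names from MON_API `server` lines.
mon_api_hosts() {
  awk '$1 == "server" && $2 ~ /MON_API/ { split($3, a, ":"); print a[1] }' | sort -u
}

# Two sample lines copied from the haproxy.cfg excerpt above:
cfg='server ardana-cp1-c1-m1-mgmt-MON_API-8070 ardana-cp1-c1-m1-mgmt:8070 check inter 5000 rise 2 fall 5
server ardana-cp1-c1-m2-mgmt-MON_API-8070 ardana-cp1-c1-m2-mgmt:8070 check inter 5000 rise 2 fall 5'

echo "$cfg" | mon_api_hosts
# On a controller you would instead run:
#   sudo cat /etc/haproxy/haproxy.cfg | mon_api_hosts
```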
You will want to SSH to each of the Monasca API servers and edit the
/etc/monasca/agent/conf.d/host_alive.yaml file to remove
references to the Compute node you removed. This will require
sudo access. The entries will look similar to the one
below:
- alive_test: ping
  built_by: HostAlive
  host_name: ardana-cp1-comp0001-mgmt
  name: ardana-cp1-comp0001-mgmt ping
Once you have removed the references on each of your Monasca API servers, you then need to restart the monasca-agent on each of those servers with this command:
tux > sudo service openstack-monasca-agent restart
With the Compute node references removed and the monasca-agent restarted, you can then delete the corresponding alarm to finish this process. To do so, we recommend using the Monasca CLI, which should be installed on each of your Monasca API servers by default:
monasca alarm-list --metric-dimensions hostname=<compute node deleted>
For example, if your Compute node looked like the example above then you would use this command to get the alarm ID:
monasca alarm-list --metric-dimensions hostname=ardana-cp1-comp0001-mgmt
You can then delete the alarm with this command:
monasca alarm-delete <alarm ID>
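If several alarms exist for the node, each ID must be deleted individually. A hedged sketch of that loop; the table-parsing helper assumes the standard OpenStack CLI table layout (verify your monasca client's column order first), and the sample alarm ID is made up for illustration:

```shell
# Pull 36-character alarm IDs out of the first column of a CLI table.
alarm_ids() {
  cut -s -d'|' -f2 | tr -d ' ' | grep -E '^[0-9a-f-]{36}$'
}

# Illustrative row only (not real output):
row='| f9935bcc-9641-4cbf-8224-0993a947ea83 | host_alive_check | ALARM |'
echo "$row" | alarm_ids
# On a Monasca API server:
#   monasca alarm-list --metric-dimensions hostname=<compute node deleted> \
#     | alarm_ids | while read -r id; do monasca alarm-delete "$id"; done
```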
13.1.4 Planned Network Maintenance #
Planned maintenance task for networking nodes.
13.1.4.1 Adding a Neutron Network Node #
Adding an additional Neutron networking node allows you to increase the performance of your cloud.
You may need to add an additional Neutron network node for increased performance or another purpose, and these steps will help you achieve this.
13.1.4.1.1 Prerequisites #
If you are using the mid-scale model then your networking nodes are already
separate and the roles are defined. If you are not already using this model
and wish to add separate networking nodes then you need to ensure that those
roles are defined. You can look in the ~/openstack/examples
folder on your Cloud Lifecycle Manager for the mid-scale example model files which
show how to do this. We have also added the basic edits that need to be made
below:
In your server_roles.yml file, ensure you have the NEUTRON-ROLE defined.
Path to file:
~/openstack/my_cloud/definition/data/server_roles.yml
Example snippet:
- name: NEUTRON-ROLE
  interface-model: NEUTRON-INTERFACES
  disk-model: NEUTRON-DISKS
In your net_interfaces.yml file, ensure you have the NEUTRON-INTERFACES defined.
Path to file:
~/openstack/my_cloud/definition/data/net_interfaces.yml
Example snippet:
- name: NEUTRON-INTERFACES
  network-interfaces:
  - device:
      name: hed3
    name: hed3
    network-groups:
    - EXTERNAL-VM
    - GUEST
    - MANAGEMENT
Create a disks_neutron.yml file, and ensure you have the NEUTRON-DISKS defined in it.
Path to file:
~/openstack/my_cloud/definition/data/disks_neutron.yml
Example snippet:
product:
  version: 2
disk-models:
- name: NEUTRON-DISKS
  volume-groups:
  - name: ardana-vg
    physical-volumes:
    - /dev/sda_root
    logical-volumes:
    # The policy is not to consume 100% of the space of each volume group.
    # 5% should be left free for snapshots and to allow for some flexibility.
    - name: root
      size: 35%
      fstype: ext4
      mount: /
    - name: log
      size: 50%
      mount: /var/log
      fstype: ext4
      mkfs-opts: -O large_file
    - name: crash
      size: 10%
      mount: /var/crash
      fstype: ext4
      mkfs-opts: -O large_file
Modify your control_plane.yml file, and ensure you have the NEUTRON-ROLE defined as well as the Neutron services added.
Path to file:
~/openstack/my_cloud/definition/data/control_plane.yml
Example snippet:
- allocation-policy: strict
  cluster-prefix: neut
  member-count: 1
  name: neut
  server-role: NEUTRON-ROLE
  service-components:
  - ntp-client
  - neutron-vpn-agent
  - neutron-dhcp-agent
  - neutron-metadata-agent
  - neutron-openvswitch-agent
You should also have one or more baremetal servers that meet the minimum hardware requirements for a network node which are documented in the Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 2 “Hardware and Software Support Matrix”.
13.1.4.1.2 Adding a network node #
These steps will show you how to add the new network node to your
servers.yml file and then run the playbooks that update
your cloud configuration. You will run these playbooks from the lifecycle
manager.
Log in to your Cloud Lifecycle Manager.
Check out the site branch of your local git repository so you can begin to make the necessary edits:
ardana > cd ~/openstack/my_cloud/definition/data
ardana > git checkout site
In the same directory, edit your servers.yml file to include the details about your new network node(s).
For example, if you already had a cluster of three network nodes and needed to add a fourth one, you would add its details to the bottom of the file in this format:
# network nodes
- id: neut3
  ip-addr: 10.13.111.137
  role: NEUTRON-ROLE
  server-group: RACK2
  mac-addr: "5c:b9:01:89:b6:18"
  nic-mapping: HP-DL360-6PORT
  ilo-ip: 10.1.12.91
  ilo-password: password
  ilo-user: admin
Important: You will need to verify that the ip-addr value you choose for this node does not conflict with any other IP address in your cloud environment. You can confirm this by checking the ~/openstack/my_cloud/info/address_info.yml file on your Cloud Lifecycle Manager.
In your control_plane.yml file, check the values for member-count, min-count, and max-count, if you specified them, to ensure that they match up with your new total node count. For example, if you had previously specified member-count: 3 and are adding a fourth network node, you will need to change that value to member-count: 4.
Commit the changes to git:
ardana > git commit -a -m "Add new networking node <name>"
Run the configuration processor:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
Update your deployment directory:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
Add the new node into Cobbler:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost cobbler-deploy.yml
Then you can image the node:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<hostname>
Note: If you do not know the <hostname>, you can get it by using sudo cobbler system list.
[OPTIONAL] - Run the wipe_disks.yml playbook to ensure all of your non-OS partitions on your nodes are completely wiped prior to continuing with the installation. The wipe_disks.yml playbook is only meant to be run on systems immediately after running bm-reimage.yml. If used for any other case, it may not wipe all of the expected partitions.
ardana > cd ~/scratch/ansible/next/ardana/ansible/
ardana > ansible-playbook -i hosts/verb_hosts wipe_disks.yml --limit <hostname>
Configure the operating system on the new networking node with this playbook:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <hostname>
Complete the networking node deployment with this playbook:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-deploy.yml --limit <hostname>
Run the site.yml playbook with the required tag so that all other services become aware of the new node:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts site.yml --tags "generate_hosts_file"
13.1.4.1.3 Adding a New Network Node to Monitoring #
If you want to add a new networking node to the monitoring service checks, there is an additional playbook that must be run to ensure this happens:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts monasca-deploy.yml --tags "active_ping_checks"
13.1.5 Planned Storage Maintenance #
Planned maintenance procedures for Swift storage nodes.
13.1.5.1 Planned Maintenance Tasks for Swift Nodes #
Planned maintenance tasks including recovering, adding, and removing Swift nodes.
13.1.5.1.1 Adding a Swift Object Node #
Adding additional object nodes allows you to increase capacity.
This topic describes how to add additional Swift object server nodes to an existing system.
13.1.5.1.1.1 To add a new node #
To add a new node to your cloud, you will need to add it to
servers.yml, and then run the scripts that update your
cloud configuration. To begin, access the servers.yml
file by checking out the Git branch where you are required to make
the changes:
Then, perform the following steps to add a new node:
Log in to the Cloud Lifecycle Manager node.
Get the servers.yml file stored in Git:
cd ~/openstack/my_cloud/definition/data
git checkout site
If not already done, set the weight-step attribute. For instructions, see Section 8.5.2, “Using the Weight-Step Attributes to Prepare for Ring Changes”.
Add the details of the new nodes to the servers.yml file. In the following example, only one new server, swobj4, is added. However, you can add multiple servers by providing the server details in the servers.yml file:
servers:
...
- id: swobj4
  role: SWOBJ_ROLE
  server-group: <server-group-name>
  mac-addr: <mac-address>
  nic-mapping: <nic-mapping-name>
  ip-addr: <ip-address>
  ilo-ip: <ilo-ip-address>
  ilo-user: <ilo-username>
  ilo-password: <ilo-password>
Commit your changes:
git add -A
git commit -m "Add Node <name>"
Note: Before you run any playbooks, remember that you need to export the encryption key in the following environment variable:
export HOS_USER_PASSWORD_ENCRYPT_KEY=ENCRYPTION_KEY
For instructions, see Book “Installing with Cloud Lifecycle Manager”, Chapter 18 “Installation for SUSE OpenStack Cloud Entry-scale Cloud with Swift Only”.
Run the configuration processor:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost config-processor-run.yml
Create a deployment directory:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost ready-deployment.yml
Configure Cobbler to include the new node, and then reimage the node (if you are adding several nodes, use a comma-separated list for the nodelist argument):
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost cobbler-deploy.yml
ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<server-id>
In the following example, the server id is swobj4 (mentioned in step 3):
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost cobbler-deploy.yml
ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=swobj4
Note: You must use the server id as it appears in the servers.yml file, in the field server-id.
Configure the operating system:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <hostname>
The hostname of the newly added server can be found in the list generated from the output of the following command:
grep hostname ~/openstack/my_cloud/info/server_info.yml
For example, for swobj4, the hostname is ardana-cp1-swobj0004-mgmt.
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit ardana-cp1-swobj0004-mgmt
Validate that the disk drives of the new node are compatible with the disk model used by the node:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-compare-model-rings.yml --limit SWF*
If any errors occur, correct them. For instructions, see Section 15.6.2.3, “Interpreting Swift Input Model Validation Errors”.
Run the following playbook to ensure that the hosts files on all other servers are updated with the new server:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts site.yml --tags "generate_hosts_file"
Run the
ardana-deploy.yml playbook to rebalance the rings to include the node, deploy the rings, and configure the new node. Do not limit this to just the node (swobj4) that you are adding:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-deploy.yml
You may need to perform further rebalances of the rings. For instructions, see the "Weight Change Phase of Ring Rebalance" and the "Final Rebalance Phase" sections of Section 8.5.5, “Applying Input Model Changes to Existing Rings”.
For example:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-update-from-model-rebalance-rings.yml
13.1.5.1.2 Adding a Swift Proxy, Account, Container (PAC) Node #
Steps for adding additional PAC nodes to your Swift system.
This topic describes how to add additional Swift proxy, account, and container (PAC) servers to an existing system.
13.1.5.1.2.1 Adding a new node #
To add a new node to your cloud, you will need to add it to
servers.yml, and then run the scripts that update your
cloud configuration. To begin, access the servers.yml
file by checking out the Git branch where you are required to make
the changes:
Then, perform the following steps to add a new node:
Log in to the Cloud Lifecycle Manager.
Get the
servers.yml file stored in Git:
cd ~/openstack/my_cloud/definition/data
git checkout site
If not already done, set the weight-step attribute. For instructions, see Section 8.5.2, “Using the Weight-Step Attributes to Prepare for Ring Changes”.
Add details of new nodes to the
servers.yml file:
servers:
  ...
  - id: swpac6
    role: SWPAC-ROLE
    server-group: <server-group-name>
    mac-addr: <mac-address>
    nic-mapping: <nic-mapping-name>
    ip-addr: <ip-address>
    ilo-ip: <ilo-ip-address>
    ilo-user: <ilo-username>
    ilo-password: <ilo-password>
In the above example, only one new server, swpac6, is added. However, you can add multiple servers by providing the server details in the servers.yml file.
In the entry-scale configurations there is no dedicated Swift PAC cluster. Instead, there is a cluster using servers that have a role of CONTROLLER-ROLE. You cannot add the new node to this cluster because that would change the member-count. If your system does not already have a dedicated Swift PAC cluster, you will need to add one to the configuration files. For details on how to do this, see Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 11 “Modifying Example Configurations for Object Storage using Swift”, Section 11.7 “Creating a Swift Proxy, Account, and Container (PAC) Cluster”.
If you are adding a new PAC node, you must add its configuration details in the following YAML files:
control_plane.yml
disks_pac.yml
net_interfaces.yml
servers.yml
server_roles.yml
You can see a good example of this in the example configurations for the mid-scale model in the
~/openstack/examples/mid-scale-kvm directory.
The following steps assume that you have already created a dedicated Swift PAC cluster and that it has two members (swpac4 and swpac5).
Increase the member count of the Swift PAC cluster, as appropriate. For example, if you are adding swpac6 and you previously had two Swift PAC nodes, the increased member count should be 3 as shown in the following example:
control-planes:
  - name: control-plane-1
    control-plane-prefix: cp1
    ...
    clusters:
      ...
      - name: ...
        cluster-prefix: c2
        server-role: SWPAC-ROLE
        member-count: 3
        ...
Commit your changes:
git add -A
git commit -m "Add Node <name>"
Note: Before you run any playbooks, remember that you need to export the encryption key in the following environment variable:
export HOS_USER_PASSWORD_ENCRYPT_KEY=ENCRYPTION_KEY
For instructions, see Book “Installing with Cloud Lifecycle Manager”, Chapter 18 “Installation for SUSE OpenStack Cloud Entry-scale Cloud with Swift Only”.
Run the configuration processor:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost config-processor-run.yml
Create a deployment directory:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost ready-deployment.yml
Configure Cobbler to include the new node and reimage the node (if you are adding several nodes, use a comma-separated list for the
nodelist argument):
ansible-playbook -i hosts/localhost cobbler-deploy.yml
ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<server-id>
In the following example, the server id is swpac6 (mentioned in step 3):
ansible-playbook -i hosts/localhost cobbler-deploy.yml
ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=swpac6
Note: You must use the server id as it appears in the file servers.yml, in the field server-id.
Review the cloudConfig.yml and data/control_plane.yml files to get the host prefix (for example, openstack) and the control plane name (for example, cp1). This gives you the hostname of the node. Configure the operating system:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <hostname>
For example, for swpac6, the hostname is ardana-cp1-c2-m3-mgmt:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit ardana-cp1-c2-m3-mgmt
Validate that the disk drives of the new node are compatible with the disk model used by the node:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-compare-model-rings.yml
If any errors occur, correct them. For instructions, see Section 15.6.2.3, “Interpreting Swift Input Model Validation Errors”.
Run the following playbook to ensure that the hosts files on all other servers are updated with the new server:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts site.yml --tags "generate_hosts_file"
Run the
ardana-deploy.yml playbook to rebalance the rings to include the node, deploy the rings, and configure the new node. Do not limit this to just the node (swpac6) that you are adding:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-deploy.yml
You may need to perform further rebalances of the rings. For instructions, see the "Weight Change Phase of Ring Rebalance" and the "Final Rebalance Phase" sections of Section 8.5.5, “Applying Input Model Changes to Existing Rings”.
13.1.5.1.3 Adding Additional Disks to a Swift Node #
Steps for adding additional disks to any nodes hosting Swift services.
You may need to add additional disks to a node for Swift usage. These steps work for adding additional disks to Swift object or proxy, account, container (PAC) nodes. They also apply to adding additional disks to a controller node that hosts the Swift service, as in the entry-scale example models.
Read through the notes below before beginning the process.
You can add multiple disks at the same time; there is no need to do it one at a time.
You must add the same number of disks to each server that the disk model applies to. For example, if you have a single cluster of three Swift servers and you want to increase capacity and decide to add two additional disks, you must add two to each of your three Swift servers.
13.1.5.1.3.1 Adding additional disks to your Swift servers #
Verify the general health of the Swift system and that it is safe to rebalance your rings. See Section 8.5.4, “Determining When to Rebalance and Deploy a New Ring” for details on how to do this.
Perform the disk maintenance.
Shut down the first Swift server you wish to add disks to.
Add the additional disks to the physical server. The disk drives that are added should be clean. They should either contain no partitions or a single partition the size of the entire disk. It should not contain a file system or any volume groups. Failure to comply will cause errors and the disk will not be added.
For more details, see Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 11 “Modifying Example Configurations for Object Storage using Swift”, Section 11.6 “Swift Requirements for Device Group Drives”.
Power the server on.
While the server was shut down, data that normally would have been placed on the server was placed elsewhere. When the server is rebooted, the Swift replication process will move that data back onto the server. Monitor the replication process to determine when it is complete. See Section 8.5.4, “Determining When to Rebalance and Deploy a New Ring” for details on how to do this.
Repeat the steps from Step 2.a for each of the Swift servers you are adding the disks to, one at a time.
Note: If the additional disks can be added to the Swift servers online (for example, via hot-plugging), there is no need to perform the last two steps.
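Before rebooting, it can help to verify that a newly added drive really is clean. This sketch checks canned lsblk-style output for partitions and file systems; on a real node you would instead feed it the output of lsblk -nro NAME,TYPE,FSTYPE for the new device (the device name /dev/sdd is only an example):

```shell
# Canned sample standing in for: lsblk -nro NAME,TYPE,FSTYPE /dev/sdd
# (one partition, no file system -- an acceptable "clean" state)
DISK_INFO='sdd disk
sdd1 part'

# Count partition rows, and rows carrying a file system or LVM volume
PARTS=$(echo "$DISK_INFO" | awk '$2 == "part"' | wc -l)
FS=$(echo "$DISK_INFO" | awk '$3 != "" || $2 == "lvm"' | wc -l)

# Clean means: at most one partition, and no file system or volume group
if [ "$PARTS" -le 1 ] && [ "$FS" -eq 0 ]; then
    echo "disk looks clean"
else
    echo "disk is NOT clean"
fi
```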
On the Cloud Lifecycle Manager, update your cloud configuration with the details of your additional disks.
Edit the disk configuration file that correlates to the type of server you are adding your new disks to.
Path to the typical disk configuration files:
~/openstack/my_cloud/definition/data/disks_swobj.yml
~/openstack/my_cloud/definition/data/disks_swpac.yml
~/openstack/my_cloud/definition/data/disks_controller_*.yml
Example showing the addition of a single new disk, /dev/sdd:
device-groups:
  - name: SwiftObject
    devices:
      - name: "/dev/sdb"
      - name: "/dev/sdc"
      - name: "/dev/sdd"
    consumer:
      name: swift
...
Note: For more details on how the disk model works, see Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 6 “Configuration Objects”.
Configure the Swift weight-step value in the
~/openstack/my_cloud/definition/data/swift/rings.yml file. See Section 8.5.2, “Using the Weight-Step Attributes to Prepare for Ring Changes” for details on how to do this.
Commit the changes to Git:
cd ~/openstack
git commit -a -m "adding additional Swift disks"
Run the configuration processor:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost config-processor-run.yml
Update your deployment directory:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost ready-deployment.yml
Run the
osconfig-run.yml playbook against the Swift nodes to which you have added disks. Use the --limit switch to target the specific nodes:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <hostnames>
You can use a wildcard when specifying the hostnames with the
--limit switch. If you added disks to all of the Swift servers in your environment and they all have the same prefix (for example, ardana-cp1-swobj...), you can use a wildcard such as ardana-cp1-swobj*. If you only added disks to some of the nodes, use a comma-delimited list of the hostnames of the nodes you added disks to.
Validate your Swift configuration with the following playbook, which also provides details of each drive being added:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-compare-model-rings.yml --extra-vars "drive_detail=yes"
Verify that Swift services are running on all of your servers:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-status.yml
If everything looks okay with the Swift status, then apply the changes to your Swift rings with this playbook:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-deploy.yml
At this point your Swift rings will begin rebalancing. You should wait until replication has completed or min-part-hours has elapsed (whichever is longer), as described in Section 8.5.4, “Determining When to Rebalance and Deploy a New Ring” and then follow the "Weight Change Phase of Ring Rebalance" process as described in Section 8.5.5, “Applying Input Model Changes to Existing Rings”.
13.1.5.1.4 Removing a Swift Node #
Removal process for both Swift Object and PAC nodes.
You can use this process when you want to remove one or more Swift nodes permanently. This process applies to both Swift Proxy, Account, Container (PAC) nodes and Swift Object nodes.
13.1.5.1.4.1 Setting the Pass-through Attributes #
This process will remove the Swift node's drives from the rings and move its data to the remaining nodes in your cluster.
Log in to the Cloud Lifecycle Manager.
Ensure that the weight-step attribute is set. See Section 8.5.2, “Using the Weight-Step Attributes to Prepare for Ring Changes” for more details.
Add the pass-through definition to your input model, specifying the server ID (as opposed to the server name). It is easiest to include in your
~/openstack/my_cloud/definition/data/servers.yml file, since your server IDs are already listed in that file. For more information about pass-through, see Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 6 “Configuration Objects”, Section 6.17 “Pass Through”.
Here is the general format:
pass-through:
  servers:
    - id: <server-id>
      data:
        <subsystem>:
          <subsystem-attributes>
Here is an example:
---
product:
  version: 2
pass-through:
  servers:
    - id: ccn-0001
      data:
        swift:
          drain: yes
By setting this pass-through attribute, you indicate that the system should reduce the weight of the server's drives. The weight reduction is determined by the weight-step attribute, as described in the previous step. This process is known as "draining": you remove the Swift data from the node in preparation for removing the node.
Commit your configuration to the local Git repository (see Book “Installing with Cloud Lifecycle Manager”, Chapter 10 “Using Git for Configuration Management”), as follows:
cd ~/openstack/ardana/ansible
git add -A
git commit -m "My config or other commit message"
Run the configuration processor:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost config-processor-run.yml
Use the playbook to create a deployment directory:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost ready-deployment.yml
Run the Swift deploy playbook to perform the first ring rebuild. This will remove some of the partitions from all drives on the node you are removing:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-deploy.yml
Wait until the replication has completed. For further details, see Section 8.5.4, “Determining When to Rebalance and Deploy a New Ring”.
Determine whether all of the partitions have been removed from all drives on the Swift node you are removing. You can do this by SSH'ing into the first account server node and using these commands:
cd /etc/swiftlm/cloud1/cp1/builder_dir/
sudo swift-ring-builder <ring_name>.builder
For example, if the node you are removing was part of the object-0 ring, the command would be:
sudo swift-ring-builder object-0.builder
Check the output. You will need to know the IP address of the server being drained. In the example below, the number of partitions of the drives on 192.168.245.3 has reached zero for the object-0 ring:
$ cd /etc/swiftlm/cloud1/cp1/builder_dir/
$ sudo swift-ring-builder object-0.builder
object-0.builder, build version 6
4096 partitions, 3.000000 replicas, 1 regions, 1 zones, 6 devices, 0.00 balance, 0.00 dispersion
The minimum number of hours before a partition can be reassigned is 16
The overload factor is 0.00% (0.000000)
Devices: id region zone ip address port replication ip replication port name weight partitions balance meta
0 1 1 192.168.245.3 6002 192.168.245.3 6002 disk0 0.00 0 -0.00 padawan-ccp-c1-m1:disk0:/dev/sdc
1 1 1 192.168.245.3 6002 192.168.245.3 6002 disk1 0.00 0 -0.00 padawan-ccp-c1-m1:disk1:/dev/sdd
2 1 1 192.168.245.4 6002 192.168.245.4 6002 disk0 18.63 2048 -0.00 padawan-ccp-c1-m2:disk0:/dev/sdc
3 1 1 192.168.245.4 6002 192.168.245.4 6002 disk1 18.63 2048 -0.00 padawan-ccp-c1-m2:disk1:/dev/sdd
4 1 1 192.168.245.5 6002 192.168.245.5 6002 disk0 18.63 2048 -0.00 padawan-ccp-c1-m3:disk0:/dev/sdc
5 1 1 192.168.245.5 6002 192.168.245.5 6002 disk1 18.63 2048 -0.00 padawan-ccp-c1-m3:disk1:/dev/sdd
If the number of partitions is zero for the server on all rings, you can move to the next step; otherwise, continue the ring rebalance cycle by repeating steps 7-9 until the weight has reached zero.
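The partition check can be scripted rather than read by eye. The sketch below sums the partitions column of canned (abbreviated) swift-ring-builder device output for the IP address being drained; on the ring-builder node you would capture the real table with sudo swift-ring-builder object-0.builder:

```shell
DRAIN_IP=192.168.245.3

# Canned, abbreviated device rows from swift-ring-builder output:
# id region zone ip port repl-ip repl-port name weight partitions balance
RING_OUTPUT='0 1 1 192.168.245.3 6002 192.168.245.3 6002 disk0 0.00 0 -0.00
1 1 1 192.168.245.3 6002 192.168.245.3 6002 disk1 0.00 0 -0.00
2 1 1 192.168.245.4 6002 192.168.245.4 6002 disk0 18.63 2048 -0.00'

# Sum the partitions column (10th field) for devices on the drained IP
REMAINING=$(echo "$RING_OUTPUT" | awk -v ip="$DRAIN_IP" '$4 == ip { sum += $10 } END { print sum + 0 }')
echo "partitions remaining on $DRAIN_IP: $REMAINING"
```

A result of 0 for every ring means the drain is complete and the node's drives can be removed from the rings.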
If the number of partitions is zero for the server on all rings, you can remove the Swift nodes' drives from all rings. Edit the pass-through data you created in step #3 and set the
remove attribute as shown in this example:
---
product:
  version: 2
pass-through:
  servers:
    - id: ccn-0001
      data:
        swift:
          remove: yes
Commit your configuration to the local Git repository (see Book “Installing with Cloud Lifecycle Manager”, Chapter 10 “Using Git for Configuration Management”), as follows:
cd ~/openstack/ardana/ansible
git add -A
git commit -m "My config or other commit message"
Run the configuration processor:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost config-processor-run.yml
Update your deployment directory:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost ready-deployment.yml
Run the Swift deploy playbook to rebuild the rings by removing the server:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-deploy.yml
At this stage, the server has been removed from the rings and the data that was originally stored on the server has been replicated in a balanced way to the other servers in the system. You can proceed to the next phase.
13.1.5.1.4.2 To Disable Swift on a Node #
The next phase in this process will disable the Swift service on the node. In this example, swobj4 is the node being removed from Swift.
Log in to the Cloud Lifecycle Manager.
Stop Swift services on the node using the
swift-stop.yml playbook:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-stop.yml --limit <hostname>
Note: When using the --limit argument, you must specify the full hostname (for example, ardana-cp1-swobj0004) or use the wildcard * (for example, *swobj4*).
The following example uses the swift-stop.yml playbook to stop Swift services on ardana-cp1-swobj0004:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-stop.yml --limit ardana-cp1-swobj0004
Remove the configuration files:
ssh ardana-cp1-swobj4-mgmt
sudo rm -R /etc/swift
Note: Do not run any other playbooks until you have finished the process described in Section 13.1.5.1.4.3, “To Remove a Node from the Input Model”. Otherwise, these playbooks may recreate /etc/swift and restart Swift on swobj4. If you accidentally run a playbook, repeat the process in Section 13.1.5.1.4.2, “To Disable Swift on a Node”.
13.1.5.1.4.3 To Remove a Node from the Input Model #
Use the following steps to finish the process of removing the Swift node.
Log in to the Cloud Lifecycle Manager.
Edit the
~/openstack/my_cloud/definition/data/servers.yml file and remove the entry for the node (swobj4 in this example).
If this was a SWPAC node, reduce the member-count attribute by 1 in the
~/openstack/my_cloud/definition/data/control_plane.yml file. For SWOBJ nodes, no such action is needed.
Commit your configuration to the local Git repository (see Book “Installing with Cloud Lifecycle Manager”, Chapter 10 “Using Git for Configuration Management”), as follows:
cd ~/openstack/ardana/ansible
git add -A
git commit -m "My config or other commit message"
Run the configuration processor:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost config-processor-run.yml
You may want to use the
remove_deleted_servers and free_unused_addresses switches to free up the resources when running the configuration processor. For more information, see Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 7 “Other Topics”, Section 7.3 “Persisted Data”.
ansible-playbook -i hosts/localhost config-processor-run.yml -e remove_deleted_servers="y" -e free_unused_addresses="y"
Update your deployment directory:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost ready-deployment.yml
Validate the changes you have made to the configuration files using the playbook below before proceeding further:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-compare-model-rings.yml --limit SWF*
If any errors occur, correct them in your configuration files and repeat steps 3-5 until no errors occur before going to the next step.
For more details on how to interpret and resolve errors, see Section 15.6.2.3, “Interpreting Swift Input Model Validation Errors”.
Remove the node from Cobbler:
sudo cobbler system remove --name=swobj4
Run the Cobbler deploy playbook:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost cobbler-deploy.yml
The final step will depend on what type of Swift node you are removing.
If the node was a SWPAC node, run the
ardana-deploy.yml playbook:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-deploy.yml
If the node was a SWOBJ node, run the
swift-deploy.yml playbook:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-deploy.yml
Wait until replication has finished. For more details, see Section 8.5.4, “Determining When to Rebalance and Deploy a New Ring”.
You may need to continue to rebalance the rings. For instructions, see Final Rebalance Phase at Section 8.5.5, “Applying Input Model Changes to Existing Rings”.
13.1.5.1.4.4 Remove the Swift Node from Monitoring #
Once you have removed the Swift node(s), the alarms against them will trigger, so there are additional steps to take to resolve this issue.
SSH to each of the Monasca API servers and edit the /etc/monasca/agent/conf.d/host_alive.yaml file to remove references to the Swift node(s) you removed. This requires sudo access.
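The edit can be scripted. The sketch below removes matching lines from a sample file whose layout is only an assumption; verify it against your real host_alive.yaml (and back the file up) before using sed -i on a Monasca API server. Note that sed -i as written assumes GNU sed:

```shell
REMOVED_NODE=ardana-cp1-swobj0004-mgmt

# Canned sample standing in for /etc/monasca/agent/conf.d/host_alive.yaml
# (the layout is a simplified assumption)
cat > /tmp/host_alive_sample.yaml <<'EOF'
instances:
- host_name: ardana-cp1-swobj0004-mgmt
- host_name: ardana-cp1-c1-m1-mgmt
EOF

# Delete every line mentioning the removed node (GNU sed in-place edit)
sed -i "/$REMOVED_NODE/d" /tmp/host_alive_sample.yaml
cat /tmp/host_alive_sample.yaml
```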
Once you have removed the references on each of your Monasca API servers you then need to restart the monasca-agent on each of those servers with this command:
tux > sudo service openstack-monasca-agent restart
With the Swift node references removed and the monasca-agent restarted, you can then delete the corresponding alarms to finish this process. To do so, we recommend using the Monasca CLI, which is installed on each of your Monasca API servers by default:
monasca alarm-list --metric-dimensions hostname=<swift node deleted>
You can then delete the alarm with this command:
monasca alarm-delete <alarm ID>
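If several alarms exist, the IDs can be extracted from the alarm-list table and deleted in a loop. The sketch below parses canned sample output (the UUID is made up for illustration) and prints the delete commands as a dry run; on a real Monasca API server you would feed it the output of monasca alarm-list instead:

```shell
# Canned sample standing in for: monasca alarm-list --metric-dimensions hostname=<node>
ALARM_TABLE='+--------------------------------------+
| id                                   |
+--------------------------------------+
| 0e38e0b7-72a1-4515-9b6f-8e2e0a1b2c3d |
+--------------------------------------+'

# Keep only table rows whose first column looks like a UUID
IDS=$(echo "$ALARM_TABLE" | awk -F'|' '$2 ~ /^ *[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+ *$/ { gsub(/ /, "", $2); print $2 }')

# Dry run: print the delete command for each alarm ID
for id in $IDS; do
    echo "would run: monasca alarm-delete $id"
done
```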
13.1.5.1.5 Replacing a Swift Node #
Maintenance steps for replacing a failed Swift node in your environment.
This process is used when you want to replace a failed Swift node in your cloud.
If it applies to the server, do not skip step 10. If you do, the system will overwrite the existing rings with new rings. This will not cause data loss, but it may move most objects in your system to new locations and may make data unavailable until the replication process has completed.
13.1.5.1.5.1 How to replace a Swift node in your environment #
Log in to the Cloud Lifecycle Manager.
Update your cloud configuration with the details of your replacement Swift node.
Edit your
servers.yml file to include the details (MAC address, iLO user, iLO password, and iLO IP address, if these have changed) of your replacement Swift node.
Note: Do not change the server's IP address (that is, ip-addr).
Path to file:
~/openstack/my_cloud/definition/data/servers.yml
Example showing the fields to edit:
- id: swobj5
  role: SWOBJ-ROLE
  server-group: rack2
  mac-addr: 8c:dc:d4:b5:cb:bd
  nic-mapping: HP-DL360-6PORT
  ip-addr: 10.243.131.10
  ilo-ip: 10.1.12.88
  ilo-user: iLOuser
  ilo-password: iLOpass
...
Commit the changes to Git:
cd ~/openstack
git commit -a -m "replacing a Swift node"
Run the configuration processor:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost config-processor-run.yml
Update your deployment directory:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost ready-deployment.yml
Update Cobbler and reimage your replacement Swift node:
Obtain the Cobbler name of the node you wish to remove. You will use this value to replace <node name> in future steps.
sudo cobbler system list
Remove the replaced Swift node from Cobbler:
sudo cobbler system remove --name <node name>
Re-run the
cobbler-deploy.yml playbook to add the replacement node:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/localhost cobbler-deploy.yml
Reimage the node using this playbook:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<node name>
Complete the deployment of your replacement Swift node.
Obtain the hostname for your new Swift node. You will use this value to replace
<hostname> in future steps.
cat ~/openstack/my_cloud/info/server_info.yml
Configure the operating system on your replacement Swift node:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <hostname>
If this is the Swift ring builder server, restore the Swift ring builder files to the
/etc/swiftlm/CLOUD-NAME/CONTROL-PLANE-NAME/builder_dir directory. For more information and instructions, see Section 15.6.2.4, “Identifying the Swift Ring Building Server” and Section 15.6.2.7, “Recovering Swift Builder Files”.
Configure services on the node using the ardana-deploy.yml playbook. If you used an encryption password when running the configuration processor, include the --ask-vault-pass argument:
ansible-playbook -i hosts/verb_hosts ardana-deploy.yml --ask-vault-pass --limit <hostname>
13.1.5.1.6 Replacing Drives in a Swift Node #
Maintenance steps for replacing drives in a Swift node.
This process is used when you want to remove a failed hard drive from a Swift node and replace it with a new one.
There are two different classes of drives in a Swift node that may need to be replaced: the operating system disk drive (generally /dev/sda) and the storage disk drives. There are different procedures for replacing each class of drive to bring the node back to normal.
13.1.5.1.6.1 To Replace the Operating System Disk Drive #
After the operating system disk drive is replaced, the node must be reimaged.
Log in to the Cloud Lifecycle Manager.
Update your Cobbler profile:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/localhost cobbler-deploy.yml
Reimage the node using this playbook:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<server name>
In the example below, the swobj2 server is reimaged:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=swobj2
Review the
cloudConfig.yml and data/control_plane.yml files to get the host prefix (for example, openstack) and the control plane name (for example, cp1). This gives you the hostname of the node. Configure the operating system:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <hostname>
In the following example, for swobj2, the hostname is ardana-cp1-swobj0002:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit ardana-cp1-swobj0002*
If this is the first server running the swift-proxy service, restore the Swift Ring Builder files to the
/etc/swiftlm/CLOUD-NAME/CONTROL-PLANE-NAME/builder_dir directory. For more information and instructions, see Section 15.6.2.4, “Identifying the Swift Ring Building Server” and Section 15.6.2.7, “Recovering Swift Builder Files”.
Configure services on the node using the ardana-deploy.yml playbook. If you used an encryption password when running the configuration processor, include the --ask-vault-pass argument:
ansible-playbook -i hosts/verb_hosts ardana-deploy.yml --ask-vault-pass --limit <hostname>
For example:
ansible-playbook -i hosts/verb_hosts ardana-deploy.yml --ask-vault-pass --limit ardana-cp1-swobj0002*
13.1.5.1.6.2 To Replace a Storage Disk Drive #
After a storage drive is replaced, there is no need to reimage the server.
Instead, run the swift-reconfigure.yml playbook.
Log onto the Cloud Lifecycle Manager.
Run the following commands:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml --limit <hostname>
In the following example, the server used is swobj2:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml --limit ardana-cp1-swobj0002-mgmt
13.1.6 Updating MariaDB with Galera #
Updating MariaDB with Galera must be done manually. Updates are not installed automatically. In particular, this situation applies to upgrades to MariaDB 10.2.17 or higher from MariaDB 10.2.16 or earlier. See MariaDB 10.2.22 Release Notes - Notable Changes.
Using the CLI, update MariaDB with the following procedure:
Mark Galera as unmanaged:
crm resource unmanage galera
Or put the whole cluster into maintenance mode:
crm configure property maintenance-mode=true
Pick a node other than the one currently targeted by the load balancer and stop MariaDB on that node:
crm_resource --wait --force-demote -r galera -V
Perform updates:
Uninstall the old versions of MariaDB and the Galera wsrep provider.
Install the new versions of MariaDB and the Galera wsrep provider. Select the appropriate instructions at Installing MariaDB with zypper.
Change configuration options if necessary.
Start MariaDB on the node.
crm_resource --wait --force-promote -r galera -V
Run
mysql_upgrade with the --skip-write-binlog option.
On the other nodes, repeat the process detailed above: stop MariaDB, perform updates, start MariaDB, run mysql_upgrade.
Mark Galera as managed:
crm resource manage galera
Or take the cluster out of maintenance mode.
13.2 Unplanned System Maintenance #
Unplanned maintenance tasks for your cloud.
13.2.1 Whole Cloud Recovery Procedures #
Unplanned maintenance procedures for your whole cloud.
13.2.1.1 Full Disaster Recovery #
In this disaster scenario, you have lost everything in the cloud, including Swift.
13.2.1.1.1 Restore from a Swift backup: #
Restoring from a Swift backup is not possible because Swift is gone.
13.2.1.1.2 Restore from an SSH backup: #
Log in to the Cloud Lifecycle Manager.
Edit the following file so it contains the same information as it had previously:
~/openstack/my_cloud/config/freezer/ssh_credentials.yml
On the Cloud Lifecycle Manager copy the following files:
cp -r ~/hp-ci/openstack/* ~/openstack/my_cloud/definition/
Run this playbook to restore the Cloud Lifecycle Manager helper:
cd ~/openstack/ardana/ansible/
ansible-playbook -i hosts/localhost _deployer_restore_helper.yml
Run as root, and change directories:
sudo su
cd /root/deployer_restore_helper/
Execute the restore:
./deployer_restore_script.sh
Run this playbook to deploy your cloud:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts site.yml -e '{ "freezer_backup_jobs_upload": false }'
You can now perform the procedures to restore MySQL and Swift. Once everything is restored, re-enable the backups from the Cloud Lifecycle Manager:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts _freezer_manage_jobs.yml
13.2.1.2 Full Disaster Recovery Test #
13.2.1.2.1 Prerequisites #
SUSE OpenStack Cloud platform
An external server to store backups to via SSH
13.2.1.2.2 Goals #
Here is a high level view of how we expect to test the disaster recovery of the platform.
Back up the control plane using Freezer to an SSH target
Back up the Cassandra database
Re-install Controller 1 with the SUSE OpenStack Cloud ISO
Use Freezer to recover deployment data (model …)
Re-install SUSE OpenStack Cloud on Controller 1, 2, 3
Recover the Cassandra Database
Recover the backup of the MariaDB database
13.2.1.2.3 Description of the testing environment #
The testing environment is very similar to the Entry Scale model.
It uses five servers: three controllers and two computes.
Each controller node has three disks; the first is reserved for the system, while the others are used for Swift.
During this disaster recovery exercise, the data on disks 2 and 3 of the Swift controllers was preserved.
This allows the Swift objects to be restored after the recovery.
If these disks were wiped as well, Swift data would be lost but the procedure would not change.
The only difference is that Glance images would be lost and would have to be re-uploaded.
13.2.1.2.4 Disaster recovery test note #
Unless specified otherwise, all the commands should be executed on Controller 1, which is also the deployer node.
13.2.1.2.5 Pre-Disaster testing #
In order to validate the procedure after recovery, we need to create some workloads.
Source the service credential file
ardana > source ~/service.osrc
Copy an image to the platform and create a Glance image with it. In this example, Cirros is used
ardana > openstack image create --disk-format raw --container-format bare --public --file ~/cirros-0.3.5-x86_64-disk.img cirros
Create a network
ardana > openstack network create test_net
Create a subnet
ardana > neutron subnet-create 07c35d11-13f9-41d4-8289-fa92147b1d44 192.168.42.0/24 --name test_subnet
Create some instances
ardana > openstack server create server_1 --image 411a0363-7f4b-4bbc-889c-b9614e2da52e --flavor m1.small --nic net-id=07c35d11-13f9-41d4-8289-fa92147b1d44
ardana > openstack server create server_2 --image 411a0363-7f4b-4bbc-889c-b9614e2da52e --flavor m1.small --nic net-id=07c35d11-13f9-41d4-8289-fa92147b1d44
ardana > openstack server create server_3 --image 411a0363-7f4b-4bbc-889c-b9614e2da52e --flavor m1.small --nic net-id=07c35d11-13f9-41d4-8289-fa92147b1d44
ardana > openstack server create server_4 --image 411a0363-7f4b-4bbc-889c-b9614e2da52e --flavor m1.small --nic net-id=07c35d11-13f9-41d4-8289-fa92147b1d44
ardana > openstack server create server_5 --image 411a0363-7f4b-4bbc-889c-b9614e2da52e --flavor m1.small --nic net-id=07c35d11-13f9-41d4-8289-fa92147b1d44
ardana > openstack server list
Create containers and objects
ardana > swift upload container_1 ~/service.osrc
var/lib/ardana/service.osrc
ardana > swift upload container_1 ~/backup.osrc
var/lib/ardana/backup.osrc
ardana > swift list container_1
var/lib/ardana/backup.osrc
var/lib/ardana/service.osrc
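The five near-identical server create commands above can be generated with a small loop rather than typed by hand. A minimal sketch, assuming the example image and network IDs from above; the commands are echoed so they can be reviewed before being executed:

```shell
# Build the "openstack server create" commands for server_1..server_5 in a loop.
# IMAGE_ID and NET_ID are the example IDs used above; replace with your own.
IMAGE_ID=411a0363-7f4b-4bbc-889c-b9614e2da52e
NET_ID=07c35d11-13f9-41d4-8289-fa92147b1d44
for i in 1 2 3 4 5; do
    # echoed rather than executed, so the commands can be inspected first
    echo "openstack server create server_${i} --image ${IMAGE_ID} --flavor m1.small --nic net-id=${NET_ID}"
done
```

Removing the `echo` runs the commands directly once they look correct.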
13.2.1.2.6 Preparation of the backup server #
13.2.1.2.6.1 Preparation to store Freezer backups #
In this example, we want to store the backups on the server 192.168.69.132
Freezer will connect with the user backupuser on port 22 and store the backups in the /mnt/backups/ directory.
Connect to the backup server
Create the user
root # useradd backupuser --create-home --home-dir /mnt/backups/
Switch to that user
root # su backupuser
Create the SSH keypair
backupuser > ssh-keygen -t rsa
# Just leave the default for the first question and do not set any passphrase
Generating public/private rsa key pair.
Enter file in which to save the key (/mnt/backups//.ssh/id_rsa):
Created directory '/mnt/backups//.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /mnt/backups//.ssh/id_rsa
Your public key has been saved in /mnt/backups//.ssh/id_rsa.pub
The key fingerprint is:
a9:08:ae:ee:3c:57:62:31:d2:52:77:a7:4e:37:d1:28 backupuser@padawan-ccp-c0-m1-mgmt
The key's randomart image is:
+---[RSA 2048]----+
|        o        |
| . . E + .       |
|  o . . + .      |
|   o + o +       |
|    + o o S .    |
|     . + o o     |
|      o + .      |
|.o .             |
|++o              |
+-----------------+
Add the public key to the list of the keys authorized to connect to that user on this server
backupuser > cat /mnt/backups/.ssh/id_rsa.pub >> /mnt/backups/.ssh/authorized_keys
Print the private key. This is what we will use for the backup configuration (ssh_credentials.yml file)
backupuser > cat /mnt/backups/.ssh/id_rsa
-----BEGIN RSA PRIVATE KEY-----
MIIEogIBAAKCAQEAvjwKu6f940IVGHpUj3ffl3eKXACgVr3L5s9UJnb15+zV3K5L
BZuor8MLvwtskSkgdXNrpPZhNCsWSkryJff5I335Jhr/e5o03Yy+RqIMrJAIa0X5
...
...
...
iBKVKGPhOnn4ve3dDqy3q7fS5sivTqCrpaYtByJmPrcJNjb2K7VMLNvgLamK/AbL
qpSTZjicKZCCl+J2+8lrKAaDWqWtIjSUs29kCL78QmaPOgEvfsw=
-----END RSA PRIVATE KEY-----
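The append to authorized_keys above can be made idempotent, so re-running the preparation never duplicates the key. A minimal sketch under a temporary directory for illustration; on the real backup server the paths would be /mnt/backups/.ssh/:

```shell
# Append a public key to authorized_keys only if it is not already present.
# Temporary directory stands in for /mnt/backups/.ssh on the backup server.
SSH_DIR=$(mktemp -d)/.ssh
mkdir -p "$SSH_DIR" && chmod 700 "$SSH_DIR"
echo "ssh-rsa AAAAexamplekey backupuser@example" > "$SSH_DIR/id_rsa.pub"   # placeholder key
touch "$SSH_DIR/authorized_keys" && chmod 600 "$SSH_DIR/authorized_keys"
# -x matches the whole line, -F treats the key as a fixed string
if ! grep -qxF "$(cat "$SSH_DIR/id_rsa.pub")" "$SSH_DIR/authorized_keys"; then
    cat "$SSH_DIR/id_rsa.pub" >> "$SSH_DIR/authorized_keys"
fi
```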
13.2.1.2.6.2 Preparation to store Cassandra backups #
In this example, we want to store the backups on the server 192.168.69.132. We will store the backups in the /mnt/backups/cassandra_backups/ directory.
Create a directory on the backup server to store cassandra backups
backupuser > mkdir /mnt/backups/cassandra_backups
Copy the private ssh key from the backup server to all controller nodes
backupuser > scp /mnt/backups/.ssh/id_rsa ardana@CONTROLLER:~/.ssh/id_rsa_backup
Password:
id_rsa                100% 1675     1.6KB/s   00:00
Log in to each controller node and copy the private ssh key to the root user's .ssh directory
tux > sudo cp /var/lib/ardana/.ssh/id_rsa_backup /root/.ssh/
Verify that you can ssh to the backup server as the backup user using the private key
root # ssh -i ~/.ssh/id_rsa_backup backupuser@192.168.69.132
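The key distribution to each controller can also be scripted instead of repeating the scp command per node. A minimal sketch, assuming the example controller hostnames used in this document; the commands are echoed for review rather than executed:

```shell
# Build the per-controller scp commands for distributing the backup key.
# CONTROLLERS holds example hostnames; replace with your own control nodes.
CONTROLLERS="doc-cp1-c1-m1-mgmt doc-cp1-c1-m2-mgmt doc-cp1-c1-m3-mgmt"
for node in $CONTROLLERS; do
    # echoed rather than executed, so each command can be inspected first
    echo "scp /mnt/backups/.ssh/id_rsa ardana@${node}:~/.ssh/id_rsa_backup"
done
```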
13.2.1.2.7 Perform Backups for disaster recovery test #
13.2.1.2.7.1 Execute backup of Cassandra #
Create the cassandra-backup-extserver.sh script on all controller nodes where Cassandra runs. You can determine these nodes by running this command on the deployer:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible FND-CDB --list-hosts
root # cat > ~/cassandra-backup-extserver.sh << EOF
#!/bin/sh
# backup user
BACKUP_USER=backupuser
# backup server
BACKUP_SERVER=192.168.69.132
# backup directory
BACKUP_DIR=/mnt/backups/cassandra_backups/
# Setup variables
DATA_DIR=/var/cassandra/data/data
NODETOOL=/usr/bin/nodetool
# e.g. cassandra-snp-2018-06-26-1003
SNAPSHOT_NAME=cassandra-snp-\$(date +%F-%H%M)
HOST_NAME=\$(/bin/hostname)_
# Take a snapshot of cassandra database
\$NODETOOL snapshot -t \$SNAPSHOT_NAME monasca
# Collect a list of directories that make up the snapshot
SNAPSHOT_DIR_LIST=\$(find \$DATA_DIR -type d -name \$SNAPSHOT_NAME)
for d in \$SNAPSHOT_DIR_LIST
do
# copy snapshot directories to external server
rsync -avR -e "ssh -i /root/.ssh/id_rsa_backup" \$d \$BACKUP_USER@\$BACKUP_SERVER:\$BACKUP_DIR/\$HOST_NAME\$SNAPSHOT_NAME
done
\$NODETOOL clearsnapshot monasca
EOF
root # chmod +x ~/cassandra-backup-extserver.sh
Execute the following steps on all the controller nodes:
The ~/cassandra-backup-extserver.sh script should be executed on all three controller nodes at the same time (within seconds of each other) for a successful backup.
Edit the ~/cassandra-backup-extserver.sh script:
Set BACKUP_USER and BACKUP_SERVER to the desired backup user (for example, backupuser) and the desired backup server (for example, 192.168.69.132), respectively.
BACKUP_USER=backupuser
BACKUP_SERVER=192.168.69.132
BACKUP_DIR=/mnt/backups/cassandra_backups/
Execute ~/cassandra-backup-extserver.sh
root # ~/cassandra-backup-extserver.sh
(on all controller nodes which are also cassandra nodes)
Requested creating snapshot(s) for [monasca] with snapshot name [cassandra-snp-2018-06-28-0251] and options {skipFlush=false}
Snapshot directory: cassandra-snp-2018-06-28-0251
sending incremental file list
created directory /mnt/backups/cassandra_backups//doc-cp1-c1-m1-mgmt_cassandra-snp-2018-06-28-0251
/var/
/var/cassandra/
/var/cassandra/data/
/var/cassandra/data/data/
/var/cassandra/data/data/monasca/
...
...
...
/var/cassandra/data/data/monasca/measurements-e29033d0488d11e8bdabc32666406af1/snapshots/cassandra-snp-2018-06-28-0306/mc-72-big-Summary.db
/var/cassandra/data/data/monasca/measurements-e29033d0488d11e8bdabc32666406af1/snapshots/cassandra-snp-2018-06-28-0306/mc-72-big-TOC.txt
/var/cassandra/data/data/monasca/measurements-e29033d0488d11e8bdabc32666406af1/snapshots/cassandra-snp-2018-06-28-0306/schema.cql
sent 173,691 bytes  received 531 bytes  116,148.00 bytes/sec
total size is 171,378  speedup is 0.98
Requested clearing snapshot(s) for [monasca]
Verify the cassandra backup directory on the backup server
backupuser > ls -alt /mnt/backups/cassandra_backups
total 16
drwxr-xr-x 4 backupuser users 4096 Jun 28 03:06 .
drwxr-xr-x 3 backupuser users 4096 Jun 28 03:06 doc-cp1-c1-m2-mgmt_cassandra-snp-2018-06-28-0306
drwxr-xr-x 3 backupuser users 4096 Jun 28 02:51 doc-cp1-c1-m1-mgmt_cassandra-snp-2018-06-28-0251
drwxr-xr-x 8 backupuser users 4096 Jun 27 20:56 ..
backupuser > du -shx /mnt/backups/cassandra_backups/*
6.2G /mnt/backups/cassandra_backups/doc-cp1-c1-m1-mgmt_cassandra-snp-2018-06-28-0251
6.3G /mnt/backups/cassandra_backups/doc-cp1-c1-m2-mgmt_cassandra-snp-2018-06-28-0306
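The backup script names each upload directory on the backup server HOSTNAME_SNAPSHOT_NAME, with the snapshot name derived from the current date. A minimal sketch of that naming convention, useful when scripting around the backup directories:

```shell
# Reproduce the HOST_NAME$SNAPSHOT_NAME naming used by cassandra-backup-extserver.sh.
SNAPSHOT_NAME="cassandra-snp-$(date +%F-%H%M)"   # e.g. cassandra-snp-2018-06-26-1003
HOST_NAME="$(uname -n)_"                         # hostname with trailing underscore
BACKUP_PATH="/mnt/backups/cassandra_backups/${HOST_NAME}${SNAPSHOT_NAME}"
echo "$BACKUP_PATH"
```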
13.2.1.2.7.2 Execute backup of SUSE OpenStack Cloud #
Edit the configuration file for SSH backups (be careful to format the private key as requested: pipe on the first line and two spaces indentation). The private key is the key we created on the backup server earlier.
ardana > vi ~/openstack/my_cloud/config/freezer/ssh_credentials.yml
ardana > cat ~/openstack/my_cloud/config/freezer/ssh_credentials.yml
freezer_ssh_host: 192.168.69.132
freezer_ssh_port: 22
freezer_ssh_username: backupuser
freezer_ssh_base_dir: /mnt/backups
freezer_ssh_private_key: |
  -----BEGIN RSA PRIVATE KEY-----
  MIIEowIBAAKCAQEAyzhZ+F+sXQp70N8zCDDb6ORKAxreT/qD4zAetjOTuBoFlGb8
  pRBY79t9vNp7qvrKaXHBfb1OkKzhqyUwEqNcC9bdngABbb8KkCq+OkfDSAZRrmja
  wa5PzgtSaZcSJm9jQcF04Fq19mZY2BLK3OJL4qISp1DmN3ZthgJcpksYid2G3YG+
  bY/EogrQrdgHfcyLaoEkiBWQSBTEENKTKFBB2jFQYdmif3KaeJySv9cJqihmyotB
  s5YTdvB5Zn/fFCKG66THhKnIm19NftbJcKc+Y3Z/ZX4W9SpMSj5dL2YW0Y176mLy
  gMLyZK9u5k+fVjYLqY7XlVAFalv9+HZsvQ3OQQIDAQABAoIBACfUkqXAsrrFrEDj
  DlCDqwZ5gBwdrwcD9ceYjdxuPXyu9PsCOHBtxNC2N23FcMmxP+zs09y+NuDaUZzG
  vCZbCFZ1tZgbLiyBbiOVjRVFLXw3aNkDSiT98jxTMcLqTi9kU5L2xN6YSOPTaYRo
  IoSqge8YjwlmLMkgGBVU7y3UuCmE/Rylclb1EI9mMPElTF+87tYK9IyA2QbIJm/w
  4aZugSZa3PwUvKGG/TCJVD+JfrZ1kCz6MFnNS1jYT/cQ6nzLsQx7UuYLgpvTMDK6
  Fjq63TmVg9Z1urTB4dqhxzpDbTNfJrV55MuA/z9/qFHs649tFB1/hCsG3EqWcDnP
  mcv79nECgYEA9WdOsDnnCI1bamKA0XZxovb2rpYZyRakv3GujjqDrYTI97zoG+Gh
  gLcD1EMLnLLQWAkDTITIf8eurkVLKzhb1xlN0Z4xCLs7ukgMetlVWfNrcYEkzGa8
  wec7n1LfHcH5BNjjancRH0Q1Xcc2K7UgGe2iw/Iw67wlJ8i5j2Wq3sUCgYEA0/6/
  irdJzFB/9aTC8SFWbqj1DdyrpjJPm4yZeXkRAdn2GeLU2jefqPtxYwMCB1goeORc
  gQLspQpxeDvLdiQod1Y1aTAGYOcZOyAatIlOqiI40y3Mmj8YU/KnL7NMkaYBCrJh
  aW//xo+l20dz52pONzLFjw1tW9vhCsG1QlrCaU0CgYB03qUn4ft4JDHUAWNN3fWS
  YcDrNkrDbIg7MD2sOIu7WFCJQyrbFGJgtUgaj295SeNU+b3bdCU0TXmQPynkRGvg
  jYl0+bxqZxizx1pCKzytoPKbVKCcw5TDV4caglIFjvoz58KuUlQSKt6rcZMHz7Oh
  BX4NiUrpCWo8fyh39Tgh7QKBgEUajm92Tc0XFI8LNSyK9HTACJmLLDzRu5d13nV1
  XHDhDtLjWQUFCrt3sz9WNKwWNaMqtWisfl1SKSjLPQh2wuYbqO9v4zRlQJlAXtQo
  yga1fxZ/oGlLVe/PcmYfKT91AHPvL8fB5XthSexPv11ZDsP5feKiutots47hE+fc
  U/ElAoGBAItNX4jpUfnaOj0mR0L+2R2XNmC5b4PrMhH/+XRRdSr1t76+RJ23MDwf
  SV3u3/30eS7Ch2OV9o9lr0sjMKRgBsLZcaSmKp9K0j/sotwBl0+C4nauZMUKDXqg
  uGCyWeTQdAOD9QblzGoWy6g3ZI+XZWQIMt0pH38d/ZRbuSUk5o5v
  -----END RSA PRIVATE KEY-----
Save the modifications in the GIT repository
ardana > cd ~/openstack/
ardana > git add -A
ardana > git commit -a -m "SSH backup configuration"
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
Create the Freezer jobs
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts _freezer_manage_jobs.yml
Wait until all the SSH backup jobs have finished running
Freezer backup jobs are scheduled at the interval specified in the job specification, so you will have to wait for that interval to elapse before the backup job runs.
To find the interval:
ardana > freezer job-list | grep SSH
| 34c1364692f64a328c38d54b95753844 | Ardana Default: deployer backup to SSH | 7 | success | scheduled | | |
| 944154642f624bb7b9ff12c573a70577 | Ardana Default: swift backup to SSH    | 1 | success | scheduled | | |
| 22c6bab7ac4d43debcd4f5a9c4c4bb19 | Ardana Default: mysql backup to SSH    | 1 | success | scheduled | | |
ardana > freezer job-show 944154642f624bb7b9ff12c573a70577
+-------------+---------------------------------------------------------------------------------+
| Field       | Value                                                                           |
+-------------+---------------------------------------------------------------------------------+
| Job ID      | 944154642f624bb7b9ff12c573a70577                                                |
| Client ID   | ardana-qe201-cp1-c1-m1-mgmt                                                     |
| User ID     | 33a6a77adc4b4799a79a4c3bd40f680d                                                |
| Session ID  |                                                                                 |
| Description | Ardana Default: swift backup to SSH                                             |
| Actions     | [{u'action_id': u'e8373b03ca4b41fdafd83f9ba7734bfa',                            |
|             | u'freezer_action': {u'action': u'backup',                                       |
|             | u'backup_name': u'freezer_swift_builder_dir_backup',                            |
|             | u'container': u'/mnt/backups/freezer_rings_backups',                            |
|             | u'log_config_append': u'/etc/freezer/agent-logging.conf',                       |
|             | u'max_level': 14,                                                               |
|             | u'path_to_backup': u'/etc/swiftlm/',                                            |
|             | u'remove_older_than': 90,                                                       |
|             | u'snapshot': True,                                                              |
|             | u'ssh_host': u'192.168.69.132',                                                 |
|             | u'ssh_key': u'/etc/freezer/ssh_key',                                            |
|             | u'ssh_port': u'22',                                                             |
|             | u'ssh_username': u'backupuser',                                                 |
|             | u'storage': u'ssh'},                                                            |
|             | u'max_retries': 5,                                                              |
|             | u'max_retries_interval': 60,                                                    |
|             | u'user_id': u'33a6a77adc4b4799a79a4c3bd40f680d'}]                               |
| Start Date  |                                                                                 |
| End Date    |                                                                                 |
| Interval    | 24 hours                                                                        |
+-------------+---------------------------------------------------------------------------------+
The Swift SSH backup job has an Interval of 24 hours, so the next backup would run after 24 hours.
In the default installation, the intervals for the various backup jobs are:
Table 13.1: Default Interval for Freezer backup jobs #
Job Name                               | Interval
Ardana Default: deployer backup to SSH | 48 hours
Ardana Default: mysql backup to SSH    | 12 hours
Ardana Default: swift backup to SSH    | 24 hours
You will have to wait for as long as 48 hours for all the backup jobs to run.
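When scripting around these jobs, the job ID for a given description can be pulled out of the freezer job-list output. A minimal sketch that parses a stubbed copy of the listing above; in practice you would pipe the real `freezer job-list` output instead of the SAMPLE variable:

```shell
# Extract a Freezer job ID by description from "freezer job-list"-style output.
# SAMPLE is a stubbed two-row table taken from the listing above.
SAMPLE='| 34c1364692f64a328c38d54b95753844 | Ardana Default: deployer backup to SSH | 7 | success | scheduled |
| 944154642f624bb7b9ff12c573a70577 | Ardana Default: swift backup to SSH | 1 | success | scheduled |'
# Field 2 (between the first two pipes) is the job ID; strip the padding spaces.
JOB_ID=$(echo "$SAMPLE" | grep 'swift backup to SSH' | awk -F'|' '{gsub(/ /, "", $2); print $2}')
echo "$JOB_ID"
```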
On the backup server, you can verify that the backup files are present
backupuser > ls -lah /mnt/backups/
total 16
drwxr-xr-x 2 backupuser users 4096 Jun 27  2017 bin
drwxr-xr-x 2 backupuser users 4096 Jun 29 14:04 freezer_database_backups
drwxr-xr-x 2 backupuser users 4096 Jun 29 14:05 freezer_lifecycle_manager_backups
drwxr-xr-x 2 backupuser users 4096 Jun 29 14:05 freezer_rings_backups
backupuser > du -shx *
4.0K bin
509M freezer_audit_logs_backups
2.8G freezer_database_backups
24G  freezer_lifecycle_manager_backups
160K freezer_rings_backups
13.2.1.2.8 Restore of the first controller #
Edit the SSH backup configuration (re-enter the same information as earlier)
ardana > vi ~/openstack/my_cloud/config/freezer/ssh_credentials.yml
Execute the restore helper. When prompted, enter the hostname the first controller had. In this example:
doc-cp1-c1-m1-mgmt
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost _deployer_restore_helper.yml
Execute the restore. When prompted, leave the first value empty (none) and validate the restore by typing 'yes'.
ardana > sudo su
cd /root/deployer_restore_helper/
./deployer_restore_script.sh
Create a restore file for Swift rings
ardana > nano swift_rings_restore.ini
ardana > cat swift_rings_restore.ini
[default]
action = restore
storage = ssh
# backup server ip
ssh_host = 192.168.69.132
# username to connect to the backup server
ssh_username = backupuser
ssh_key = /etc/freezer/ssh_key
# base directory for backups on the backup server
container = /mnt/backups/freezer_rings_backups
backup_name = freezer_swift_builder_dir_backup
restore_abs_path = /etc/swiftlm
log_file = /var/log/freezer-agent/freezer-agent.log
# hostname of the controller
hostname = doc-cp1-c1-m1-mgmt
overwrite = True
Execute the restore of the swift rings
ardana >freezer-agent --config ./swift_rings_restore.ini
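The restore file above can also be generated with a heredoc instead of being edited by hand, which is less error-prone when the same recovery is scripted. A minimal sketch using the example hostname and backup-server values from this document (written to /tmp here for illustration):

```shell
# Generate the Swift rings restore file with a heredoc.
# Values are the example ones used in this document; replace with your own.
cat > /tmp/swift_rings_restore.ini << 'EOF'
[default]
action = restore
storage = ssh
ssh_host = 192.168.69.132
ssh_username = backupuser
ssh_key = /etc/freezer/ssh_key
container = /mnt/backups/freezer_rings_backups
backup_name = freezer_swift_builder_dir_backup
restore_abs_path = /etc/swiftlm
log_file = /var/log/freezer-agent/freezer-agent.log
hostname = doc-cp1-c1-m1-mgmt
overwrite = True
EOF
```

The quoted 'EOF' delimiter prevents any shell expansion inside the file body.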
13.2.1.2.9 Re-deployment of controllers 1, 2 and 3 #
Change back to the default ardana user
Deactivate the freezer backup jobs (otherwise empty backups would be added on top of the current good backups)
ardana > nano ~/openstack/my_cloud/config/freezer/activate_jobs.yml
ardana > cat ~/openstack/my_cloud/config/freezer/activate_jobs.yml
# If set to false, We wont create backups jobs.
freezer_create_backup_jobs: false
# If set to false, We wont create restore jobs.
freezer_create_restore_jobs: true
Save the modification in the GIT repository
ardana > cd ~/openstack/
ardana > git add -A
ardana > git commit -a -m "De-Activate SSH backup jobs during re-deployment"
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
Run the cobbler-deploy.yml playbook
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost cobbler-deploy.yml
Run the bm-reimage.yml playbook limited to the second and third controller
ardana > ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=controller2,controller3
The controller2 and controller3 names can vary. You can use the bm-power-status.yml playbook in order to check the cobbler names of these nodes.
Run the site.yml playbook limited to the three controllers and localhost. In this example, this means: doc-cp1-c1-m1-mgmt, doc-cp1-c1-m2-mgmt, doc-cp1-c1-m3-mgmt and localhost
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts site.yml --limit doc-cp1-c1-m1-mgmt,doc-cp1-c1-m2-mgmt,doc-cp1-c1-m3-mgmt,localhost
13.2.1.2.10 Cassandra database restore #
Create a script cassandra-restore-extserver.sh on all controller nodes
root # cat > ~/cassandra-restore-extserver.sh << EOF
#!/bin/sh
# backup user
BACKUP_USER=backupuser
# backup server
BACKUP_SERVER=192.168.69.132
# backup directory
BACKUP_DIR=/mnt/backups/cassandra_backups/
# Setup variables
DATA_DIR=/var/cassandra
NODETOOL=/usr/bin/nodetool
HOST_NAME=\$(/bin/hostname)_
#Get snapshot name from command line.
if [ -z "\$*" ]
then
echo "usage \$0 <snapshot to restore>"
exit 1
fi
SNAPSHOT_NAME=\$1
# restore
rsync -av -e "ssh -i /root/.ssh/id_rsa_backup" \$BACKUP_USER@\$BACKUP_SERVER:\$BACKUP_DIR/\$HOST_NAME\$SNAPSHOT_NAME/ /
# set ownership of newly restored files
chown -R cassandra:cassandra \$DATA_DIR
# Get a list of snapshot directories that have files to be restored.
RESTORE_LIST=\$(find \$DATA_DIR -type d -name \$SNAPSHOT_NAME)
# use RESTORE_LIST to move snapshot files back into place of database.
for d in \$RESTORE_LIST
do
cd \$d
mv * ../..
KEYSPACE=\$(pwd | rev | cut -d '/' -f4 | rev)
TABLE_NAME=\$(pwd | rev | cut -d '/' -f3 |rev | cut -d '-' -f1)
\$NODETOOL refresh \$KEYSPACE \$TABLE_NAME
done
cd
# Cleanup snapshot directories
\$NODETOOL clearsnapshot \$KEYSPACE
EOF
root # chmod +x ~/cassandra-restore-extserver.sh
Execute the following steps on all the controller nodes:
Edit the ~/cassandra-restore-extserver.sh script:
Set BACKUP_USER and BACKUP_SERVER to the desired backup user (for example, backupuser) and the desired backup server (for example, 192.168.69.132), respectively.
BACKUP_USER=backupuser
BACKUP_SERVER=192.168.69.132
BACKUP_DIR=/mnt/backups/cassandra_backups/
Execute ~/cassandra-restore-extserver.sh SNAPSHOT_NAME
You can find SNAPSHOT_NAME by listing /mnt/backups/cassandra_backups on the backup server. All the directories are named in the format HOSTNAME_SNAPSHOT_NAME.
backupuser > ls -alt /mnt/backups/cassandra_backups
total 16
drwxr-xr-x 4 backupuser users 4096 Jun 28 03:06 .
drwxr-xr-x 3 backupuser users 4096 Jun 28 03:06 doc-cp1-c1-m2-mgmt_cassandra-snp-2018-06-28-0306
root # ~/cassandra-restore-extserver.sh cassandra-snp-2018-06-28-0306
receiving incremental file list
./
var/
var/cassandra/
var/cassandra/data/
var/cassandra/data/data/
var/cassandra/data/data/monasca/
var/cassandra/data/data/monasca/alarm_state_history-e6bbdc20488d11e8bdabc32666406af1/
var/cassandra/data/data/monasca/alarm_state_history-e6bbdc20488d11e8bdabc32666406af1/snapshots/
var/cassandra/data/data/monasca/alarm_state_history-e6bbdc20488d11e8bdabc32666406af1/snapshots/cassandra-snp-2018-06-28-0306/
var/cassandra/data/data/monasca/alarm_state_history-e6bbdc20488d11e8bdabc32666406af1/snapshots/cassandra-snp-2018-06-28-0306/manifest.json
var/cassandra/data/data/monasca/alarm_state_history-e6bbdc20488d11e8bdabc32666406af1/snapshots/cassandra-snp-2018-06-28-0306/mc-37-big-CompressionInfo.db
var/cassandra/data/data/monasca/alarm_state_history-e6bbdc20488d11e8bdabc32666406af1/snapshots/cassandra-snp-2018-06-28-0306/mc-37-big-Data.db
...
...
...
/usr/bin/nodetool clearsnapshot monasca
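The restore script derives the keyspace and table name from each snapshot directory path using `rev` and `cut`. A minimal sketch of that extraction applied to the example snapshot path above, which is handy for verifying the parsing before running the full restore:

```shell
# Reproduce the KEYSPACE/TABLE_NAME extraction used by cassandra-restore-extserver.sh.
# Path layout: .../data/<keyspace>/<table>-<uuid>/snapshots/<snapshot-name>
SNAP_DIR=/var/cassandra/data/data/monasca/alarm_state_history-e6bbdc20488d11e8bdabc32666406af1/snapshots/cassandra-snp-2018-06-28-0306
# 4th path component from the end is the keyspace
KEYSPACE=$(echo "$SNAP_DIR" | rev | cut -d '/' -f4 | rev)
# 3rd from the end is "<table>-<uuid>"; keep the part before the first dash
TABLE_NAME=$(echo "$SNAP_DIR" | rev | cut -d '/' -f3 | rev | cut -d '-' -f1)
echo "$KEYSPACE $TABLE_NAME"
```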
13.2.1.2.11 Databases restore #
13.2.1.2.11.1 MariaDB database restore #
Source the backup credentials file
ardana > source ~/backup.osrc
List Freezer jobs
Gather the ID of the job corresponding to the first controller with the description "mysql restore from SSH". For example:
ardana > freezer job-list | grep "mysql restore from SSH"
+----------------------------------+----------------------------------------+-----------+--------+--------+-------+------------+
| Job ID                           | Description                            | # Actions | Result | Status | Event | Session ID |
+----------------------------------+----------------------------------------+-----------+--------+--------+-------+------------+
| 64715c6ce8ed40e1b346136083923260 | Ardana Default: mysql restore from SSH | 1         |        | stop   |       |            |
ardana > freezer job-show 64715c6ce8ed40e1b346136083923260
+-------------+---------------------------------------------------------------------------------+
| Field       | Value                                                                           |
+-------------+---------------------------------------------------------------------------------+
| Job ID      | 64715c6ce8ed40e1b346136083923260                                                |
| Client ID   | doc-cp1-c1-m1-mgmt                                                              |
| User ID     | 33a6a77adc4b4799a79a4c3bd40f680d                                                |
| Session ID  |                                                                                 |
| Description | Ardana Default: mysql restore from SSH                                          |
| Actions     | [{u'action_id': u'19dfb0b1851e41c682716ecc6990b25b',                            |
|             | u'freezer_action': {u'action': u'restore',                                      |
|             | u'backup_name': u'freezer_mysql_backup',                                        |
|             | u'container': u'/mnt/backups/freezer_database_backups',                         |
|             | u'hostname': u'doc-cp1-c1-m1-mgmt',                                             |
|             | u'log_config_append': u'/etc/freezer/agent-logging.conf',                       |
|             | u'restore_abs_path': u'/tmp/mysql_restore/',                                    |
|             | u'ssh_host': u'192.168.69.132',                                                 |
|             | u'ssh_key': u'/etc/freezer/ssh_key',                                            |
|             | u'ssh_port': u'22',                                                             |
|             | u'ssh_username': u'backupuser',                                                 |
|             | u'storage': u'ssh'},                                                            |
|             | u'max_retries': 5,                                                              |
|             | u'max_retries_interval': 60,                                                    |
|             | u'user_id': u'33a6a77adc4b4799a79a4c3bd40f680d'}]                               |
| Start Date  |                                                                                 |
| End Date    |                                                                                 |
| Interval    |                                                                                 |
+-------------+---------------------------------------------------------------------------------+
Start the job using its ID
ardana > freezer job-start 64715c6ce8ed40e1b346136083923260
Start request sent for job 64715c6ce8ed40e1b346136083923260
Wait for the job result to be success
ardana > freezer job-list | grep "mysql restore from SSH"
| 64715c6ce8ed40e1b346136083923260 | Ardana Default: mysql restore from SSH | 1 |         | running   | | |
ardana > freezer job-list | grep "mysql restore from SSH"
| 64715c6ce8ed40e1b346136083923260 | Ardana Default: mysql restore from SSH | 1 | success | completed | | |
Verify that the files have been restored on the controller
ardana > sudo du -shx /tmp/mysql_restore/*
16K  /tmp/mysql_restore/aria_log.00000001
4.0K /tmp/mysql_restore/aria_log_control
3.4M /tmp/mysql_restore/barbican
8.0K /tmp/mysql_restore/ceilometer
4.2M /tmp/mysql_restore/cinder
2.9M /tmp/mysql_restore/designate
129M /tmp/mysql_restore/galera.cache
2.1M /tmp/mysql_restore/glance
4.0K /tmp/mysql_restore/grastate.dat
4.0K /tmp/mysql_restore/gvwstate.dat
2.6M /tmp/mysql_restore/heat
752K /tmp/mysql_restore/horizon
4.0K /tmp/mysql_restore/ib_buffer_pool
76M  /tmp/mysql_restore/ibdata1
128M /tmp/mysql_restore/ib_logfile0
128M /tmp/mysql_restore/ib_logfile1
12M  /tmp/mysql_restore/ibtmp1
16K  /tmp/mysql_restore/innobackup.backup.log
313M /tmp/mysql_restore/keystone
716K /tmp/mysql_restore/magnum
12M  /tmp/mysql_restore/mon
8.3M /tmp/mysql_restore/monasca_transform
0    /tmp/mysql_restore/multi-master.info
11M  /tmp/mysql_restore/mysql
4.0K /tmp/mysql_restore/mysql_upgrade_info
14M  /tmp/mysql_restore/nova
4.4M /tmp/mysql_restore/nova_api
14M  /tmp/mysql_restore/nova_cell0
3.6M /tmp/mysql_restore/octavia
208K /tmp/mysql_restore/opsconsole
38M  /tmp/mysql_restore/ovs_neutron
8.0K /tmp/mysql_restore/performance_schema
24K  /tmp/mysql_restore/tc.log
4.0K /tmp/mysql_restore/test
8.0K /tmp/mysql_restore/winchester
4.0K /tmp/mysql_restore/xtrabackup_galera_info
Repeat steps 2-5 on the other two controllers where the MariaDB/Galera database is running, which can be determined by running the following command on the deployer
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible FND-MDB --list-hosts
Stop SUSE OpenStack Cloud services on the three controllers (replace the hostnames of the controllers in the command)
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-stop.yml --limit doc-cp1-c1-m1-mgmt,doc-cp1-c1-m2-mgmt,doc-cp1-c1-m3-mgmt,localhost
Clean the mysql directory and copy the restored backup on all three controllers where the MariaDB/Galera database is running
root # cd /var/lib/mysql/
root # rm -rf ./*
root # cp -pr /tmp/mysql_restore/* ./
Switch back to the ardana user once the copy is finished
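The clean-and-copy step above can be sketched end to end. Here temporary directories stand in for /tmp/mysql_restore and /var/lib/mysql so the flow can be rehearsed safely; on the controllers you would use the real paths and run as root:

```shell
# Sketch of the clean-and-copy step, with temp dirs standing in for the real paths.
RESTORE_SRC=$(mktemp -d)   # stands in for /tmp/mysql_restore
MYSQL_DIR=$(mktemp -d)     # stands in for /var/lib/mysql
echo "dummy" > "$RESTORE_SRC/grastate.dat"   # placeholder restored file
echo "stale" > "$MYSQL_DIR/old.ibd"          # placeholder stale file to be removed
rm -rf "$MYSQL_DIR"/*                        # clean the mysql directory
cp -pr "$RESTORE_SRC"/* "$MYSQL_DIR"/        # copy the backup, preserving permissions (-p)
```

The `-p` flag matters: Galera will refuse to start if the restored files are not owned with the permissions the backup preserved.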
13.2.1.2.11.2 Restart SUSE OpenStack Cloud services #
Restart the MariaDB database. On the deployer node, execute the galera-bootstrap.yml playbook, which will automatically determine the log sequence number, bootstrap the main node, and start the database cluster.
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts galera-bootstrap.yml
If this process fails to recover the database cluster, refer to Section 13.2.2.1.2, “Recovering the MariaDB Database”. There, Scenario 3 covers the process of manually starting the database.
Restart SUSE OpenStack Cloud services limited to the three controllers (replace the hostnames of the controllers in the command).
ansible-playbook -i hosts/verb_hosts ardana-start.yml \ --limit doc-cp1-c1-m1-mgmt,doc-cp1-c1-m2-mgmt,doc-cp1-c1-m3-mgmt,localhost
Re-configure SUSE OpenStack Cloud
ansible-playbook -i hosts/verb_hosts ardana-reconfigure.yml
13.2.1.2.11.3 Re-enable SSH backups #
Re-activate Freezer backup jobs
ardana > vi ~/openstack/my_cloud/config/freezer/activate_jobs.yml
ardana > cat ~/openstack/my_cloud/config/freezer/activate_jobs.yml
# If set to false, We wont create backups jobs.
freezer_create_backup_jobs: true
# If set to false, We wont create restore jobs.
freezer_create_restore_jobs: true
Save the modifications in the GIT repository
cd ~/openstack/ardana/ansible/
git add -A
git commit -a -m "Re-Activate SSH backup jobs"
ansible-playbook -i hosts/localhost config-processor-run.yml
ansible-playbook -i hosts/localhost ready-deployment.yml
Create Freezer jobs
cd ~/scratch/ansible/next/ardana/ansible ansible-playbook -i hosts/verb_hosts _freezer_manage_jobs.yml
13.2.1.2.12 Post restore testing #
Source the service credential file
ardana > source ~/service.osrc
Swift
ardana > swift list
container_1
volumebackups
ardana > swift list container_1
var/lib/ardana/backup.osrc
var/lib/ardana/service.osrc
ardana > swift download container_1 var/lib/ardana/backup.osrc -o /tmp/backup.osrc
Neutron
ardana > openstack network list
+--------------------------------------+----------+--------------------------------------+
| ID                                   | Name     | Subnets                              |
+--------------------------------------+----------+--------------------------------------+
| 07c35d11-13f9-41d4-8289-fa92147b1d44 | test_net | 02d5ca3b-1133-4a74-a9ab-1f1dc2853ec8 |
+--------------------------------------+----------+--------------------------------------+
Glance
ardana > openstack image list
+--------------------------------------+---------------------+--------+
| ID                                   | Name                | Status |
+--------------------------------------+---------------------+--------+
| 411a0363-7f4b-4bbc-889c-b9614e2da52e | cirros-0.4.0-x86_64 | active |
+--------------------------------------+---------------------+--------+
ardana > openstack image save --file /tmp/cirros f751c39b-f1e3-4f02-8332-3886826889ba
ardana > ls -lah /tmp/cirros
-rw-r--r-- 1 ardana ardana 12716032 Jul 2 20:52 /tmp/cirros
Nova
ardana > openstack server list
ardana > openstack server create server_6 --image 411a0363-7f4b-4bbc-889c-b9614e2da52e --flavor m1.small --nic net-id=07c35d11-13f9-41d4-8289-fa92147b1d44
+-------------------------------------+------------------------------------------------------------+
| Field                               | Value                                                      |
+-------------------------------------+------------------------------------------------------------+
| OS-DCF:diskConfig                   | MANUAL                                                     |
| OS-EXT-AZ:availability_zone         |                                                            |
| OS-EXT-SRV-ATTR:host                | None                                                       |
| OS-EXT-SRV-ATTR:hypervisor_hostname | None                                                       |
| OS-EXT-SRV-ATTR:instance_name       |                                                            |
| OS-EXT-STS:power_state              | NOSTATE                                                    |
| OS-EXT-STS:task_state               | scheduling                                                 |
| OS-EXT-STS:vm_state                 | building                                                   |
| OS-SRV-USG:launched_at              | None                                                       |
| OS-SRV-USG:terminated_at            | None                                                       |
| accessIPv4                          |                                                            |
| accessIPv6                          |                                                            |
| addresses                           |                                                            |
| adminPass                           | iJBoBaj53oUd                                               |
| config_drive                        |                                                            |
| created                             | 2018-07-02T21:02:01Z                                       |
| flavor                              | m1.small (2)                                               |
| hostId                              |                                                            |
| id                                  | ce7689ff-23bf-4fe9-b2a9-922d4aa9412c                       |
| image                               | cirros-0.4.0-x86_64 (f751c39b-f1e3-4f02-8332-3886826889ba) |
| key_name                            | None                                                       |
| name                                | server_6                                                   |
| progress                            | 0                                                          |
| project_id                          | cca416004124432592b2949a5c5d9949                           |
| properties                          |                                                            |
| security_groups                     | name='default'                                             |
| status                              | BUILD                                                      |
| updated                             | 2018-07-02T21:02:01Z                                       |
| user_id                             | 8cb1168776d24390b44c3aaa0720b532                           |
| volumes_attached                    |                                                            |
+-------------------------------------+------------------------------------------------------------+
ardana > openstack server list
+--------------------------------------+----------+--------+------------+---------------------+----------+
| ID                                   | Name     | Status | Networks   | Image               | Flavor   |
+--------------------------------------+----------+--------+------------+---------------------+----------+
| ce7689ff-23bf-4fe9-b2a9-922d4aa9412c | server_6 | ACTIVE | n1=1.1.1.8 | cirros-0.4.0-x86_64 | m1.small |
ardana > openstack server delete ce7689ff-23bf-4fe9-b2a9-922d4aa9412c
13.2.2 Unplanned Control Plane Maintenance #
Unplanned maintenance tasks for controller nodes such as recovery from power failure.
13.2.2.1 Restarting Controller Nodes After a Reboot #
Steps to follow if one or more of your controller nodes loses network connectivity or power, or is rebooted or needs hardware maintenance.
When a controller node is rebooted, needs hardware maintenance, loses network connectivity, or loses power, these steps will help you recover the node.
These steps may also be used if the Host Status (ping) alarm is triggered for one or more of your controller nodes.
13.2.2.1.1 Prerequisites #
The following conditions must be true in order to perform these steps successfully:
Each of your controller nodes should be powered on.
Each of your controller nodes should have network connectivity, verified by SSH connectivity from the Cloud Lifecycle Manager to them.
The operator who performs these steps will need access to the lifecycle manager.
13.2.2.1.2 Recovering the MariaDB Database #
The recovery process for your MariaDB database cluster depends on how many of your controller nodes need to be recovered. We will cover two scenarios:
Scenario 1: Recovering one or two of your controller nodes but not the entire cluster
To recover one or two of your controller nodes but not the entire cluster, use these steps:
Ensure the controller nodes have power and are booted to the command prompt.
If the MariaDB service is not started, start it with this command:
sudo service mysql start
If MariaDB fails to start, proceed to the next section which covers the bootstrap process.
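Before falling back to the bootstrap scenario, it helps to confirm whether a mysqld daemon is actually running on a node. The sketch below is illustrative only: the is_mysqld_running helper and the sample process list are not part of the product, but show the shape of such a check.

```shell
# Hypothetical helper: succeeds if "mysqld" appears in captured
# `ps -e -o comm=` output passed as the first argument.
is_mysqld_running() {
  printf '%s\n' "$1" | grep -qx 'mysqld'
}

# Sample process list (made up for illustration).
ps_output="systemd
sshd
mysqld"

if is_mysqld_running "$ps_output"; then
  echo "mysqld is running"
else
  echo "mysqld is not running"
fi
```

On a real node you would feed the helper live `ps -e -o comm=` output instead of a canned string.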
Scenario 2: Recovering the entire controller cluster with the bootstrap playbook
If the scenario above failed or if you need to recover your entire control plane cluster, use the process below to recover the MariaDB database.
Make sure no mysqld daemon is running on any node in the cluster before you continue with the steps in this procedure. If there is a mysqld daemon running, then use the command below to shut down the daemon.

sudo systemctl stop mysql
If the mysqld daemon does not go down following the service stop, then kill the daemon using kill -9 before continuing.

On the deployer node, execute the galera-bootstrap.yml playbook, which will automatically determine the log sequence number, bootstrap the main node, and start the database cluster:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts galera-bootstrap.yml
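The playbook's "determine the log sequence number" step amounts to picking the node whose grastate.dat holds the highest seqno, since that node has the most recent committed state. A rough sketch of that selection, with made-up node names and seqno values (the playbook does this for you):

```shell
# Illustrative table: node name and the seqno read from that node's
# /var/lib/mysql/grastate.dat (all values are made up).
cat > /tmp/seqnos.txt <<'EOF'
ardana-cp1-c1-m1 1045
ardana-cp1-c1-m2 1047
ardana-cp1-c1-m3 1046
EOF

# The node with the highest seqno is the one a Galera bootstrap
# should start from.
awk '$2 > max { max = $2; node = $1 } END { print node }' /tmp/seqnos.txt
```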
13.2.2.1.3 Restarting Services on the Controller Nodes #
From the Cloud Lifecycle Manager you should execute the
ardana-start.yml playbook for each node that was brought
down so the services can be started back up.
If you have a dedicated (separate) Cloud Lifecycle Manager node you can use this syntax:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit=<hostname_of_node>
If you have a shared Cloud Lifecycle Manager/controller setup and need to restart
services on this shared node, you can use localhost to
indicate the shared node, like this:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit=<hostname_of_node>,localhost
If you leave off the --limit switch, the playbook will
be run against all nodes.
13.2.2.1.4 Restart the Monitoring Agents #
As part of the recovery process, you should also restart the monasca-agent. These steps will show you how:
Log in to the Cloud Lifecycle Manager.
Stop the monasca-agent:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts monasca-agent-stop.yml

Restart the monasca-agent:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts monasca-agent-start.yml

You can then confirm the status of the monasca-agent with this playbook:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts monasca-agent-status.yml
13.2.2.2 Recovering the Control Plane #
If one or more of your controller nodes has experienced data or disk corruption due to power loss or hardware failure and you need to perform disaster recovery, use the scenarios below to recover your cloud.
You should have backed up /etc/group of the Cloud Lifecycle Manager
manually after installation. While recovering a Cloud Lifecycle Manager node, manually copy
the /etc/group file from a backup of the old Cloud Lifecycle Manager.
13.2.2.2.1 Point-in-Time MariaDB Database Recovery #
In this scenario, everything is still running (Cloud Lifecycle Manager, cloud controller nodes, and compute nodes) but you want to restore the MariaDB database to a previous state.
13.2.2.2.1.1 Restore from a Swift backup #
Log in to the Cloud Lifecycle Manager.
Determine which node is the first host member in the FND-MDB group; this will be the first node hosting the MariaDB service in your cloud. You can do this by using these commands:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > grep -A1 "FND-MDB--first-member" hosts/verb_hosts

The result will be similar to the following example:

[FND-MDB--first-member:children]
ardana002-cp1-c1-m1

In this example, the host name of the node is ardana002-cp1-c1-m1.

Find the host IP address, which will be used to log in:

ardana > cat /etc/hosts | grep ardana002-cp1-c1-m1
10.84.43.82 ardana002-cp1-c1-m1-extapi ardana002-cp1-c1-m1-extapi
192.168.24.21 ardana002-cp1-c1-m1-mgmt ardana002-cp1-c1-m1-mgmt
10.1.2.1 ardana002-cp1-c1-m1-guest ardana002-cp1-c1-m1-guest
10.84.65.3 ardana002-cp1-c1-m1-EXTERNAL-VM ardana002-cp1-c1-m1-external-vm

In this example, 192.168.24.21 is the IP address for the host.

SSH into the host.
ardana > ssh ardana@192.168.24.21

Source the backup file:

ardana > source /var/lib/ardana/backup.osrc

Find the Client ID for the host name from the beginning of this procedure (ardana002-cp1-c1-m1 in this example):

ardana > freezer client-list
+-----------------------------+----------------------------------+-----------------------------+-------------+
| Client ID                   | uuid                             | hostname                    | description |
+-----------------------------+----------------------------------+-----------------------------+-------------+
| ardana002-cp1-comp0001-mgmt | f4d9cfe0725145fb91aaf95c80831dd6 | ardana002-cp1-comp0001-mgmt |             |
| ardana002-cp1-comp0002-mgmt | 55c93eb7d609467a8287f175a2275219 | ardana002-cp1-comp0002-mgmt |             |
| ardana002-cp1-c0-m1-mgmt    | 50d26318e81a408e97d1b6639b9404b2 | ardana002-cp1-c0-m1-mgmt    |             |
| ardana002-cp1-c1-m1-mgmt    | 78fe921473914bf6a802ad360c09d35b | ardana002-cp1-c1-m1-mgmt    |             |
| ardana002-cp1-c1-m2-mgmt    | b2e9a4305c4b4272acf044e3f89d327f | ardana002-cp1-c1-m2-mgmt    |             |
| ardana002-cp1-c1-m3-mgmt    | a3ceb80b8212425687dd11a92c8bc48e | ardana002-cp1-c1-m3-mgmt    |             |
+-----------------------------+----------------------------------+-----------------------------+-------------+

In this example, the hostname and the Client ID are the same: ardana002-cp1-c1-m1-mgmt.

List the jobs:

ardana > freezer job-list -C CLIENT_ID

Using the example in the previous step:

ardana > freezer job-list -C ardana002-cp1-c1-m1-mgmt

Get the corresponding job id for Ardana Default: mysql restore from Swift.

Launch the restore process with:

ardana > freezer job-start JOB-ID

This will take some time. You can follow the progress by running tail -f /var/log/freezer/freezer-scheduler.log. Wait until the restore job is finished before doing the next step.

Log in to the Cloud Lifecycle Manager.
Stop the MariaDB service:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts percona-stop.yml

Log back in to the first node running the MariaDB service, the same node as in Step 3.

Clean the MariaDB directory using this command:

tux > sudo rm -r /var/lib/mysql/*

Copy the restored files back to the MariaDB directory:

tux > sudo cp -pr /tmp/mysql_restore/* /var/lib/mysql

Log in to each of the other nodes in your MariaDB cluster, which were determined in Step 3, and remove the grastate.dat file from each of them:

tux > sudo rm /var/lib/mysql/grastate.dat

Warning: Do not remove this file from the first node in your MariaDB cluster. Ensure you only do this from the other cluster nodes.

Log back in to the Cloud Lifecycle Manager.

Start the MariaDB service:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts galera-bootstrap.yml
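Several lookups in the procedure above are plain text searches, for example turning the first MariaDB node's name into its management IP from /etc/hosts. A small sketch against sample data (the file path and entries below are illustrative, mirroring the example output in the text):

```shell
# Sample /etc/hosts fragment matching the example in the procedure.
cat > /tmp/hosts.sample <<'EOF'
10.84.43.82 ardana002-cp1-c1-m1-extapi
192.168.24.21 ardana002-cp1-c1-m1-mgmt
10.1.2.1 ardana002-cp1-c1-m1-guest
EOF

# Print the IP of the -mgmt alias for the first MariaDB node.
awk '$2 == "ardana002-cp1-c1-m1-mgmt" { print $1 }' /tmp/hosts.sample
```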
13.2.2.2.1.2 Restore from an SSH backup #
Follow the same procedure as the one for Swift but select the job
Ardana Default: mysql restore from SSH.
13.2.2.2.1.3 Restore MariaDB manually #
If restoring MariaDB fails during the procedure outlined above, you can follow this procedure to manually restore MariaDB:
Log in to the Cloud Lifecycle Manager.
Stop the MariaDB cluster:
ardana >cd ~/scratch/ansible/next/ardana/ansibleardana >ansible-playbook -i hosts/verb_hosts percona-stop.ymlOn all of the nodes running the MariaDB service, which should be all of your controller nodes, run the following command to purge the old database:
tux >sudo rm -r /var/lib/mysql/*On the first node running the MariaDB service restore the backup with the command below. If you have already restored to a temporary directory, copy the files again.
tux >sudo cp -pr /tmp/mysql_restore/* /var/lib/mysqlIf you need to restore the files manually from SSH, follow these steps:
Create the /root/mysql_restore.ini file with the contents below. Be careful to substitute the {{ values }}. Note that the SSH information refers to the SSH server you configured for backup before installing.

[default]
action = restore
storage = ssh
ssh_host = {{ freezer_ssh_host }}
ssh_username = {{ freezer_ssh_username }}
container = {{ freezer_ssh_base_dir }}/freezer_mysql_backup
ssh_key = /etc/freezer/ssh_key
backup_name = freezer_mysql_backup
restore_abs_path = /var/lib/mysql/
log_file = /var/log/freezer-agent/freezer-agent.log
hostname = {{ hostname of the first MariaDB node }}

Execute the restore job:

ardana > freezer-agent --config /root/mysql_restore.ini
Log back in to the Cloud Lifecycle Manager.

Start the MariaDB service:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts galera-bootstrap.yml

After approximately 10-15 minutes, the output of the percona-status.yml playbook should show all the MariaDB nodes in sync. MariaDB cluster status can be checked using this playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts percona-status.yml

An example output is as follows:

TASK: [FND-MDB | status | Report status of "{{ mysql_service }}"] *************
ok: [ardana-cp1-c1-m1-mgmt] => {
    "msg": "mysql is synced."
}
ok: [ardana-cp1-c1-m2-mgmt] => {
    "msg": "mysql is synced."
}
ok: [ardana-cp1-c1-m3-mgmt] => {
    "msg": "mysql is synced."
}
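As an illustration of the mysql_restore.ini step above, here is a filled-in version written via a here-document. Every substituted value (host, user, base directory, hostname) is hypothetical; adjust them to the SSH backup server you actually configured before installing.

```shell
# All substituted values below are made-up examples, not real settings.
cat > /tmp/mysql_restore.ini <<'EOF'
[default]
action = restore
storage = ssh
ssh_host = backup.example.com
ssh_username = backupuser
container = /backups/freezer_mysql_backup
ssh_key = /etc/freezer/ssh_key
backup_name = freezer_mysql_backup
restore_abs_path = /var/lib/mysql/
log_file = /var/log/freezer-agent/freezer-agent.log
hostname = ardana002-cp1-c1-m1-mgmt
EOF

# Quick sanity check: count the lines written.
grep -c '^' /tmp/mysql_restore.ini
```

In the real procedure the file lives at /root/mysql_restore.ini and is passed to freezer-agent --config.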
13.2.2.2.1.4 Point-in-Time Cassandra Recovery #
A node may have been removed either due to an intentional action in the Cloud Lifecycle Manager Admin UI or as a result of a fatal hardware event that requires a server to be replaced. In either case, the entry for the failed or deleted node should be removed from Cassandra before a new node is brought up.
The following steps should be taken before enabling and deploying the replacement node.
Determine the IP address of the node that was removed or is being replaced.
On one of the functional Cassandra control plane nodes, log in as the ardana user.

Run the command nodetool status to display a list of Cassandra nodes.

If the removed node is not in the list (no IP address matches that of the removed node), skip the next step.

If the node that was removed is still in the list, copy its node ID.

Run the command nodetool removenode ID.
After any obsolete node entries have been removed, the replacement node can be deployed as usual (for more information, see Section 13.1.2, “Planned Control Plane Maintenance”). The new Cassandra node will be able to join the cluster and replicate data.
For more information, please consult the Cassandra documentation.
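The "copy its node ID" step above can be scripted: in nodetool status output a down node is flagged DN, and its Host ID is the second-to-last column. A sketch over made-up output (addresses and UUIDs are invented):

```shell
# Made-up `nodetool status` rows: status, address, load, tokens, owns,
# host ID, rack. "DN" marks a Down/Normal node.
cat > /tmp/nodetool.sample <<'EOF'
UN  192.168.24.21  1.2 MB  256  ?  11111111-1111-1111-1111-111111111111  rack1
DN  192.168.24.22  1.1 MB  256  ?  22222222-2222-2222-2222-222222222222  rack1
UN  192.168.24.23  1.3 MB  256  ?  33333333-3333-3333-3333-333333333333  rack1
EOF

# Host ID of the down node, suitable for `nodetool removenode ID`.
# (The load "1.1 MB" splits into two fields, so the host ID is field 7.)
awk '$1 == "DN" { print $7 }' /tmp/nodetool.sample
```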
13.2.2.2.2 Point-in-Time Swift Rings Recovery #
In this situation, everything is still running (Cloud Lifecycle Manager, control plane nodes, and compute nodes) but you want to restore your Swift rings to a previous state.
Freezer backs up and restores Swift rings only, not Swift data.
13.2.2.2.2.1 Restore from a Swift backup #
Log in to the first Swift Proxy (SWF-PRX[0]) node.

To find the first Swift Proxy node, run the following on the Cloud Lifecycle Manager:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts swift-status.yml --limit SWF-PRX[0]

At the end of the output, you will see something like the following example:

...
Jun 18 20:01:49 ardana-qe102-cp1-c1-m1-mgmt swiftlm-uptime-mon[3985]: 'uptime-mon - INFO : Metric:keystone-get-token:max-latency: 0.679254770279 (at 1529352109.66)'
Jun 18 20:01:49 ardana-qe102-cp1-c1-m1-mgmt swiftlm-uptime-mon[3985]: 'uptime-mon - INFO : Metric:keystone-get-token:avg-latency: 0.679254770279 (at 1529352109.66)'

PLAY RECAP ********************************************************************
ardana-qe102-cp1-c1-m1     : ok=12   changed=0    unreachable=0    failed=0

Find the first node name and its IP address. For example:

ardana > cat /etc/hosts | grep ardana-qe102-cp1-c1-m1
Source the backup environment file:

ardana > source /var/lib/ardana/backup.osrc

Find the Client ID:

ardana > freezer client-list
+-----------------------------+----------------------------------+-----------------------------+-------------+
| Client ID                   | uuid                             | hostname                    | description |
+-----------------------------+----------------------------------+-----------------------------+-------------+
| ardana002-cp1-comp0001-mgmt | f4d9cfe0725145fb91aaf95c80831dd6 | ardana002-cp1-comp0001-mgmt |             |
| ardana002-cp1-comp0002-mgmt | 55c93eb7d609467a8287f175a2275219 | ardana002-cp1-comp0002-mgmt |             |
| ardana002-cp1-c0-m1-mgmt    | 50d26318e81a408e97d1b6639b9404b2 | ardana002-cp1-c0-m1-mgmt    |             |
| ardana002-cp1-c1-m1-mgmt    | 78fe921473914bf6a802ad360c09d35b | ardana002-cp1-c1-m1-mgmt    |             |
| ardana002-cp1-c1-m2-mgmt    | b2e9a4305c4b4272acf044e3f89d327f | ardana002-cp1-c1-m2-mgmt    |             |
| ardana002-cp1-c1-m3-mgmt    | a3ceb80b8212425687dd11a92c8bc48e | ardana002-cp1-c1-m3-mgmt    |             |
+-----------------------------+----------------------------------+-----------------------------+-------------+

In this example, the hostname and the Client ID are the same: ardana002-cp1-c1-m1-mgmt.

List the jobs:

ardana > freezer job-list -C CLIENT_ID

Using the example in the previous step:

ardana > freezer job-list -C ardana002-cp1-c1-m1-mgmt

Get the corresponding job id for Ardana Default: swift restore from Swift in the Description column.

Launch the restore job:

ardana > freezer job-start JOB-ID

This will take some time. You can follow the progress by running tail -f /var/log/freezer/freezer-scheduler.log. Wait until the restore job is finished before doing the next step.

Log in to the Cloud Lifecycle Manager.
Stop the Swift service:
ardana >cd ~/scratch/ansible/next/ardana/ansibleardana >ansible-playbook -i hosts/verb_hosts swift-stop.ymlLog back in to the first Swift Proxy (
SWF-PRX[0]) node, which was determined in Step 1.Copy the restored files.
tux >sudo cp -pr /tmp/swift_builder_dir_restore/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/* \ /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/For example
tux >sudo cp -pr /tmp/swift_builder_dir_restore/entry-scale-kvm/control-plane-1/builder_dir/* \ /etc/swiftlm/entry-scale-kvm/control-plane-1/builder_dir/Log back in to the Cloud Lifecycle Manager.
Reconfigure the Swift service:\
ardana >cd ~/scratch/ansible/next/ardana/ansibleardana >ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml
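The copy step above always pairs the same CLOUD_NAME/CONTROL_PLANE_NAME on both sides, so parameterizing the names makes it harder to mistype one side. A sketch (the variable names and the echo preview are mine; the values are the example ones from the text):

```shell
# Example values from the text; substitute your own cloud and
# control plane names.
CLOUD_NAME=entry-scale-kvm
CONTROL_PLANE_NAME=control-plane-1

src="/tmp/swift_builder_dir_restore/$CLOUD_NAME/$CONTROL_PLANE_NAME/builder_dir"
dst="/etc/swiftlm/$CLOUD_NAME/$CONTROL_PLANE_NAME/builder_dir"

# Preview the copy command instead of running it here.
echo "sudo cp -pr $src/* $dst/"
```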
13.2.2.2.2.2 Restore from an SSH backup #
Follow almost the same procedure as for Swift in the section immediately
preceding this one: Section 13.2.2.2.2.1, “Restore from a Swift backup”. The only change is
that the restore job uses a different job id. Get the corresponding job id
for Ardana Default: Swift restore from SSH in the
Description column.
13.2.2.2.3 Point-in-time Cloud Lifecycle Manager Recovery #
In this scenario, everything is still running (Cloud Lifecycle Manager, controller nodes, and compute nodes) but you want to restore the Cloud Lifecycle Manager to a previous state.
Log in to the Cloud Lifecycle Manager.
Source the backup environment file:

tux > source /var/lib/ardana/backup.osrc

Find the Client ID:

tux > freezer client-list
+-----------------------------+----------------------------------+-----------------------------+-------------+
| Client ID                   | uuid                             | hostname                    | description |
+-----------------------------+----------------------------------+-----------------------------+-------------+
| ardana002-cp1-comp0001-mgmt | f4d9cfe0725145fb91aaf95c80831dd6 | ardana002-cp1-comp0001-mgmt |             |
| ardana002-cp1-comp0002-mgmt | 55c93eb7d609467a8287f175a2275219 | ardana002-cp1-comp0002-mgmt |             |
| ardana002-cp1-c0-m1-mgmt    | 50d26318e81a408e97d1b6639b9404b2 | ardana002-cp1-c0-m1-mgmt    |             |
| ardana002-cp1-c1-m1-mgmt    | 78fe921473914bf6a802ad360c09d35b | ardana002-cp1-c1-m1-mgmt    |             |
| ardana002-cp1-c1-m2-mgmt    | b2e9a4305c4b4272acf044e3f89d327f | ardana002-cp1-c1-m2-mgmt    |             |
| ardana002-cp1-c1-m3-mgmt    | a3ceb80b8212425687dd11a92c8bc48e | ardana002-cp1-c1-m3-mgmt    |             |
+-----------------------------+----------------------------------+-----------------------------+-------------+

In this example, the hostname and the Client ID are the same: ardana002-cp1-c1-m1-mgmt.

List the jobs:

tux > freezer job-list -C CLIENT_ID

Using the example in the previous step:

tux > freezer job-list -C ardana002-cp1-c1-m1-mgmt

Find the correct job ID:

For SSH backups, get the id corresponding to the job Ardana Default: deployer restore from SSH.

For Swift backups, get the id corresponding to the job Ardana Default: deployer restore from Swift.

Stop the Dayzero UI:

tux > sudo systemctl stop dayzero

Launch the restore job:

tux > freezer job-start JOB-ID

This will take some time. You can follow the progress by running tail -f /var/log/freezer/freezer-scheduler.log. Wait until the restore job is finished before doing the next step.

Start the Dayzero UI:

tux > sudo systemctl start dayzero
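Picking the right job id out of freezer job-list can also be scripted by matching on the Description column of the table output. A sketch over a made-up two-row table (the ids are invented, the descriptions match the ones named in the text):

```shell
# Made-up `freezer job-list` rows: | job id | description |
cat > /tmp/jobs.sample <<'EOF'
| 6a0c1de2f3a44b55 | Ardana Default: deployer backup to SSH    |
| 9b8d7c6e5f4a3b21 | Ardana Default: deployer restore from SSH |
EOF

# Print the id whose description mentions the restore job.
awk -F'|' '$3 ~ /deployer restore from SSH/ { gsub(/ /, "", $2); print $2 }' /tmp/jobs.sample
```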
13.2.2.2.4 Cloud Lifecycle Manager Disaster Recovery #
In this scenario everything is still running (controller nodes and compute nodes) but you have lost either a dedicated Cloud Lifecycle Manager or a shared Cloud Lifecycle Manager/controller node.
To ensure that you use the same version of SUSE OpenStack Cloud that you previously had loaded on your Cloud Lifecycle Manager, you will need to download and install the lifecycle management software using the instructions from the Book “Installing with Cloud Lifecycle Manager”, Chapter 3 “Installing the Cloud Lifecycle Manager server”, Section 3.5.2 “Installing the SUSE OpenStack Cloud Extension” before proceeding further.
13.2.2.2.4.1 Restore from a Swift backup #
Log in to the Cloud Lifecycle Manager.
Install the freezer-agent using the following playbook:

ardana > cd ~/openstack/ardana/ansible/
ardana > ansible-playbook -i hosts/localhost _deployer_restore_helper.yml

Access one of the other controller or compute nodes in your environment to perform the following steps:

Retrieve the /var/lib/ardana/backup.osrc file and copy it to the /var/lib/ardana/ directory on the Cloud Lifecycle Manager.

Copy all the files in the /opt/stack/service/freezer-api/etc/ directory to the same directory on the Cloud Lifecycle Manager.

Copy all the files in the /var/lib/ca-certificates directory to the same directory on the Cloud Lifecycle Manager.

Retrieve the /etc/hosts file and replace the one found on the Cloud Lifecycle Manager.
Log back in to the Cloud Lifecycle Manager.
Edit the value for client_id in the following file to contain the hostname of your Cloud Lifecycle Manager:

/opt/stack/service/freezer-api/etc/freezer-api.conf

Update your ca-certificates:

sudo update-ca-certificates

Edit the /etc/hosts file, ensuring you edit the 127.0.0.1 line so it points to ardana:

127.0.0.1 localhost ardana
::1 localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
On the Cloud Lifecycle Manager, source the backup user credentials:

ardana > source ~/backup.osrc

Find the Client ID (ardana002-cp1-c0-m1-mgmt in this example) for the host name, as done in previous procedures (see Procedure 13.1, “Restoring from a Swift or SSH Backup”):

ardana > freezer client-list
+-----------------------------+----------------------------------+-----------------------------+-------------+
| Client ID                   | uuid                             | hostname                    | description |
+-----------------------------+----------------------------------+-----------------------------+-------------+
| ardana002-cp1-comp0001-mgmt | f4d9cfe0725145fb91aaf95c80831dd6 | ardana002-cp1-comp0001-mgmt |             |
| ardana002-cp1-comp0002-mgmt | 55c93eb7d609467a8287f175a2275219 | ardana002-cp1-comp0002-mgmt |             |
| ardana002-cp1-c0-m1-mgmt    | 50d26318e81a408e97d1b6639b9404b2 | ardana002-cp1-c0-m1-mgmt    |             |
| ardana002-cp1-c1-m1-mgmt    | 78fe921473914bf6a802ad360c09d35b | ardana002-cp1-c1-m1-mgmt    |             |
| ardana002-cp1-c1-m2-mgmt    | b2e9a4305c4b4272acf044e3f89d327f | ardana002-cp1-c1-m2-mgmt    |             |
| ardana002-cp1-c1-m3-mgmt    | a3ceb80b8212425687dd11a92c8bc48e | ardana002-cp1-c1-m3-mgmt    |             |
+-----------------------------+----------------------------------+-----------------------------+-------------+

In this example, the hostname and the Client ID are the same: ardana002-cp1-c0-m1-mgmt.

List the Freezer jobs:

ardana > freezer job-list -C CLIENT_ID

Using the example in the previous step:

ardana > freezer job-list -C ardana002-cp1-c0-m1-mgmt

Get the id of the job corresponding to Ardana Default: deployer backup to Swift. Stop that job so the freezer scheduler does not begin making backups when started:

ardana > freezer job-stop JOB-ID

If it is present, also stop the Cloud Lifecycle Manager's SSH backup job.

Start the freezer scheduler:

sudo systemctl start openstack-freezer-scheduler

Get the id of the job corresponding to Ardana Default: deployer restore from Swift and launch that job:

ardana > freezer job-start JOB-ID

This will take some time. You can follow the progress by running tail -f /var/log/freezer/freezer-scheduler.log. Wait until the restore job is finished before doing the next step.

When the job completes, the previous Cloud Lifecycle Manager contents should be restored to your home directory:
ardana > cd ~
ardana > ls

If you are using Cobbler, restore your Cobbler configuration with these steps:

Remove the following files:

sudo rm -rf /var/lib/cobbler
sudo rm -rf /srv/www/cobbler

Deploy Cobbler:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost cobbler-deploy.yml

Set the netboot-enabled flag for each of your nodes with this command:

for h in $(sudo cobbler system list)
do
  sudo cobbler system edit --name=$h --netboot-enabled=0
done

Update your deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready_deployment.yml

If you are using a dedicated Cloud Lifecycle Manager, re-run the deployment to ensure the Cloud Lifecycle Manager is in the correct state:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts site.yml --limit localhost
If you are using a shared Cloud Lifecycle Manager/controller, follow these steps:

If the node is also a Cloud Lifecycle Manager hypervisor, run the following commands to recreate the virtual machines that were lost:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-hypervisor-setup.yml --limit <this node>

If the node that was lost (or one of the VMs that it hosts) was a member of the RabbitMQ cluster, then you need to remove the record of the old node by running the following command on any one of the other cluster members. In this example the nodes are called cloud-cp1-rmq-mysql-m*-mgmt, but you need to use the correct names for your system, which you can find in /etc/hosts:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ssh cloud-cp1-rmq-mysql-m3-mgmt sudo rabbitmqctl forget_cluster_node \
rabbit@cloud-cp1-rmq-mysql-m1-mgmt

Run site.yml against the complete cloud to reinstall and rebuild the services that were lost. If you replaced one of the RabbitMQ cluster members, you will need to add the -e flag shown below to nominate a new master node for the cluster; otherwise you can omit it:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts site.yml -e \
rabbit_primary_hostname=cloud-cp1-rmq-mysql-m3
13.2.2.2.4.2 Restore from an SSH backup #
On the Cloud Lifecycle Manager, edit the following file so it contains the same information as it did previously:

~/openstack/my_cloud/config/freezer/ssh_credentials.yml

On the Cloud Lifecycle Manager, copy the following files, change directories, and run the _deployer_restore_helper.yml playbook:

ardana > cp -r ~/hp-ci/openstack/* ~/openstack/my_cloud/definition/
ardana > cd ~/openstack/ardana/ansible/
ardana > ansible-playbook -i hosts/localhost _deployer_restore_helper.yml

Perform the restore. First become root and change directories:

ardana > sudo su
root # cd /root/deployer_restore_helper/

Execute the restore job:

root # ./deployer_restore_script.sh

Update your deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready_deployment.yml

When the Cloud Lifecycle Manager is restored, re-run the deployment to ensure the Cloud Lifecycle Manager is in the correct state:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts site.yml --limit localhost
13.2.2.2.5 One or Two Controller Node Disaster Recovery #
This scenario makes the following assumptions:
Your Cloud Lifecycle Manager is still intact and working.
One or two of your controller nodes went down, but not the entire cluster.
The node needs to be rebuilt from scratch, not simply rebooted.
13.2.2.2.5.1 Steps to recovering one or two controller nodes #
Ensure that your node has power and all of the hardware is functioning.
Log in to the Cloud Lifecycle Manager.
Verify that all of the information in your ~/openstack/my_cloud/definition/data/servers.yml file is correct for your controller node. You may need to replace the existing information if you had to replace either your entire controller node or just pieces of it.

If you made changes to your servers.yml file, then commit those changes to your local git:

ardana > git add -A
ardana > git commit -a -m "editing controller information"

Run the configuration processor:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

Update your deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

Ensure that Cobbler has the correct system information:
If you replaced your controller node with a completely new machine, you need to verify that Cobbler has the correct list of controller nodes:

ardana > sudo cobbler system list

Remove any controller nodes from Cobbler that no longer exist:

ardana > sudo cobbler system remove --name=<node>

Add the new node into Cobbler:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost cobbler-deploy.yml

Then you can image the node:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<node_name>

Note: If you do not know the <node_name> already, you can get it by using sudo cobbler system list.

Before proceeding, you may want to take a look at info/server_info.yml to see if the assignment of the node you have added is what you expect. It may not be, as nodes will not be numbered consecutively if any have previously been removed. This is to prevent loss of data; the configuration processor retains data about removed nodes and keeps their ID numbers from being reallocated. See the Persisted Server Allocations section for information on how this works.

[OPTIONAL] Run the wipe_disks.yml playbook to ensure all of your non-OS partitions on your nodes are completely wiped prior to continuing with the installation. The wipe_disks.yml playbook is only meant to be run on systems immediately after running bm-reimage.yml. If used for any other case, it may not wipe all of the expected partitions.

ardana > cd ~/scratch/ansible/next/ardana/ansible/
ardana > ansible-playbook -i hosts/verb_hosts wipe_disks.yml --limit <controller_node_hostname>

Complete the rebuilding of your controller node with the two playbooks below:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts osconfig-run.yml -e rebuild=True --limit=<controller_node_hostname>
ardana > ansible-playbook -i hosts/verb_hosts ardana-deploy.yml -e rebuild=True --limit=<controller_node_hostname>
13.2.2.2.6 Three Control Plane Node Disaster Recovery #
In this scenario, all control plane nodes are destroyed which need to be rebuilt or replaced.
13.2.2.2.6.1 Restore from a Swift backup: #
Restoring from a Swift backup is not possible because Swift is gone.
13.2.2.2.6.2 Restore from an SSH backup #
Log in to the Cloud Lifecycle Manager.
Disable the default backup job(s) by editing the following file:

~/scratch/ansible/next/ardana/ansible/roles/freezer-jobs/defaults/activate.yml

Set the value for freezer_create_backup_jobs to false:

# If set to false, we won't create backup jobs.
freezer_create_backup_jobs: false

Deploy the control plane nodes, using the values for your control plane node hostnames:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts site.yml --limit \
CONTROL_PLANE_HOSTNAME1,CONTROL_PLANE_HOSTNAME2,CONTROL_PLANE_HOSTNAME3 \
-e rebuild=True

For example, if you were using the default values from the example model files, your command would look like this:

ardana > ansible-playbook -i hosts/verb_hosts site.yml \
--limit ardana-ccp-c1-m1-mgmt,ardana-ccp-c1-m2-mgmt,ardana-ccp-c1-m3-mgmt \
-e rebuild=True

Note: The -e rebuild=True is only used on a single control plane node when there are other controllers available to pull configuration data from. This will cause the MariaDB database to be reinitialized, which is the only choice if there are no additional control nodes.

Restore the MariaDB backup on the first controller node.
List the Freezer jobs:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > freezer job-list -C FIRST_CONTROLLER_NODE

Run the Ardana Default: mysql restore from SSH job for your first controller node, replacing JOB_ID with the id for that job:

ardana > freezer job-start JOB_ID

You can monitor the restore job by connecting to your first controller node via SSH and running the following commands:

ardana > ssh FIRST_CONTROLLER_NODE
ardana > sudo su
root # tail -n 100 /var/log/freezer/freezer-scheduler.log

Log back in to the Cloud Lifecycle Manager.

Stop MySQL:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts percona-stop.yml

Log back in to the first controller node and move the following files:

ardana > ssh FIRST_CONTROLLER_NODE
ardana > sudo su
root # rm -rf /var/lib/mysql/*
root # cp -pr /tmp/mysql_restore/* /var/lib/mysql/

Log back in to the Cloud Lifecycle Manager and bootstrap MySQL:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts galera-bootstrap.yml

Verify the status of MySQL:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts percona-status.yml

Re-enable the default backup job(s) by editing the following file:
~/scratch/ansible/next/ardana/ansible/roles/freezer-jobs/defaults/activate.yml
Set the value for freezer_create_backup_jobs to true:

# If set to false, we won't create backup jobs.
freezer_create_backup_jobs: true

Run this playbook to deploy the backup jobs:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts _freezer_manage_jobs.yml
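To confirm the percona-status.yml verification in the steps above succeeded across all three controllers, you can count the synced messages in captured playbook output. The sample lines below are illustrative:

```shell
# Sample percona-status.yml output lines (illustrative).
cat > /tmp/percona.sample <<'EOF'
ok: [ardana-ccp-c1-m1-mgmt] => { "msg": "mysql is synced." }
ok: [ardana-ccp-c1-m2-mgmt] => { "msg": "mysql is synced." }
ok: [ardana-ccp-c1-m3-mgmt] => { "msg": "mysql is synced." }
EOF

# All three controllers should report synced.
grep -c 'mysql is synced' /tmp/percona.sample
```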
13.2.2.2.7 Swift Rings Recovery #
To recover your Swift rings in the event of a disaster, follow the procedure that applies to your situation: either recover the rings from one Swift node if possible, or use the SSH backup that you have set up.
13.2.2.2.7.1 Restore from the Swift deployment backup #
13.2.2.2.7.2 Restore from the SSH Freezer backup #
In the very specific use case where you lost all system disks of all object nodes, and the Swift proxy nodes are corrupted, you can recover the rings because a copy of the Swift rings is stored in Freezer. This means that the Swift data is still there (the disks used by Swift must still be accessible).
Recover the rings with these steps.
Log in to a node that has the freezer-agent installed.
Become root:
ardana >sudo suCreate the temporary directory to restore your files to:
root #mkdir /tmp/swift_builder_dir_restore/Create a restore file with the following content:
root # cat << EOF > ./restore_config.ini
[default]
action = restore
storage = ssh
compression = bzip2
restore_abs_path = /tmp/swift_builder_dir_restore/
ssh_key = /etc/freezer/ssh_key
ssh_host = <freezer_ssh_host>
ssh_port = <freezer_ssh_port>
ssh_username = <freezer_ssh_username>
container = <freezer_ssh_base_dir>/freezer_swift_backup
backup_name = freezer_swift_builder_backup
hostname = <hostname of the old first Swift proxy (SWF-PRX[0])>
EOF

Edit the file and replace all <tags> with the right information:

vim ./restore_config.ini
You will also need to put the SSH key used to do the backups in /etc/freezer/ssh_key and remember to set the right permissions: 600.
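This key placement step can be scripted. A minimal sketch — it stages into a temporary directory so it can be tried safely; in production the destination is /etc/freezer/ssh_key, and the key material comes from your backup key, not the placeholder used here:

```shell
# Sketch: install the backup SSH key with mode 600.
# DEST would normally be /etc/freezer/ssh_key; a temp dir is used here for safety.
DEST_DIR=$(mktemp -d)
DEST="$DEST_DIR/ssh_key"

# Simulate having the key material at hand (in practice, copy the real key).
echo "-----BEGIN OPENSSH PRIVATE KEY-----" > /tmp/backup_key.$$

# install(1) copies the file and sets the mode in one step.
install -m 600 /tmp/backup_key.$$ "$DEST"

stat -c '%a' "$DEST"
```

Using install(1) instead of cp followed by chmod avoids a window where the key exists with overly permissive modes.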
Execute the restore job:
root # freezer-agent --config ./restore_config.ini

You now have the Swift rings in /tmp/swift_builder_dir_restore/.

If the SWF-PRX[0] is already deployed, copy the contents of the restored directory (/tmp/swift_builder_dir_restore/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/) to /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/ on the SWF-PRX[0]. Then from the Cloud Lifecycle Manager run:

ardana > sudo cp -pr /tmp/swift_builder_dir_restore/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/* \
  /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/

For example:

ardana > sudo cp -pr /tmp/swift_builder_dir_restore/entry-scale-kvm/control-plane-1/builder_dir/* \
  /etc/swiftlm/entry-scale-kvm/control-plane-1/builder_dir/
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml

If the SWF-ACC[0] is not deployed, from the Cloud Lifecycle Manager run these playbooks:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts guard-deployment.yml
ardana > ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <SWF-ACC[0]-hostname>

Copy the contents of the restored directory (/tmp/swift_builder_dir_restore/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/) to /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/ on the SWF-ACC[0]. You will have to create the directories: /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/

ardana > sudo cp -pr /tmp/swift_builder_dir_restore/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/* \
  /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/

For example:

ardana > sudo cp -pr /tmp/swift_builder_dir_restore/entry-scale-kvm/control-plane-1/builder_dir/* \
  /etc/swiftlm/entry-scale-kvm/control-plane-1/builder_dir/

From the Cloud Lifecycle Manager, run the ardana-deploy.yml playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-deploy.yml
13.2.3 Unplanned Compute Maintenance #
Unplanned maintenance tasks including recovering compute nodes.
13.2.3.1 Recovering a Compute Node #
If one or more of your compute nodes has experienced an issue such as power loss or hardware failure, you need to perform disaster recovery. The following scenarios describe how to resolve common failures and repair your cloud.
Typical scenarios in which you will need to recover a compute node include the following:
The node has failed, either because it has shut down, has a hardware failure, or for another reason.
The node is working, but the nova-compute process is not responding; instances are running but you cannot manage them (for example, to delete or reboot them, or to attach/detach volumes).

The node is fully operational, but monitoring indicates a potential issue (such as disk errors) that requires downtime to fix.
13.2.3.1.1 What to do if your compute node is down #
Compute node has power but is not powered on
If your compute node has power but is not powered on, use these steps to restore the node:
Log in to the Cloud Lifecycle Manager.
Obtain the name for your compute node in Cobbler:
sudo cobbler system list
Power the node back up with this playbook, specifying the node name from Cobbler:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost bm-power-up.yml -e nodelist=<node name>
Compute node is powered on but services are not running on it
If your compute node is powered on but you are unsure if services are running, you can use these steps to ensure that they are running:
Log in to the Cloud Lifecycle Manager.
Confirm the status of the compute service on the node with this playbook:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts nova-status.yml --limit <hostname>
You can start the compute service on the node with this playbook:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts nova-start.yml --limit <hostname>
13.2.3.1.2 Scenarios involving disk failures on your compute nodes #
Your compute nodes should have a minimum of two disks, one that is used for
the operating system and one that is used as the data disk. These are
defined during the installation of your cloud, in the
~/openstack/my_cloud/definition/data/disks_compute.yml file
on the Cloud Lifecycle Manager. The data disk(s) are where the
nova-compute service lives. Recovery scenarios will
depend on whether one or the other, or both, of these disks experienced
failures.
If your operating system disk failed but the data disk(s) are okay
If you have had issues with the physical volume that hosts your operating system, ensure that the physical volume is restored, and then use the following steps to restore the operating system:
Log in to the Cloud Lifecycle Manager.
Source the administrator credentials:
source ~/service.osrc
Obtain the hostname for your compute node, which you will use in subsequent commands when <hostname> is requested:

nova host-list | grep compute
Obtain the status of the nova-compute service on that node:

nova service-list --host <hostname>
You will likely want to disable provisioning on that node to ensure that nova-scheduler does not attempt to place any additional instances on the node while you are repairing it:

nova service-disable --reason "node is being rebuilt" <hostname> nova-compute
Obtain the status of the instances on the compute node:
nova list --host <hostname> --all-tenants
Before continuing, you should either evacuate all of the instances off your compute node or shut them down. If the instances are booted from volumes, you can use the nova evacuate or nova host-evacuate commands to do this. See Section 13.1.3.3, “Live Migration of Instances” for more details on how to do this.

If your instances are not booted from volumes, you will need to stop the instances using the nova stop command. Because the nova-compute service is not running on the node, you will not see the instance status change, but the Task State for the instance should change to powering-off.

nova stop <instance_uuid>
Verify the status of each instance with these commands, confirming that the Task State is powering-off:

nova list --host <hostname> --all-tenants
nova show <instance_uuid>
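The stop-and-verify step above is a poll-until-state loop. It can be sketched as a small generic helper; the nova commands are the ones shown above, but the function name and parameters below are illustrative, not part of the product:

```shell
# Sketch: poll a command until its output contains an expected substring.
# Usage: wait_for_state "<command>" "<expected substring>" <max tries> <sleep seconds>
wait_for_state() {
  cmd=$1; expected=$2; tries=$3; delay=$4
  i=0
  while [ "$i" -lt "$tries" ]; do
    out=$(eval "$cmd")
    case "$out" in
      *"$expected"*) echo "reached: $expected"; return 0 ;;
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timed out waiting for: $expected" >&2
  return 1
}

# In practice you would run something like:
#   wait_for_state "nova show <instance_uuid>" "powering-off" 30 10
# Demonstration with a stub command in place of the nova CLI:
wait_for_state "echo powering-off" "powering-off" 3 0
# -> reached: powering-off
```

Polling with a bounded retry count avoids hanging forever if an instance never reaches the expected state, which matters here because nova-compute is down and state transitions can stall.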
At this point you should be ready with a functioning hard disk in the node that you can use for the operating system. Follow these steps:
Obtain the name for your compute node in Cobbler, which you will use in subsequent commands when <node_name> is requested:

sudo cobbler system list
Reimage the compute node with this playbook:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<node name>
Once reimaging is complete, use the following playbook to configure the operating system and start up services:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts site.yml --limit <hostname>
You should then ensure any instances on the recovered node are in an ACTIVE state. If they are not, use the nova start command to bring them to the ACTIVE state:

nova list --host <hostname> --all-tenants
nova start <instance_uuid>
Reenable provisioning:
nova service-enable <hostname> nova-compute
Start any instances that you had stopped previously:
nova list --host <hostname> --all-tenants
nova start <instance_uuid>
If your data disk(s) failed but the operating system disk is okay OR if all drives failed
In this scenario your instances on the node are lost. First, follow steps 1 to 5 and 8 to 9 in the previous scenario.
After that is complete, use the nova rebuild command to
respawn your instances, which will also ensure that they receive the same IP
address:
nova list --host <hostname> --all-tenants
nova rebuild <instance_uuid>
13.2.4 Unplanned Storage Maintenance #
Unplanned maintenance tasks for storage nodes.
13.2.4.1 Unplanned Swift Storage Maintenance #
Unplanned maintenance tasks for Swift storage nodes.
13.2.4.1.1 Recovering a Swift Node #
If one or more of your Swift Object or PAC nodes has experienced an issue, such as power loss or hardware failure, and you need to perform disaster recovery, the following scenarios describe how to resolve them and repair your cloud.
Typical scenarios in which you will need to repair a Swift object or PAC node include:
The node has either shut down or been rebooted.
The entire node has failed and needs to be replaced.
A disk drive has failed and must be replaced.
13.2.4.1.1.1 What to do if your Swift host has shut down or rebooted #
If your Swift host has power but is not powered on, use these steps from the Cloud Lifecycle Manager to restore it:
Log in to the Cloud Lifecycle Manager.
Obtain the name for your Swift host in Cobbler:
sudo cobbler system list
Power the node back up with this playbook, specifying the node name from Cobbler:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost bm-power-up.yml -e nodelist=<node name>
Once the node is booted up, Swift should start automatically. You can verify this with this playbook:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-status.yml
Any alarms that have triggered due to the host going down should clear within 10 minutes. See Section 15.1.1, “Alarm Resolution Procedures” if further assistance is needed with the alarms.
13.2.4.1.1.2 How to replace your Swift node #
If your Swift node has irreparable damage and you need to replace the entire node in your environment, see Section 13.1.5.1.5, “Replacing a Swift Node” for details on how to do this.
13.2.4.1.1.3 How to replace a hard disk in your Swift node #
If you need to do a hard drive replacement in your Swift node, see Section 13.1.5.1.6, “Replacing Drives in a Swift Node” for details on how to do this.
13.3 Cloud Lifecycle Manager Maintenance Update Procedure #
Ensure that the update repositories have been properly set up on all nodes. The easiest way to provide the required repositories on the Cloud Lifecycle Manager Server is to set up an SMT server as described in Book “Installing with Cloud Lifecycle Manager”, Chapter 4 “Installing and Setting Up an SMT Server on the Cloud Lifecycle Manager server (Optional)”. Alternatives to setting up an SMT server are described in Book “Installing with Cloud Lifecycle Manager”, Chapter 5 “Software Repository Setup”.
Read the Release Notes for the security and maintenance updates that will be installed.
Have a backup strategy in place. For further information, see Chapter 14, Backup and Restore.
Ensure that you have a known starting state by resolving any unexpected alarms.
Determine if you need to reboot your cloud after updating the software. Rebooting is highly recommended to ensure that all affected services are restarted. Reboot may be required after installing Linux kernel updates, but it can be skipped if the impact on running services is non-existent or well understood.
Review steps in Section 13.1.4.1, “Adding a Neutron Network Node” and Section 13.1.1.2, “Rolling Reboot of the Cloud” to minimize the impact on existing workloads. These steps are critical when the Neutron services are not provided via external SDN controllers.
Before the update, prepare your workloads by consolidating all of your instances onto one or more Compute Nodes. After the update is complete on the evacuated Compute Nodes, reboot them and move the instances from the remaining Compute Nodes to the newly rebooted ones. Then update the remaining Compute Nodes.
13.3.1 Performing the Update #
Before you proceed, get the status of all your services:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-status.yml
If the status check returns an error for a specific service, run the SERVICE-reconfigure.yml playbook. Then run the SERVICE-status.yml playbook to check that the issue has been resolved.
Update and reboot all nodes in the cloud one by one. Start with the deployer node, then follow the order recommended in Section 13.1.1.2, “Rolling Reboot of the Cloud”.
The described workflow also covers cases in which the deployer node is also provisioned as an active cloud node.
To minimize the impact on the existing workloads, the node should first be prepared for an update and a subsequent reboot by following the steps leading up to stopping services listed in Section 13.1.1.2, “Rolling Reboot of the Cloud”, such as migrating singleton agents on Control Nodes and evacuating Compute Nodes. Do not stop services running on the node, as they need to be running during the update.
Install all available security and maintenance updates on the deployer using the zypper patch command.

Initialize the Cloud Lifecycle Manager and prepare the update playbooks:

Run the ardana-init initialization script to update the deployer.

Redeploy Cobbler:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost cobbler-deploy.yml

Run the configuration processor:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

Update your deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
Installation and management of updates can be automated with the following playbooks:

ardana-update-pkgs.yml
ardana-update.yml
ardana-update-status.yml
ardana-reboot.yml

Important: Some playbooks are being deprecated. To determine how your system is affected, run:

ardana > rpm -qa ardana-ansible

The result will be ardana-ansible-8.0+git. followed by a version number string.

If the first part of the version number string is greater than or equal to 1553878455 (for example, ardana-ansible-8.0+git.1553878455.7439e04), use the newly introduced parameters:

pending_clm_update
pending_service_update
pending_system_reboot

If the first part of the version number string is less than 1553878455 (for example, ardana-ansible-8.0+git.1552032267.5298d45), use the following parameters:

update_status_var
update_status_set
update_status_reset
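The version comparison described above can be scripted. A minimal sketch — the package version strings are the ones shown in the examples, but the helper function name is illustrative, not part of the product:

```shell
# Sketch: decide which status-variable parameters to use based on the
# ardana-ansible package version string, e.g. "ardana-ansible-8.0+git.1553878455.7439e04".
ardana_update_params() {
  pkg=$1
  # Extract the first numeric component after "+git." (the timestamp part).
  stamp=$(echo "$pkg" | sed -n 's/.*+git\.\([0-9]*\)\..*/\1/p')
  if [ -n "$stamp" ] && [ "$stamp" -ge 1553878455 ]; then
    echo "pending_clm_update pending_service_update pending_system_reboot"
  else
    echo "update_status_var update_status_set update_status_reset"
  fi
}

# In practice, feed in the output of: rpm -qa ardana-ansible
ardana_update_params "ardana-ansible-8.0+git.1553878455.7439e04"
# -> pending_clm_update pending_service_update pending_system_reboot
ardana_update_params "ardana-ansible-8.0+git.1552032267.5298d45"
# -> update_status_var update_status_set update_status_reset
```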
Confirm version changes by running hostnamectl before and after running the ardana-update-pkgs.yml playbook on each node:

ardana > hostnamectl

Notice that the Boot ID: and Kernel: information has changed.

By default, the ardana-update-pkgs.yml playbook will install patches and updates that do not require a system reboot. Patches and updates that do require a system reboot will be installed later in this process.

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-update-pkgs.yml \
  --limit TARGET_NODE_NAME

There may be a delay in the playbook output at the following task while updates are pulled from the deployer.
TASK: [ardana-upgrade-tools | pkg-update | Download and install package updates] ***
After running the ardana-update-pkgs.yml playbook to install patches and updates not requiring reboot, check the status of remaining tasks:

ardana > ansible-playbook -i hosts/verb_hosts ardana-update-status.yml \
  --limit TARGET_NODE_NAME

To install patches that require reboot, run the ardana-update-pkgs.yml playbook with the parameter -e zypper_update_include_reboot_patches=true:

ardana > ansible-playbook -i hosts/verb_hosts ardana-update-pkgs.yml \
  --limit TARGET_NODE_NAME \
  -e zypper_update_include_reboot_patches=true

If the output of ardana-update-pkgs.yml indicates that a reboot is required, run ardana-reboot.yml after completing the ardana-update.yml step below. Running ardana-reboot.yml will cause cloud service interruption.

Note: To update a single package (for example, to apply a PTF on a single node or on all nodes), run zypper update PACKAGE. To install all package updates, use zypper update.

Update services:
ardana > ansible-playbook -i hosts/verb_hosts ardana-update.yml \
  --limit TARGET_NODE_NAME

If indicated by the ardana-update-status.yml playbook, reboot the node. There may also be a warning to reboot after running ardana-update-pkgs.yml. This check can be overridden by setting the SKIP_UPDATE_REBOOT_CHECKS environment variable or the skip_update_reboot_checks Ansible variable:

ansible-playbook -i hosts/verb_hosts ardana-reboot.yml \
  --limit TARGET_NODE_NAME
To recheck pending system reboot status at a later time, run the following commands:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-update-status.yml \
  --limit ardana-cp1-c1-m2

The pending system reboot status can be reset by running:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-update-status.yml \
  --limit ardana-cp1-c1-m2 \
  -e pending_system_reboot=off

Multiple servers can be patched at the same time with ardana-update-pkgs.yml by setting the option -e skip_single_host_checks=true.

Warning: When patching multiple servers at the same time, take care not to compromise HA capability by updating an entire cluster (controller, database, monitor, logging) at the same time.
If multiple nodes are specified on the command line (with --limit), services on those servers will experience outages as the packages are shut down and updated. On Compute Nodes (or groups of Compute Nodes), migrate the workload off before updating. The same applies to Control Nodes: move singleton services off of the control plane node that will be updated.

Important: Do not reboot all of your controllers at the same time.
When the node comes up after the reboot, run the spark-start.yml playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts spark-start.yml

Verify that Spark is running on all Control Nodes:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts spark-status.yml

After all nodes have been updated, check the status of all services:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-status.yml
13.3.2 Summary of the Update Playbooks #
- ardana-update-pkgs.yml
Top-level playbook that automates the installation of package updates on a single node. It also works for multiple nodes if the single-node restriction is overridden by setting the SKIP_SINGLE_HOST_CHECKS environment variable or by running ardana-update-pkgs.yml -e skip_single_host_checks=true.

Provide the following -e options to modify the default behavior:

zypper_update_method (default: patch)
- patch installs all patches for the system. Patches are intended for specific bug and security fixes.
- update installs all packages that have a higher version number than the installed packages.
- dist-upgrade replaces each installed package with the version from the repository and deletes packages not available in the repositories.

zypper_update_repositories (default: all) restricts the list of repositories used.

zypper_update_gpg_checks (default: true) enables GPG checks. If set to true, checks whether packages are correctly signed.

zypper_update_licenses_agree (default: false) automatically agrees with licenses. If set to true, zypper automatically accepts third-party licenses.

zypper_update_include_reboot_patches (default: false) includes patches that require reboot. Setting this to true installs patches that require a reboot (such as kernel or glibc updates).
- ardana-update.yml

Top-level playbook that automates the update of all services. Runs on all nodes by default, or can be limited to a single node by adding --limit nodename.

- ardana-reboot.yml
Top-level playbook that automates the steps required to reboot a node. It includes pre-boot and post-boot phases, which can be extended to include additional checks.
- ardana-update-status.yml
This playbook can be used to check or reset the update-related status variables maintained by the update playbooks. The main reason for having this mechanism is to allow the update status to be checked at any point during the update procedure. It is also used heavily by the automation scripts to orchestrate installing maintenance updates on multiple nodes.
13.4 Cloud Lifecycle Manager Program Temporary Fix (PTF) Deployment #
Occasionally, in order to fix a given issue, SUSE will provide a set of packages known as a Program Temporary Fix (PTF). Such a PTF is fully supported by SUSE until the Maintenance Update containing a permanent fix has been released via the regular Update repositories. Customers running PTF fixes will be notified through the related Service Request when a permanent patch for a PTF has been released.
Use the following steps to deploy a PTF:
When SUSE has developed a PTF, you will receive a URL for that PTF. You should download the packages from the location provided by SUSE Support to a temporary location on the Cloud Lifecycle Manager. For example:
ardana > tmpdir=`mktemp -d`
ardana > cd $tmpdir
ardana > sudo wget --no-directories --recursive --reject "index.html*" \
  --user=USER_NAME \
  --password=PASSWORD \
  --no-parent https://ptf.suse.com/54321aaaa...dddd12345/cloud8/042171/x86_64/20181030

Remove any old data from the PTF repository, such as a listing for a PTF repository from a migration or when previous product patches were installed.
ardana > sudo rm -rf /srv/www/suse-12.3/x86_64/repos/PTF/*

Move packages from the temporary download location to the PTF repository directory on the CLM Server. This example is for a Neutron PTF.
ardana > sudo mkdir -p /srv/www/suse-12.3/x86_64/repos/PTF/
ardana > sudo mv $tmpdir/* /srv/www/suse-12.3/x86_64/repos/PTF/
ardana > sudo chown --recursive root:root /srv/www/suse-12.3/x86_64/repos/PTF/*
ardana > rmdir $tmpdir

Create or update the repository metadata:
ardana > sudo /usr/local/sbin/createrepo-cloud-ptf
Spawning worker 0 with 2 pkgs
Workers Finished
Saving Primary metadata
Saving file lists metadata
Saving other metadata

Refresh the PTF repository before installing package updates on the Cloud Lifecycle Manager:
ardana > sudo zypper refresh --force --repo PTF
Forcing raw metadata refresh
Retrieving repository 'PTF' metadata ..........................................[done]
Forcing building of repository cache
Building repository 'PTF' cache ..........................................[done]
Specified repositories have been refreshed.

The PTF shows as available on the deployer.
ardana > sudo zypper se --repo PTF
Loading repository data...
Reading installed packages...
S | Name                          | Summary                                  | Type
--+-------------------------------+------------------------------------------+--------
  | python-neutronclient          | Python API and CLI for OpenStack Neutron | package
i | venv-openstack-neutron-x86_64 | Python virtualenv for OpenStack Neutron  | package

Install the PTF venv packages on the Cloud Lifecycle Manager:
ardana > sudo zypper dup --from PTF
Refreshing service
Loading repository data...
Reading installed packages...
Computing distribution upgrade...

The following package is going to be upgraded:
  venv-openstack-neutron-x86_64

The following package has no support information from its vendor:
  venv-openstack-neutron-x86_64

1 package to upgrade. Overall download size: 64.2 MiB. Already cached: 0 B.
After the operation, additional 6.9 KiB will be used.
Continue? [y/n/...? shows all options] (y): y
Retrieving package venv-openstack-neutron-x86_64-11.0.2-13.8.1.042171.0.PTF.102473.noarch
  (1/1), 64.2 MiB ( 64.6 MiB unpacked)
Retrieving: venv-openstack-neutron-x86_64-11.0.2-13.8.1.042171.0.PTF.102473.noarch.rpm ....[done]
Checking for file conflicts: ..............................................................[done]
(1/1) Installing: venv-openstack-neutron-x86_64-11.0.2-13.8.1.042171.0.PTF.102473.noarch ....[done]
Additional rpm output:
warning: /var/cache/zypp/packages/PTF/noarch/venv-openstack-neutron-x86_64-11.0.2-13.8.1.042171.0.PTF.102473.noarch.rpm: Header V3 DSA/SHA1 Signature, key ID b37b98a9: NOKEY

Validate that the venv tarball has been installed into the deployment directory. (Note: the packages file under that directory lists the registered tarballs that will be used for the services, which should align with the installed venv RPM.)
ardana > ls -la /opt/ardana_packager/ardana-8/sles_venv/x86_64
total 898952
drwxr-xr-x 2 root root      4096 Oct 30 16:10 .
...
-rw-r--r-- 1 root root  67688160 Oct 30 12:44 neutron-20181030T124310Z.tgz  <<<
-rw-r--r-- 1 root root  64674087 Aug 14 16:14 nova-20180814T161306Z.tgz
-rw-r--r-- 1 root root  45378897 Aug 14 16:09 octavia-20180814T160839Z.tgz
-rw-r--r-- 1 root root      1879 Oct 30 16:10 packages
-rw-r--r-- 1 root root  27186008 Apr 26  2018 swift-20180426T230541Z.tgz

Install the non-venv PTF packages on the Compute Node:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-update-pkgs.yml \
  --extra-vars '{"zypper_update_method": "update", "zypper_update_repositories": ["PTF"]}' \
  --limit comp0001-mgmt

When it has finished, you can see that the upgraded package has been installed on comp0001-mgmt.

ardana > sudo zypper se --detail python-neutronclient
Loading repository data...
Reading installed packages...
S | Name                 | Type    | Version                         | Arch   | Repository
--+----------------------+---------+---------------------------------+--------+--------------------------------------
i | python-neutronclient | package | 6.5.1-4.361.042171.0.PTF.102473 | noarch | PTF
  | python-neutronclient | package | 6.5.0-4.361                     | noarch | SUSE-OPENSTACK-CLOUD-x86_64-GM-DVD1

Running the ardana update playbook will distribute the PTF venv packages to the cloud servers. Then you can find them loaded in the virtual environment directory with the other venvs.
The Compute Node before running the update playbook:
ardana > ls -la /opt/stack/venv
total 24
drwxr-xr-x  9 root root 4096 Jul 18 15:47 neutron-20180718T154642Z
drwxr-xr-x  9 root root 4096 Aug 14 16:13 neutron-20180814T161306Z
drwxr-xr-x 10 root root 4096 May 28 09:30 nova-20180528T092954Z
drwxr-xr-x 10 root root 4096 Aug 14 16:13 nova-20180814T161306Z

Run the update:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-update.yml --limit comp0001-mgmt

When it has finished, you can see that an additional virtual environment has been installed.
ardana > ls -la /opt/stack/venv
total 28
drwxr-xr-x  9 root root 4096 Jul 18 15:47 neutron-20180718T154642Z
drwxr-xr-x  9 root root 4096 Aug 14 16:13 neutron-20180814T161306Z
drwxr-xr-x  9 root root 4096 Oct 30 12:43 neutron-20181030T124310Z  <<< New venv installed
drwxr-xr-x 10 root root 4096 May 28 09:30 nova-20180528T092954Z
drwxr-xr-x 10 root root 4096 Aug 14 16:13 nova-20180814T161306Z

The PTF may also have RPM package updates in addition to venv updates. To complete the update, follow the instructions at Section 13.3.1, “Performing the Update”.
13.5 Periodic OpenStack Maintenance Tasks #
heat-manage helps manage Heat-specific database operations. The associated database should be periodically purged to save space. Set up the following as a cron job at /etc/cron.weekly/local-cleanup-heat on the servers where the Heat service is running, with the following content:

#!/bin/bash
su heat -s /bin/bash -c "/usr/bin/heat-manage purge_deleted -g days 14" || :
The nova-manage db archive_deleted_rows command moves deleted rows from production tables to shadow tables. Including --until-complete makes the command run continuously until all deleted rows are archived. It is recommended to set up this task as /etc/cron.weekly/local-cleanup-nova on the servers where the Nova service is running, with the following content:

#!/bin/bash
su nova -s /bin/bash -c "/usr/bin/nova-manage db archive_deleted_rows --until-complete" || :